Fully connected networks are ubiquitous in recommendation systems and NLP tasks. State-of-the-art models like Facebook’s DLRM, Microsoft’s DSSM, and Amazon’s DSSM are all giant fully connected neural networks.
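To make the scale concrete, here is a minimal sketch of how fully connected parameter counts are computed. The layer widths below are assumptions for illustration only (the real DLRM/DSSM configurations differ); the point is that a dense layer from m inputs to n outputs holds m*n weights plus n biases, so a wide input layer alone reaches hundreds of millions of parameters.

```python
def fc_param_count(layer_sizes):
    """Total parameters of a fully connected network given its layer widths:
    each consecutive pair (m, n) contributes m*n weights plus n biases."""
    return sum(m * n + n for m, n in zip(layer_sizes, layer_sizes[1:]))

# Assumed widths: a 100k-dimensional sparse input, two hidden layers of 2000,
# and a single output -- already roughly a 200M-parameter model.
sizes = [100_000, 2_000, 2_000, 1]
print(fc_param_count(sizes))  # 204006001, i.e. ~200M parameters
```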
Training a 200-Million-Parameter Model
The figure for the smaller 200-million-parameter network shows that BOLT is faster than TensorFlow on every CPU we tested, whether the chip is Intel, AMD, or Apple M1.
Training a 1.6-Billion-Parameter Model
The BOLT engine, on the other hand, effortlessly scales to a large batch size of 2048 with no change in model memory; we can even train with a batch size of 10,000. Training the same model with TensorFlow on CPUs is around 5x slower.
A billion-parameter model with a batch size of 2048 requires a fleet of four A100 GPUs, each costing more than $10K. Notably, our BOLT engine on a refurbished 24-core Intel V3 machine that costs less than $1,200 can train this giant model faster than four top-of-the-line A100s.
Note that a top-of-the-line A100 GPU with 48 GB of memory can barely accommodate a batch size of 256 for our 1.6B-parameter network. To run a batch size of 2048, we need to distribute the model and training over four A100 GPUs (on a machine that costs around $70,000); even two A100s cannot accommodate the training.
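A back-of-the-envelope estimate shows why GPU memory, not compute, is the bottleneck here. The sketch below assumes fp32 training with Adam-style optimizer state and an illustrative 400k → 2k → 400k architecture (the text does not give the real layer widths; these are chosen only because they yield roughly 1.6B weights). Per-parameter state is fixed regardless of batch size, while activation memory grows linearly with the batch:

```python
# Assumed layer widths for illustration: ~1.6B weights total.
IN, HID, OUT = 400_000, 2_000, 400_000
params = IN * HID + HID + HID * OUT + OUT

# Fixed per-parameter state: weights + gradients + two Adam moments, fp32.
fixed_bytes = params * 4 * 4
# Batch-dependent state: activations and activation gradients per sample.
act_bytes_per_sample = (IN + HID + OUT) * 4 * 2

def gib(n_bytes):
    return n_bytes / 2**30

print(f"params: {params / 1e9:.2f}B, fixed state: {gib(fixed_bytes):.1f} GiB")
for batch in (256, 2048):
    print(f"batch {batch:5d}: +{gib(batch * act_bytes_per_sample):.1f} GiB activations")
```

Even under these simplified assumptions (real frameworks add workspace buffers and allocator overhead on top), the fixed optimizer state alone approaches a single A100's capacity, and the activation memory at batch 2048 pushes the total well past it, which is consistent with needing the model split across several GPUs.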