Bolt

Detailed Comparisons

Fully connected networks are ubiquitous in recommendation systems and NLP tasks. State-of-the-art models like Facebook's DLRM, Microsoft's DSSM, and Amazon's DSSM are all giant fully connected neural networks.

To benchmark BOLT, we trained two networks, one with 200 million parameters and the other with 1.6 billion parameters. The aforementioned industry-scale recommendation models have 200 to 500 million parameters. Notably, even BERT-Large is in the same ballpark, with 340 million parameters to train. For a task representative of several industry-scale applications, we chose the Amazon-670K Kaggle dataset. Of our two BOLT networks, one used a 256-dimensional hidden embedding (the 200 million parameter model) and the other used a 2000-dimensional hidden layer (the 1.6 billion parameter model).
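As a rough sanity check on these parameter counts, here is a minimal sketch of the arithmetic, assuming the commonly cited Amazon-670K dimensions of roughly 136K input features and 670K output labels and a single fully connected hidden layer with biases; the exact feature and label counts are assumptions for illustration, not details of our benchmark setup.

```python
# Rough parameter-count sketch for a single-hidden-layer fully connected
# classifier on Amazon-670K. The input/label dimensions below are the
# commonly cited Amazon-670K sizes and are assumptions, not benchmark details.
INPUT_DIM = 135_909   # assumed number of input features
NUM_LABELS = 670_091  # assumed number of output labels

def param_count(hidden_dim: int) -> int:
    """Parameters of an input -> hidden -> output network with biases."""
    hidden = INPUT_DIM * hidden_dim + hidden_dim
    output = hidden_dim * NUM_LABELS + NUM_LABELS
    return hidden + output

print(f"256-dim hidden:  {param_count(256) / 1e6:.0f}M parameters")   # ~207M
print(f"2000-dim hidden: {param_count(2000) / 1e9:.2f}B parameters")  # ~1.61B
```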
[Figure: Training the 200 million parameter model]

The figure for the smaller 200-million-parameter network shows that BOLT is faster than TensorFlow on every CPU we tested, whether the chip is Intel, AMD, or Apple M1.

Our other article explained that larger models need larger batch sizes for both speed and generalization. However, GPUs have limited memory and cannot scale to the needs of a billion-parameter network. We notice that a top-of-the-line A100 GPU with 40 GB of memory can barely accommodate a batch size of 256 for our 1.6 billion parameter network. On this giant model, a batch size of 2048, which gives better accuracy than a batch size of 256, runs out of memory on a single A100 GPU. Even distributing the model across two A100s is not enough; we need at least four A100 GPUs to train the 1.6 billion parameter model with a 2048 batch size.
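To see why the larger batch size runs out of memory, here is a back-of-the-envelope sketch, assuming fp32 training with an Adam-style optimizer and counting only the dominant buffers (weights, gradients, optimizer states, and the output-layer logits with their gradients); the numbers are illustrative estimates, not measurements from our benchmark.

```python
# Back-of-the-envelope GPU memory estimate for the 1.6B-parameter model.
# Assumptions (not measured): fp32 everywhere, an Adam-style optimizer
# (two extra states per weight), and activation memory dominated by the
# ~670K-dimensional output logits plus their gradients.
BYTES = 4                     # fp32
PARAMS = 1.6e9                # model size
NUM_LABELS = 670_091          # assumed Amazon-670K label count

def estimate_gb(batch_size: int) -> float:
    weights    = PARAMS * BYTES
    gradients  = PARAMS * BYTES
    adam_state = 2 * PARAMS * BYTES                   # first and second moments
    logits     = 2 * batch_size * NUM_LABELS * BYTES  # logits + their gradients
    return (weights + gradients + adam_state + logits) / 1e9

print(f"batch  256: ~{estimate_gb(256):.0f} GB")   # ~27 GB -> barely fits on one 40 GB A100
print(f"batch 2048: ~{estimate_gb(2048):.0f} GB")  # ~37 GB + framework overhead -> out of memory
```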
[Figure: Training the 1.6 billion parameter model]

On the other hand, the BOLT engine effortlessly scales up to a large batch size of 2048 with no change in model memory; we can even train with a batch size of 10,000. Training this model using TensorFlow on CPUs is around 5x slower.

A billion-parameter model with a batch size of 2048 requires a fleet of four A100 GPUs, each costing more than $10K, on a machine that costs around $70,000; even two A100s cannot accommodate the training. Notably, our BOLT engine on a 24-core refurbished Intel v3 machine that costs less than $1,200 can train this giant model faster than four top-of-the-line A100s.
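For a sense of scale, here is a quick sketch of the cost arithmetic, using the approximate prices quoted above; both figures are rough estimates rather than exact quotes.

```python
# Rough hardware-cost comparison using the approximate prices quoted above.
gpu_machine_cost = 70_000   # ~$70K machine with four A100 GPUs
cpu_machine_cost = 1_200    # ~$1.2K refurbished 24-core Intel v3 machine

ratio = gpu_machine_cost / cpu_machine_cost
print(f"The GPU setup costs roughly {ratio:.0f}x more")  # ~58x
```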

If you want to give BOLT a try, please fill out this form.