Case Study: GPUs or Old CPUs
Training a 1.6 Billion Parameter Model on a CPU
The AI research community has recently seen exciting developments in training larger and larger models. OpenAI released GPT-3, a model with 175 billion parameters, to much fanfare, and Google Brain recently trained a model with 1 trillion parameters. Such huge models, however, require ever more computation and are trained on huge clusters of GPUs for months at a time.


Enter BOLT
Here is an image of the blade server we used to train a model with billions of parameters. Yes, that's a refurbished blade cluster with 12 old (v3) CPUs. Armed with our BOLT Engine, we can train recommendation models with billions of parameters, with better performance than a top-of-the-line A100 GPU, all on a single CPU node that costs less than $1,200. See our detailed performance benchmarking of the BOLT Engine against TensorFlow on CPU and GPU here.
Simple fully connected networks are the models of choice in recommendation systems. For example, the ubiquitous Siamese network and Amazon's DSSM model for product search are both fully connected networks. Even the classic DLRM model has been shown to be equivalent to a fully connected network with sparse features.
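To make that architecture concrete, here is a minimal sketch of a Siamese/DSSM-style fully connected model in plain Keras. This is not ThirdAI's code; the feature dimension, layer widths, and the binary relevance label are illustrative assumptions.

```python
import tensorflow as tf

FEATURE_DIM = 30_000   # assumed hashed bag-of-words feature dimension
EMBED_DIM = 128        # assumed embedding size

# Shared fully connected tower (Siamese: the same weights encode query and product).
tower = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(EMBED_DIM),
])

query_in = tf.keras.Input(shape=(FEATURE_DIM,), name="query")
item_in = tf.keras.Input(shape=(FEATURE_DIM,), name="item")

# Cosine similarity between the two encodings is the relevance score (used as a logit).
score = tf.keras.layers.Dot(axes=1, normalize=True)([tower(query_in), tower(item_in)])

model = tf.keras.Model([query_in, item_in], score)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),  # assumed click/purchase label
)
model.summary()
```

Everything here is ordinary dense matrix multiplies, which is exactly why the workload is dominated by memory and compute on the large layers rather than by any exotic operations.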
The Kaggle Amazon-670K Dataset
The memory bottleneck of GPUs is apparent on popular extreme classification problems such as the Kaggle Amazon-670K dataset, a product recommendation dataset. Most papers published on this dataset reach the premature conclusion that simple neural networks are not competitive. In this context, a simple neural network means a fully connected architecture with a single hidden layer of 128 units. Such a model yields just 32% precision@1 (the usual accuracy measure), which is merely on par with a simple nearest-neighbor search.
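For concreteness, that baseline looks roughly like the following in plain Keras. The feature and label counts are the approximate Amazon-670K dimensions, and treating precision@1 as top-1 accuracy against a single relevant label is a simplification for this multi-label dataset.

```python
import tensorflow as tf

INPUT_DIM = 135_909    # approx. number of sparse input features in Amazon-670K
NUM_LABELS = 670_091   # approx. number of labels (products) in Amazon-670K

# One hidden layer of 128 units, then logits over every product.
baseline = tf.keras.Sequential([
    tf.keras.Input(shape=(INPUT_DIM,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_LABELS),
])

baseline.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1, name="precision_at_1")],
)
print(f"{baseline.count_params():,} parameters")  # ~104M even for this "small" baseline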


Our BOLT Engine makes training large neural networks with large batch sizes easy
But using BOLT, we can train a 1.6-billion-parameter model with a large batch size of 2048 on an old CPU box. With BOLT on CPUs, training is significantly faster, and the model easily surpasses 40% precision@1. In 10 epochs, BOLT reaches 40.3%, with each epoch taking less than 2,200 seconds.
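As a rough sanity check on where a parameter count of that size comes from, consider a single hidden layer between the Amazon-670K input features and output labels. The hidden width of 2,000 below is our own illustrative assumption, not a figure reported in the benchmark.

```python
INPUT_DIM = 135_909    # approx. Amazon-670K feature dimension
NUM_LABELS = 670_091   # approx. Amazon-670K label count
HIDDEN = 2_000         # assumed hidden-layer width, chosen only for illustration

weights = INPUT_DIM * HIDDEN + HIDDEN * NUM_LABELS   # the two weight matrices
biases = HIDDEN + NUM_LABELS                         # the two bias vectors
print(f"{weights + biases:,} parameters")            # ~1.61 billion
```

In float32 those weights alone occupy over 6 GB, which is why a model of this size, trained densely with a batch size of 2048, strains even high-end GPU memory while fitting comfortably in ordinary CPU RAM.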


Training with the BOLT Engine is now a one-line code change in existing TensorFlow Python pipelines
The True Democratization of AI
