Case Study: GPUs Or Old CPUs

Training A 1.6 Billion Parameter Model On A CPU

GPT-3

The AI research community has recently seen exciting developments in training larger and larger models. OpenAI released GPT-3, a model with 175 billion parameters, to much fanfare, and Google Brain recently trained a model with 1 trillion parameters. Such huge models, however, require ever more computation and are trained on huge clusters of GPUs for months at a time.

In the case of GPT-3, for instance, the compute cost of training alone was reported to be 12 million dollars. These sorts of resources are infeasible for all but the largest companies, leaving research and development along this path out of reach for almost everyone.
Enter BOLT

Here is an image of the BLADE server we used to train a model with billions of parameters. Yes, that’s a refurbished BLADE cluster with 12 old (v3) CPUs. Armed with our BOLT Engine, we can train recommendation models with billions of parameters and better performance than a top-of-the-line A100 GPU, all on a single CPU node that costs less than $1200. See our detailed performance benchmarking of BOLT Engine against TensorFlow on CPU and GPU here.

Simple fully connected networks are the models of choice in recommendation systems. For example, the ubiquitous Siamese network and Amazon's DSSM model for product search are both fully connected networks. Even the classic DLRM model has been shown to be equivalent to a fully connected network with sparse features.
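To make this concrete, here is a minimal sketch of the kind of fully connected scorer these systems use, written in TensorFlow/Keras. The feature and label counts below are placeholders for illustration only, not the configuration of any of the models named above.

    import tensorflow as tf

    # Sizes below are illustrative placeholders, not any production configuration.
    NUM_FEATURES = 10_000   # dimensionality of the (multi-hot) input features
    NUM_ITEMS = 50_000      # number of candidate items / labels to score

    inputs = tf.keras.Input(shape=(NUM_FEATURES,))
    hidden = tf.keras.layers.Dense(128, activation="relu")(inputs)  # single hidden layer
    logits = tf.keras.layers.Dense(NUM_ITEMS)(hidden)               # one logit per item
    model = tf.keras.Model(inputs, logits)
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    )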

However, these fully connected networks come with a big problem: memory. In recommendation systems, the bigger the network, the better the performance. As it turns out, larger models also require larger batch sizes for better generalization and faster training. Both large models and large batches need substantial working memory, but GPU memory is extremely limited. GPU memory therefore becomes a bottleneck that holds back the exploration of larger models.
The Kaggle Amazon 670 Dataset

This memory bottleneck is apparent in the performance of GPUs on popular extreme classification problems, like the Kaggle Amazon-670K dataset, a product recommendation dataset. Most papers published on that dataset come to the premature conclusion that simple neural networks are not competitive. Here, a simple neural network means a fully connected architecture with a single hidden layer of size 128. That configuration yields just 32% precision@1 (the usual accuracy measure for this task), which is merely on par with a simple nearest-neighbor search.
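For readers unfamiliar with the metric, precision@1 measures how often the model's single highest-scoring prediction is one of the item's true labels. A toy sketch of how it can be computed (the scores and label sets below are made up):

    import numpy as np

    def precision_at_1(scores, true_labels):
        """scores: (num_samples, num_labels) array of model outputs.
        true_labels: list of sets of correct label indices per sample."""
        top1 = np.argmax(scores, axis=1)
        hits = sum(1 for pred, labels in zip(top1, true_labels) if pred in labels)
        return hits / len(true_labels)

    # Toy example: 3 samples, 4 labels.
    scores = np.array([[0.1, 0.9, 0.0, 0.0],
                       [0.8, 0.1, 0.1, 0.0],
                       [0.2, 0.2, 0.5, 0.1]])
    true_labels = [{1}, {2, 3}, {2}]
    print(precision_at_1(scores, true_labels))  # 2 of 3 top-1 predictions hit -> 0.667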

State-of-the-art models get more than 40% precision@1 on this Kaggle dataset. Note that this is a dataset with 670,000 possible outcomes, so 40% precision@1 is a reasonably good accuracy. While Amazon-670K is a limited public benchmark and industry-scale datasets have many more samples, performance on this dataset is likely representative of performance on real-world problems.
Our BOLT Engine makes training large neural networks with large batch sizes easy
It turns out we can quickly get 40% accuracy with a single-hidden-layer neural network if we increase the hidden layer size to 2000 and use a large batch size of 2048. Increasing the batch size is an intriguing approach that is often observed to improve generalization on many datasets, and this Kaggle dataset in particular requires a large batch size for good accuracy, as also noted in this MLSys 2021 paper. However, we could not possibly run such a large batch size and hidden layer configuration on an A100 GPU. The dataset has a feature dimension of 135,909, so with our 2000-unit hidden layer and 670,091-way output layer, this simple one-hidden-layer neural network has 2000 x (135,909 + 670,091) parameters, which is over 1.6 billion! Even if we scale back to a batch size of 256, we get an out-of-memory error on a top-of-the-line NVIDIA A100 GPU.
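Making the back-of-the-envelope arithmetic explicit (the memory figures below are rough estimates of ours, assuming 32-bit floats and an Adam-style optimizer, not measured numbers):

    # Parameter count of the one-hidden-layer network on Amazon-670K.
    input_dim, hidden_dim, output_dim = 135_909, 2_000, 670_091
    params = hidden_dim * (input_dim + output_dim)   # biases ignored
    print(f"{params:,}")                             # 1,612,000,000 -> over 1.6 billion

    # Rough training-memory footprint at float32 (4 bytes per value), assuming
    # Adam keeps two moment buffers per parameter in addition to the gradients.
    weights_gb = params * 4 / 1e9                    # ~6.4 GB for the weights alone
    train_state_gb = 4 * weights_gb                  # weights + grads + 2 moments: ~25.8 GB
    # Output-layer activations for a batch of 2048 (and as much again for their gradients):
    output_act_gb = 2048 * output_dim * 4 / 1e9      # ~5.5 GB
    print(weights_gb, train_state_gb, output_act_gb)

These estimates ignore framework overhead and intermediate buffers, but they make it easy to see how a 40 GB A100 runs out of room at these sizes.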

But using BOLT, we can train this 1.6 billion parameter model with a batch size of 2048 on an old CPU box. With BOLT on CPUs, training is significantly faster and the model easily surpasses 40% accuracy. In 10 epochs, BOLT reaches 40.3% accuracy, with each epoch taking less than 2200 seconds.

To run a batch size of 2048 without crashing, we need at least four A100 GPUs. Even after leveraging four A100 GPUs, which cost more than $50,000, the speed is still worse than the BOLT Engine running on an old desktop. More details here.
Training with the BOLT Engine is now a one-line code change in existing TensorFlow Python pipelines
Want to try our system out? It's just a one-line code change in an existing pipeline to swap the TensorFlow engine for the BOLT engine. Check out how to get access to and use BOLT in your projects here.
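We won't reproduce the exact BOLT import and constructor here (the link above covers that), but the shape of the change can be sketched with toy sizes: only the model-construction line moves, while the rest of the TensorFlow pipeline stays as it is.

    import numpy as np
    import tensorflow as tf

    # An existing TensorFlow pipeline (toy sizes so the snippet runs anywhere).
    inputs = tf.keras.Input(shape=(1_000,))
    hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
    logits = tf.keras.layers.Dense(500)(hidden)
    model = tf.keras.Model(inputs, logits)
    # ^ This is the single line you would replace with the BOLT model constructor.
    #   (We deliberately leave the real import and call to the linked documentation.)

    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    )

    # Everything downstream (data feeding, fit, evaluation) is unchanged.
    x = np.random.rand(32, 1_000).astype("float32")
    y = tf.one_hot(np.random.randint(0, 500, size=32), depth=500)
    model.fit(x, y, batch_size=32, epochs=1)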
The True Democratization of AI
ThirdAI’s BOLT enables everyone to explore larger deep learning models with substantial batch size. Such models can overwhelm even the state-of-the-art A100 GPUs found only in data centers and supercomputers, but with BOLT, they can easily be trained on any old commodity CPU with a supply of cheap RAM. We are making every computing unit count, from the dusty old boards professionals used to write off, to the hand-me-down desktop of an enthusiastic undergrad. That is indeed the true democratization of AI. BOLT is blazing a trail into the future called for in this New York Times article: simply improving FLOPS has hit a bottleneck; what AI needs now is more human creativity.