
Announcing BOLT: A Deep Learning Engine for Efficiently Training and Deploying Large Models on Commodity CPUs

New paper dives into the technical details of BOLT, the first deep learning framework for efficiently training and deploying massive models on CPUs
Artificial intelligence is in the midst of a revolution, with extremely large-scale models demonstrating astounding capabilities that have captured the attention and imagination of the general public. However, the current process of training and deploying these massive neural networks remains unsustainable and prohibitively expensive for all but the largest institutions, because it depends on costly specialized hardware such as GPUs.
To democratize large-scale deep learning for all in a sustainable manner, we at ThirdAI rejected the widely held assumption that specialized hardware accelerators are essential for training and serving large models. To achieve this vision, we have built a new deep learning framework from scratch, called BOLT, that enables developers to train billion-parameter models, as well as deploy them, on ordinary low-cost CPU machines.
Today, we are excited to announce the release of a new preprint that describes BOLT in technical detail, including 1) the design philosophy, 2) novel algorithmic features, and 3) experimental results on a variety of machine learning benchmarks. We hope that this paper will bring more awareness of the potential for achieving near state-of-the-art deep learning performance using only CPU machines while enjoying, in some cases, orders-of-magnitude speedups in training and inference time.

Paper Highlights

Novel Features for a Deep Learning Framework: We introduce several development features that, to our knowledge, are unique amongst production-grade deep learning frameworks. First and foremost is the concept of configurable sparsity, which allows developers to set the amount of computation to perform in a given layer (e.g. activating only 5% of the neurons in a forward pass). Our sparsity parameter lets users gracefully trade off training time against model quality.
We also introduce the notion of dynamic sparse inference for model deployment, which further reduces inference latency by activating only a subset of neurons for a given input. As shown in the table below (taken from the paper), we discovered that sparse inference can, perhaps surprisingly, achieve essentially the same accuracy as a dense TensorFlow model with an order of magnitude reduction in latency.
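To make the idea of configurable sparsity concrete, here is a minimal sketch in plain Python of a layer that computes only a chosen fraction of its neurons. It is not BOLT's API or implementation: the function names, the `sparsity` parameter, and the uniform random selection in `pick_active` are all assumptions made for illustration, and how the active neurons are actually chosen for a given input is covered in the paper rather than here.

```python
import random

def dense_forward(x, weights, biases):
    """Compute every neuron: one dot product per row of `weights`, then ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def sparse_forward(x, weights, biases, active):
    """Compute only the neurons listed in `active`; all others stay at zero."""
    out = [0.0] * len(weights)
    for i in active:
        out[i] = max(0.0, sum(w * v for w, v in zip(weights[i], x)) + biases[i])
    return out

def pick_active(num_neurons, sparsity, rng):
    """Hypothetical placeholder: a uniform random subset of the layer."""
    k = max(1, round(sparsity * num_neurons))
    return sorted(rng.sample(range(num_neurons), k))

rng = random.Random(0)
num_inputs, num_neurons, sparsity = 16, 200, 0.05
weights = [[rng.uniform(-1, 1) for _ in range(num_inputs)]
           for _ in range(num_neurons)]
biases = [rng.uniform(-1, 1) for _ in range(num_neurons)]
x = [rng.uniform(-1, 1) for _ in range(num_inputs)]

active = pick_active(num_neurons, sparsity, rng)
sparse_out = sparse_forward(x, weights, biases, active)
dense_out = dense_forward(x, weights, biases)

# The sparse pass performs len(active) dot products instead of num_neurons.
print(f"computed {len(active)} of {num_neurons} neurons")
```

With `sparsity = 0.05`, the sparse pass performs 5% of the dot products of the dense pass. Raising the value moves the layer back toward dense computation, which is the trade between cost and model quality described above.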

State-of-the-Art Performance on the Yelp-Chi Graph Learning Benchmark

Deep learning on graphs has recently emerged as an area of intense interest, with applications ranging from fraud detection to recommendations to social network analysis. In this paper, we focus on evaluating BOLT on the Non-Homophilous Graph Benchmarks, which consist of node classification tasks on large-scale graphs where neighboring vertices are not necessarily of the same type. This setting is generally harder than the homophilous case, especially since many popular graph learning methods rely on homophily as an inductive bias. We evaluate on three datasets from this benchmark and, as shown in the table below, achieve state-of-the-art performance on the Yelp-Chi dataset for fraud detection, outperforming LinkX, a very strong baseline.

Distributed Data-Parallel Training Optimized for CPUs

With many modern machine learning datasets at terabyte scale and beyond, we knew that BOLT needed to support distributed training. Distributed deep learning approaches very roughly fall into two categories. One paradigm is data-parallel training, where each machine loads a copy of the model, trains on a separate shard of the data, and communicates gradients between nodes to update parameters. The second approach is model-parallel training, where the layers of the model are partitioned amongst multiple machines. Since CPU machines have much larger memory budgets than GPUs, we realized that we could fit even billion-parameter models entirely in CPU memory. Thus, we can avoid model-parallel training, which often comes with greater engineering complexity, and focus on data-parallel training. We implemented our distributed BOLT library on top of Ray, an open-source framework for parallel computing.
Because our BOLT models are heavily optimized for speed, the cost of communicating gradients between machines became the main bottleneck during training. We therefore had to design new approaches for data-parallel training, including a novel gradient compression scheme, to achieve better scalability as we increased the number of nodes.
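The data-parallel pattern described above can be sketched in a few lines of plain Python. This is a single-process simulation under stated assumptions, not the distributed BOLT library and not Ray code: the "workers" are just list shards, `all_reduce_mean` stands in for network communication, and the model is a hypothetical one-variable linear regression chosen so the arithmetic is easy to follow.

```python
def shard(data, num_workers):
    """Split the dataset into equal contiguous shards, one per worker."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]

def local_gradient(w, b, batch):
    """Mean-squared-error gradient of y ~ w*x + b on one worker's shard."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def all_reduce_mean(grads):
    """Average per-worker gradients; stands in for communication between nodes."""
    n = len(grads)
    return tuple(sum(g[i] for g in grads) / n for i in range(len(grads[0])))

def train(data, num_workers, steps=300, lr=0.5):
    w, b = 0.0, 0.0  # every worker holds an identical replica of the model
    shards = shard(data, num_workers)
    for _ in range(steps):
        grads = [local_gradient(w, b, s) for s in shards]  # parallel in practice
        gw, gb = all_reduce_mean(grads)
        w, b = w - lr * gw, b - lr * gb  # same update applied on every replica
    return w, b

# Synthetic data from y = 3x + 1, 16 points split across 4 simulated workers.
data = [(x / 10, 3.0 * (x / 10) + 1.0) for x in range(-8, 8)]
w, b = train(data, num_workers=4)
print(w, b)
```

Because the shards are the same size, averaging the per-worker gradients gives the same update as a single machine computing the gradient over the full batch. That is why every replica can stay in sync by exchanging gradients only.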
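For readers unfamiliar with gradient compression, the sketch below shows one generic, widely known variant, top-k sparsification: only the largest-magnitude gradient entries are sent as (index, value) pairs. This post does not describe BOLT's own scheme, so this should be read as an illustration of the general idea only. The function names and the example gradient are made up for the sketch.

```python
def compress_top_k(grad, k):
    """Keep the k largest-magnitude entries as sorted (index, value) pairs."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return sorted((i, grad[i]) for i in idx)

def decompress(pairs, length):
    """Rebuild a dense gradient, filling the dropped entries with zeros."""
    dense = [0.0] * length
    for i, v in pairs:
        dense[i] = v
    return dense

# Hypothetical gradient vector for one layer on one worker.
grad = [0.01, -0.9, 0.05, 0.4, -0.02, 0.0, 0.7, -0.03]

pairs = compress_top_k(grad, k=3)          # what would be sent over the network
restored = decompress(pairs, len(grad))    # what the receiver reconstructs

print(pairs)
print(restored)
```

Here three index-value pairs are transmitted instead of eight values. The saving grows with the size of the gradient, at the cost of dropping the small entries, which is the kind of trade-off any compression scheme for data-parallel training has to manage.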

Even More!

We include additional experiments and scientific contributions in the paper, including self-supervised pretraining, personalized recommendations, and extreme classification. Please check out the paper for more details.

Try Out BOLT

To try out BOLT for yourself, please see our suite of demo notebooks available on both Google Colab and Jupyter. These notebooks include examples of many of the experiments described in this post such as graph learning, distributed training, and recommendations. To learn more about ThirdAI for your business needs, please visit our website.