New paper dives into the technical details of BOLT, the first deep learning framework for efficiently training and deploying massive models on CPUs
Artificial intelligence is in the midst of a revolution, with extremely large-scale models demonstrating astounding capabilities that have captured the attention and imagination of the general public. However, training and deploying these massive neural networks remains unsustainable, and prohibitively expensive for all but the largest institutions, because it requires specialized hardware such as GPUs.
To democratize large-scale deep learning for all in a sustainable manner, we at ThirdAI rejected the widely held assumption that specialized hardware accelerators are essential for training and serving large models. To achieve this vision, we have built a new deep learning framework from scratch, called BOLT, that enables developers to train billion-parameter models, as well as deploy them, on ordinary low-cost CPU machines.
Novel Features for a Deep Learning Framework: We introduce several development features that, to our knowledge, are unique among production-grade deep learning frameworks. First and foremost is the concept of configurable sparsity, which lets developers set the amount of computation performed in a given layer (e.g., activating only 5% of the neurons in a forward pass). This sparsity parameter lets users gracefully trade training time against model quality.
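To make the idea concrete, here is a minimal sketch of a layer with a configurable sparsity knob. This is not BOLT's actual neuron-selection mechanism (the paper describes the real one); we simply sample a random subset of neurons each forward pass to show how a `sparsity` parameter bounds the work done per layer.

```python
import numpy as np

class SparseLayer:
    """Fully connected layer that computes only a fraction of its neurons.

    Illustrative sketch only: real systems select neurons with a smarter
    (e.g. input-aware) mechanism rather than uniform random sampling.
    """

    def __init__(self, in_dim, out_dim, sparsity=0.05, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(0, 0.01, size=(out_dim, in_dim))
        self.b = np.zeros(out_dim)
        self.sparsity = sparsity  # fraction of neurons computed, e.g. 5%

    def forward(self, x):
        out_dim = self.W.shape[0]
        k = max(1, int(self.sparsity * out_dim))
        # Pick which neurons to activate; all others stay exactly zero,
        # so only k dot products are computed instead of out_dim.
        active = self.rng.choice(out_dim, size=k, replace=False)
        y = np.zeros(out_dim)
        y[active] = self.W[active] @ x + self.b[active]
        return y, active

layer = SparseLayer(in_dim=128, out_dim=1000, sparsity=0.05)
y, active = layer.forward(np.ones(128))
print(len(active))  # 50 of the 1000 neurons were computed
```

Raising `sparsity` toward 1.0 recovers a fully dense layer, which is exactly the training-time/quality trade-off the parameter exposes.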
Second, we introduce the notion of dynamic sparse inference for model deployment, which further reduces latency by activating only a subset of neurons for a given input. As shown in the table below (taken from the paper), we discovered that sparse inference can, perhaps surprisingly, achieve essentially the same accuracy as a dense TensorFlow model with an order of magnitude reduction in latency.
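The key difference from a fixed sparsity pattern is that the active set depends on the input. The sketch below illustrates this with top-k selection; for clarity it scores neurons with exact pre-activations, whereas a real system would locate the important neurons with a cheap approximate index rather than a full matrix product (the mechanism shown is hypothetical, not BOLT's).

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(1000, 64))  # 1000 output neurons
b = np.zeros(1000)

def dense_infer(x):
    """Full ReLU layer: every neuron is computed."""
    return np.maximum(W @ x + b, 0.0)

def sparse_infer(x, k=100):
    """Keep only the k neurons with the largest pre-activations for THIS
    input; every other activation is zeroed without being computed in a
    real implementation."""
    pre = W @ x + b
    active = np.argpartition(pre, -k)[-k:]  # input-dependent active set
    out = np.zeros_like(pre)
    out[active] = np.maximum(pre[active], 0.0)
    return out

x = rng.normal(size=64)
dense = dense_infer(x)
sparse = sparse_infer(x, k=100)
# The largest activations survive; small ones are dropped.
print(np.isclose(sparse.max(), dense.max()))
```

Because the largest activations tend to dominate a layer's output, dropping the small ones often costs little accuracy, which is consistent with the results reported in the paper's table.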
With many modern machine learning datasets at terabyte scale and beyond, we knew that BOLT needed to support distributed training. Distributed deep learning approaches fall very roughly into two categories. One paradigm is data-parallel training, where each machine loads a copy of the model, trains on a separate shard of the data, and communicates gradients between nodes to update parameters. The second is model-parallel training, where the layers of the model are partitioned across multiple machines. Since CPU machines have much larger memory budgets than GPUs, we realized that we could fit even billion-parameter models entirely in CPU memory. Thus, we can avoid model-parallel training, which often brings greater engineering complexity, and focus on data-parallel training. We implemented our distributed BOLT library on top of Ray, an open-source framework for parallel computing.
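The data-parallel pattern can be sketched in a few lines. This toy example simulates the workers in-process on a linear-regression problem rather than launching Ray actors, but the structure is the same one distributed BOLT uses: replicated parameters, disjoint data shards, per-worker gradients, and an averaged (all-reduced) update applied by every replica.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
y = X @ true_w

n_workers = 4
shards = np.array_split(np.arange(len(X)), n_workers)  # disjoint data shards
w = np.zeros(2)   # parameters, conceptually replicated on every worker
lr = 0.1

for step in range(200):
    # Each "worker" computes a squared-error gradient on its own shard.
    grads = []
    for idx in shards:
        Xi, yi = X[idx], y[idx]
        grads.append(2 * Xi.T @ (Xi @ w - yi) / len(idx))
    # All-reduce: average the gradients, then every replica takes the
    # identical step, keeping the model copies in sync.
    w -= lr * np.mean(grads, axis=0)

print(np.round(w, 3))
```

In the real system each shard's gradient is computed on a different machine and the averaging step is what generates network traffic, which is why gradient communication becomes the bottleneck discussed next.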
Because our BOLT models are heavily optimized for speed, training was bottlenecked by the cost of communicating gradients between machines. We therefore designed new approaches for data-parallel training, including a novel gradient compression scheme, to scale better as the number of nodes increases.
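The paper describes BOLT's actual compression scheme; as a generic illustration of why compression shrinks communication, here is top-k gradient sparsification, a common technique in which only the largest-magnitude gradient entries are sent over the wire as (index, value) pairs.

```python
import numpy as np

def compress_topk(grad, ratio=0.01):
    """Keep only the largest-magnitude entries of a gradient.

    Illustrative sketch, not the paper's scheme: sending k indices and
    k values instead of the full vector cuts per-step communication
    roughly by the factor `ratio`.
    """
    k = max(1, int(ratio * grad.size))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]  # this pair is what goes over the network

def decompress_topk(idx, vals, size):
    """Rebuild a full-size (mostly zero) gradient on the receiver."""
    out = np.zeros(size)
    out[idx] = vals
    return out

rng = np.random.default_rng(0)
g = rng.normal(size=10_000)
idx, vals = compress_topk(g, ratio=0.01)      # 100 of 10,000 entries sent
g_hat = decompress_topk(idx, vals, g.size)
print(len(vals), np.count_nonzero(g_hat))
```

Schemes like this are often paired with error feedback, where each worker accumulates the entries it dropped and adds them back into the next step's gradient so the compression error does not build up.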
The paper contains additional experiments and scientific contributions, including self-supervised pretraining, personalized recommendations, and extreme classification. Please check out the paper for more details.