
Sparsity and Sustainability: Estimating the Carbon Footprint of ThirdAI’s UDT Framework

ThirdAI’s Universal Deep Transformers (UDT) AutoML interface, powered by our proprietary BOLT deep learning framework, delivers substantial reductions in energy consumption compared to popular pre-trained NLP models without sacrificing model quality.
**This post was written by Benjamin Meisburger, former intern at ThirdAI**
As sustainability becomes an increasingly critical requirement for organizations across all business sectors, reducing the cost and energy consumption of training and deploying large-scale AI models has emerged as an essential task. In the case of GPT-3, for instance, the electricity and compute cost of training alone was reported to be $12 million. This concern has only intensified in recent months as model sizes continue to balloon.
In this post, we study how ThirdAI’s BOLT engine, the framework behind our Universal Deep Transformers AutoML library, translates into carbon savings. In short, we find that BOLT’s ability to train sparse neural networks on everyday CPU hardware yields significant energy savings, producing only 2.5% of the carbon emissions associated with fine-tuning a RoBERTa model on a sentiment analysis task.


After reading Etsy’s Cloud Jewels blog post, I was curious to see whether we could replicate their methodology to ballpark the carbon footprint of training a model with BOLT at different levels of sparsity. My goals were twofold: (1) to determine whether our BOLT engine provides a meaningful reduction in carbon footprint compared to a state-of-the-art equivalent, and (2) to determine to what extent sparsity impacts BOLT’s net carbon footprint.


For my tests, I chose a straightforward and practical task: sentiment classification on Yelp reviews. This benchmark has a well-defined state-of-the-art model, RoBERTa fine-tuned for sentiment.
To standardize our testing as much as possible, all experiments were run on AWS instances in the us-west-1 region, providing a replicable and consistent framework. First, we fine-tuned an off-the-shelf pre-trained RoBERTa model on a p4d.24xlarge instance, which reached ~93% test accuracy in 40 minutes on a single A100 GPU (a p4d.24xlarge instance includes eight A100s). We used 93% accuracy as the threshold for all subsequent models. Next, we trained BOLT from scratch on an r6g.xlarge CPU instance with 20% sparsity (meaning the neural network used only 20% of the full dense matrix computations), which reached 93% accuracy in 42 minutes. We repeated this experiment with 10% sparsity, resulting in a training time of just over 20 minutes. Finally, we trained BOLT with 5% sparsity; however, this model only reached 90% accuracy, revealing a more pronounced tradeoff between model quality and energy consumption.
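Before factoring in grid carbon intensity or manufacturing costs, the raw energy gap between these runs can be sketched with a quick back-of-the-envelope calculation. The wattages below are illustrative assumptions, not measured values: roughly 400 W for a single A100, and about 15 W for the CPU share of an r6g.xlarge; the 10%-sparsity runtime is rounded to 21 minutes.

```python
# Back-of-the-envelope training energy for each run described above.
# Wattages are illustrative assumptions, not measured values.
runs = {
    "RoBERTa fine-tune (1x A100)": (40, 400.0),      # (minutes, watts)
    "BOLT, 20% sparsity (r6g.xlarge)": (42, 15.0),
    "BOLT, 10% sparsity (r6g.xlarge)": (21, 15.0),
}

# Energy (kWh) = power (kW) x time (hours)
energy_kwh = {
    name: (watts / 1000) * (minutes / 60)
    for name, (minutes, watts) in runs.items()
}

for name, kwh in energy_kwh.items():
    print(f"{name}: {kwh:.4f} kWh")
```

Even before accounting for emissions, the power-draw disparity between a GPU and a slice of a CPU server dominates the comparison: the runtimes are similar, but the energy consumed differs by more than an order of magnitude.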

To estimate the carbon footprint of a given AWS instance, we used the formula below. We obtained power consumption and manufacturing carbon footprint estimates from this dataset, which aggregates information via the methodology described in this blog post.

For example, when estimating the emissions from fine-tuning RoBERTa on a single A100 within a p4d.24xlarge instance:
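As a minimal sketch of this calculation, assuming the standard operational-plus-embodied formulation (operational emissions from power draw, plus an amortized share of the hardware's manufacturing footprint), the estimate might look like the following. Every constant here is an illustrative placeholder, not a value taken from the dataset referenced above.

```python
def estimate_carbon_kg(runtime_hours, instance_watts, pue,
                       grid_kg_per_kwh, manufacturing_kg, lifetime_hours):
    """Operational emissions plus the amortized manufacturing footprint."""
    # Operational: energy (kWh) scaled by data-center overhead (PUE)
    # and the carbon intensity of the regional grid.
    operational = (instance_watts / 1000) * runtime_hours * pue * grid_kg_per_kwh
    # Embodied: manufacturing carbon amortized over the hardware's lifetime.
    embodied = manufacturing_kg * (runtime_hours / lifetime_hours)
    return operational + embodied

# Illustrative placeholders only: a 40-minute run on a ~400 W accelerator,
# PUE of 1.2, a 0.25 kg CO2e/kWh regional grid, and 150 kg of embodied
# carbon amortized over a four-year service life.
footprint = estimate_carbon_kg(40 / 60, 400, 1.2, 0.25, 150, 4 * 365 * 24)
print(f"{footprint:.4f} kg CO2e")
```

With these placeholder constants, the operational term dominates for a GPU run; for short CPU runs, the amortized manufacturing term becomes a proportionally larger share of the total.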
This process was repeated for each level of sparsity, with the results compiled in the figure below. All BOLT networks were trained from scratch on a single r6g.xlarge instance, while RoBERTa was fine-tuned on a single A100 GPU.


As shown above, the BOLT engine offers a significant improvement over a state-of-the-art NLP model with regard to carbon footprint, producing on average just 2.5% of the carbon emissions of RoBERTa fine-tuned on a GPU. Note that this comparison does not account for the pre-training cost of RoBERTa, which is also substantial. For BOLT, increasing sparsity yields significant (though not quite linear) carbon savings, with the emissions savings beginning to plateau as the total carbon footprint approaches the unavoidable cost of manufacturing a CPU. These figures are estimates only, but we are confident they are conservative enough to absorb procedural error and should therefore be representative of the true footprint.

Use UDT and BOLT to Drive Sustainability and Lower Costs in Your Organization

Climate change and environmental sustainability are perhaps the defining challenges of our time. If you are concerned by the rising carbon footprint and exorbitant financial cost of training and deploying state-of-the-art AI models, we have an answer. By rejecting the conventional wisdom that large-scale neural networks require specialized, power-hungry hardware and dense computations, we have built a hands-off-the-wheel AutoML product, called UDT, that performs deep learning on ordinary CPU hardware with quality comparable to existing state-of-the-art techniques. As we have seen in our work with customers, UDT can substantially reduce training costs, accelerate real-time prediction latencies, and even improve model accuracy.

To try out UDT for your business needs, please reach out to us by requesting a trial license for our software. We also invite you to explore our Google Colab demo notebooks that showcase UDT in action.