
Benchmarking ThirdAI’s BOLT Engine on AMD EPYC 9004 Series CPUs

Introduction:
ThirdAI is a startup dedicated to democratizing state-of-the-art machine learning capabilities through
algorithmic and software innovations that enable training and deploying large models on low-cost
CPU hardware. Over the past year, customers across a broad range of industries have
successfully trained and deployed ThirdAI models for their business needs, often reporting both
improved model quality and reduced training and deployment costs after migrating their modeling
infrastructure to CPUs.

In addition to the proprietary algorithmic innovations at the heart of ThirdAI’s software, we have also
benefited from the continued advancements in CPU technology. In this report, we benchmark
ThirdAI’s BOLT deep learning engine on AMD’s latest-generation EPYC 9004 Series CPUs to
measure the benefits of combining our machine learning technology with state-of-the-art CPU hardware.

In short, we find that AMD EPYC 9004 Series CPUs dramatically accelerate BOLT training
across a variety of representative machine learning tasks, which further fuels our optimism about building
a future where deep learning on general-purpose CPUs becomes the standard.

Experiment 1: Graph Node Classification:

For our first benchmarking experiment, we explore the domain of machine learning on graphs. In
particular, we focus on the task of classifying nodes in the network given a set of features. For this
experiment, we evaluate on the Yelp-Chi fraud detection dataset and the Pokec social network.

In Table 1, we report the test performance of our BOLT model compared to two well-established
graph neural network baselines, Graph Convolutional Networks (GCN) and Graph Attention
Networks (GAT). We highlight that BOLT achieves state-of-the-art performance on the Yelp-Chi
dataset. In Table 2, we report the training time for all models on the AMD EPYC 9004 Series CPUs
and an NVIDIA A100 GPU.

| Model | Yelp (ROC-AUC) | Pokec (P@1) |
|---|---|---|
| ThirdAI’s BOLT | 91.1 (93.1 with longer training) | 78.0 |
| Graph Convolutional Network (GCN) | 63.62 | 75.45 |
| Graph Attention Network (GAT) | 81.42 | 71.77 |

Table 1: Test Set Performance
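For readers less familiar with the two metrics in Table 1, the sketch below shows one common way to compute them. It is an illustration only, not ThirdAI's evaluation code: the function names are hypothetical, and it assumes P@1 means the fraction of examples whose top-scoring class matches the true class.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative.

    labels: iterable of 0/1 ground-truth labels.
    scores: iterable of predicted scores for the positive class.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_at_1(true_labels, score_rows):
    """Fraction of examples whose highest-scoring class is the true class."""
    if not true_labels:
        raise ValueError("need at least one example")
    hits = 0
    for y, row in zip(true_labels, score_rows):
        top = max(range(len(row)), key=row.__getitem__)
        hits += int(top == y)
    return hits / len(true_labels)


# Toy inputs, made up for illustration (not taken from Yelp-Chi or Pokec).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
p_at_1 = precision_at_1([2, 0, 1], [[0.1, 0.2, 0.7], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1]])
print(f"ROC-AUC = {auc:.2f}, P@1 = {p_at_1:.2f}")
```

The pairwise loop is quadratic in the number of examples, so it suits small checks; a rank-based implementation would be the usual choice at dataset scale.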

 

| Model | Hardware | Yelp | Pokec |
|---|---|---|---|
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 10s (1s per epoch) | 439s (44s per epoch) |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 10s (0.9s per epoch) | 340.9s (38.1s per epoch) |
| Graph Convolution (GCN) | 4th Gen AMD EPYC 9754 CPU | 150s (6s per epoch) | 600s (24s per epoch) |
| Graph Convolution (GCN) | 4th Gen AMD EPYC 9654 CPU | 150s (6s per epoch) | 537.5s (21.5s per epoch) |
| Graph Attention (GAT) | 4th Gen AMD EPYC 9754 CPU | 200s (20s per epoch) | 1340s (134s per epoch) |
| Graph Attention (GAT) | 4th Gen AMD EPYC 9654 CPU | 230s (23s per epoch) | 1004s (110s per epoch) |
| Graph Convolution (GCN) | NVIDIA A100 | 4.51s (0.009s per epoch) | 22.86s (0.048s per epoch) |
| Graph Attention (GAT) | NVIDIA A100 | 22.5s (0.05s per epoch) | 59.3s (0.12s per epoch) |

Table 2: Training Time
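Table 2 lists both a total and a per-epoch time, so a rough epoch count can be derived by dividing one by the other. The sketch below does this for a few Yelp entries. The source does not state epoch counts, so these are estimates from rounded figures, and the helper name `implied_epochs` is hypothetical.

```python
# (model, hardware, total seconds, seconds per epoch), copied from the Yelp column of Table 2.
ROWS = [
    ("BOLT", "EPYC 9654", 10.0, 0.9),
    ("GCN", "EPYC 9654", 150.0, 6.0),
    ("GAT", "EPYC 9654", 230.0, 23.0),
    ("GCN", "NVIDIA A100", 4.51, 0.009),
    ("GAT", "NVIDIA A100", 22.5, 0.05),
]


def implied_epochs(total_s, per_epoch_s):
    """Rough epoch count implied by a total time and a per-epoch time."""
    if per_epoch_s <= 0:
        raise ValueError("per-epoch time must be positive")
    return total_s / per_epoch_s


for model, hardware, total_s, per_epoch_s in ROWS:
    epochs = implied_epochs(total_s, per_epoch_s)
    print(f"{model:5s} {hardware:12s} ~{epochs:.0f} epochs")
```

Because the implied epoch counts differ between rows, the total and per-epoch columns answer different questions: the first reflects time to a finished model under each setup, the second reflects raw per-pass throughput.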

 

Experiment 2: LSTM Benchmarks:
In our second experiment, we benchmark sequence-to-sequence (Seq2Seq) modeling
tasks, which involve generating an output sequence, such as text, in response to a given input
sequence. In particular, we focus on translation and transliteration, which consist of
outputting text in a target language given an input in a different source language. We compare
ThirdAI’s BOLT model to a standard LSTM model architecture on a 4th Gen AMD EPYC 9654 CPU and report the results
on two datasets below.

| Model | Hardware | Exact Sequence Accuracy | Training Time per epoch | Total Training Time | Inference Time per query |
|---|---|---|---|---|---|
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 27% | 50 min | 150 min | 0.38 ms |
| LSTM Seq2Seq | 4th Gen AMD EPYC 9654 CPU | 9.3% | 300 min | 900 min | 2.28 ms |

Table 3: Evaluation on the Aksharantar Multilingual Transliteration Dataset
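Exact sequence accuracy, the metric reported in Table 3, is strict: an output counts only if it matches the reference in full. A minimal sketch of that definition follows, assuming plain string comparison with no normalization; the function name and the toy strings are hypothetical and not drawn from the Aksharantar data.

```python
def exact_sequence_accuracy(predictions, references):
    """Fraction of predictions that match their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


# Toy example: the second prediction differs by a few characters, so it scores zero.
preds = ["namaste", "dilli", "mumbai"]
refs = ["namaste", "delhi", "mumbai"]
print(f"exact sequence accuracy = {exact_sequence_accuracy(preds, refs):.2f}")
```

Since a single wrong character fails the whole sequence, this metric tends to be much lower than per-character accuracy on the same outputs.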

 

| Model | Hardware | Accuracy | Training Time per epoch | Total Training Time | Inference time per query |
|---|---|---|---|---|---|
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 39% | 65 s | 65 s | 10 ms |
| LSTM Seq2Seq | 4th Gen AMD EPYC 9654 CPU | 20.31% | 586 s | 586 s | 29.3 ms |
| LSTM Seq2Seq | NVIDIA A100 | 20.31% | 16 s | 16 s | 12 ms |

Table 4: Evaluation on the Multi-30K Translation Dataset from English to German

 

Experiment 3: Tabular Dataset Benchmarks 

Tabular data remains one of the most prevalent formats in business applications, and one that is typically processed on CPUs with popular decision-tree-based machine learning frameworks such as XGBoost. In this experiment, we compare ThirdAI’s BOLT to XGBoost on a 4th Gen AMD EPYC 9654 CPU on two large-scale tabular datasets where deep learning provides a decisive advantage due to the scale of the data.

| Model | Hardware | Test Accuracy | Training Time per epoch | Total Training Time | Inference time per query |
|---|---|---|---|---|---|
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 98.45% | 65 s | 65 s | 3.05 ms |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 96.65% | 73 s | 73 s | 7 ms |
| XGBoost | 4th Gen AMD EPYC 9754 CPU | 70.49% | n/a | 909 s | 13.3 ms |
| XGBoost | 4th Gen AMD EPYC 9654 CPU | 70.49% | n/a | 622 s | 12 ms |

Table 5: Tabular Data Evaluation on the Character Font Dataset

 

| Model | Hardware | Test Accuracy | Training Time per epoch | Total Training Time | Inference time per query |
|---|---|---|---|---|---|
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 59.85% | 3.32 s | 3.32 s | 0.55 ms |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 60.01% | 1.48 s | 1.48 s | 1 ms |
| XGBoost | 4th Gen AMD EPYC 9754 CPU | 58.97% | n/a | 4.52 s | 3.38 ms |
| XGBoost | 4th Gen AMD EPYC 9654 CPU | 58.97% | n/a | 2.25 s | 5.6 ms |

Table 6: Tabular Data Evaluation on the Dota2 Dataset
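The tables above report inference time per query. The source does not describe how these latencies were measured, so the following is only one plausible approach: time a loop of single-query predictions after a short warm-up and average. The `predict` callable and the function name are hypothetical stand-ins for whatever model is being benchmarked.

```python
import time


def mean_latency_ms(predict, queries, warmup=10):
    """Average wall-clock milliseconds per single-query call to `predict`."""
    if not queries:
        raise ValueError("need at least one query")
    # Warm-up calls are excluded from timing so caches and lazy setup do not skew results.
    for q in queries[:warmup]:
        predict(q)
    start = time.perf_counter()
    for q in queries:
        predict(q)
    elapsed_s = time.perf_counter() - start
    return 1000.0 * elapsed_s / len(queries)


# Stand-in "model": sums a feature vector. Replace with a real predict function.
toy_queries = [[float(i)] * 64 for i in range(1000)]
print(f"{mean_latency_ms(sum, toy_queries):.4f} ms per query")
```

Single-query timing like this captures latency rather than throughput; batched inference would need a separate measurement.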

 

Experiment 4: Text Classification 

Text classification, the process of predicting a label for a given input text, is another fundamental machine learning task in business settings, with applications ranging from sentiment analysis to intent prediction. In this experiment, we evaluate BOLT against a state-of-the-art pre-trained RoBERTa model that requires access to a GPU for efficient training. We find that BOLT achieves close to state-of-the-art accuracy on two representative datasets while training on a 4th Gen AMD EPYC 9654 CPU in a fraction of the time RoBERTa requires.

Yelp Polarity 

| Model | Test Accuracy |
|---|---|
| ThirdAI’s BOLT | 92.3% (not pre-trained) |
| RoBERTa | 94.5% (pre-trained and fine-tuned) |

| Model | Hardware | Training Time per epoch | Total Training Time | Inference Latency |
|---|---|---|---|---|
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 23s (Full Training) | 230s | <1ms |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 13s (Full Training) | 130s | <1ms |
| RoBERTa | 4th Gen AMD EPYC 9654 CPU | 3hrs (Fine-Tuning) | 9.1hrs | 40ms |
| RoBERTa | NVIDIA A100 | 0.59hrs (Fine-Tuning) | 1.77hrs | |

Amazon Polarity 

| Model | Test Accuracy |
|---|---|
| ThirdAI’s BOLT | 89% (not pre-trained) |
| RoBERTa | 93% (pre-trained and fine-tuned) |

| Model | Hardware | Training Time per epoch | Total Training Time | Inference Latency |
|---|---|---|---|---|
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 117s (Full Training) | 348s | <1ms |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 60s (Full Training) | 186s | <1ms |
| RoBERTa | 4th Gen AMD EPYC 9654 CPU | 22.6hrs (Fine-Tuning) | 67.8hrs (Fine-Tuning) | 40ms |
| RoBERTa | NVIDIA A100 | 3.3hrs (Fine-Tuning) | 10hrs (Fine-Tuning) | |

Experiment 5: Criteo 46MM DLRM 

For our final experiment, we turn our attention to the Deep Learning Recommendation Model (DLRM) architecture for recommendations and personalization. This architecture is at the core of all major industrial recommendation systems and is responsible for billions of dollars in revenue each year. Given the importance of DLRM in commercial applications, it is also included in the official MLPerf benchmarking competition. In this experiment, we find that ThirdAI’s highly efficient CPU-based DLRM implementation on a 4th Gen AMD EPYC 9654 CPU trains roughly 4x faster than the official NVIDIA benchmark on an A100 GPU, with negligible impact on model quality.

| Model | Hardware | Test AUC | Training Time per epoch | Total Training Time | Inference throughput |
|---|---|---|---|---|---|
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 80.2% | 12.75 mins | 13 mins | 418K/sec |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 80.28% | 13.78 mins | 13.78 mins | 381K/sec |
| DLRM Official | NVIDIA A100 | 80.3% | 60 mins | 82.2 mins | 170K/sec |
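
The training speedup quoted for this experiment can be checked directly against the table. The short calculation below uses only the EPYC 9654 and A100 rows; the variable names are ours, and the ratios are simple quotients of the reported figures rather than new measurements.

```python
# Figures copied from the DLRM table: minutes per epoch, total minutes, thousands of queries/sec.
bolt_9654 = {"epoch_min": 13.78, "total_min": 13.78, "throughput_k": 381.0}
dlrm_a100 = {"epoch_min": 60.0, "total_min": 82.2, "throughput_k": 170.0}

# Time ratios: how many times longer the A100 baseline takes than BOLT on the CPU.
per_epoch_speedup = dlrm_a100["epoch_min"] / bolt_9654["epoch_min"]
total_speedup = dlrm_a100["total_min"] / bolt_9654["total_min"]

# Throughput ratio: BOLT queries per second relative to the A100 baseline.
throughput_ratio = bolt_9654["throughput_k"] / dlrm_a100["throughput_k"]

print(f"per-epoch training speedup: {per_epoch_speedup:.2f}x")
print(f"total training speedup:     {total_speedup:.2f}x")
print(f"inference throughput ratio: {throughput_ratio:.2f}x")
```

On these figures the per-epoch ratio is a little above 4x and the total-time ratio is close to 6x, while the AUC values differ by 0.02 percentage points.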