
Introduction:
ThirdAI is a startup dedicated to democratizing state-of-the-art machine learning capabilities through
algorithmic and software innovations that enable training and deploying large models on low-cost
CPU hardware. Over the past year, a number of customers across a broad swath of industries have
successfully trained and deployed ThirdAI models for their business needs, often reporting both
improved model quality and reduced training and deployment costs after migrating their modeling
infrastructure to CPUs.
In addition to the proprietary algorithmic innovations at the heart of ThirdAI’s software, we have also
benefited from the continued advancements in CPU technology. In this report, we benchmark
ThirdAI’s BOLT deep learning engine on AMD’s latest-generation EPYC 9004 Series CPUs to
measure the benefits of combining our machine learning technology with state-of-the-art CPU hardware.
In short, we find that the AMD EPYC 9004 Series CPUs dramatically accelerate BOLT training
across a variety of representative machine learning tasks, further fueling our optimism about a future
in which deep learning on general-purpose CPUs becomes the standard.
Experiment 1: Graph Node Classification:
AMD Blog: Benchmarking ThirdAI’s BOLT Engine on AMD EPYC 9004 Series CPUs
For our first benchmarking experiment, we explore machine learning on graphs. In
particular, we focus on the task of classifying the nodes of a network given a set of node features. For this
experiment, we evaluate on the Yelp-Chi fraud detection dataset and the Pokec social network dataset.
In Table 1 below, we report the test accuracy of our BOLT model compared to two well-
established graph neural network baselines: Graph Convolutional Networks (GCN) and Graph
Attention Networks (GAT). We highlight that BOLT achieves state-of-the-art
performance on the Yelp-Chi dataset. In Table 2, we report the training times for all models on the
AMD EPYC 9004 Series CPUs and an NVIDIA A100 GPU.
Table 1: Test accuracy on graph node classification.

| Model | Yelp-Chi (ROC-AUC) | Pokec (P@1) |
| --- | --- | --- |
| ThirdAI’s BOLT | 91.1 (93.1 with longer training) | 78.0 |
| Graph Convolutional Network (GCN) | 63.62 | 75.45 |
| Graph Attention Network (GAT) | 81.42 | 71.77 |
Table 2: Total and per-epoch training time.

| Model | Hardware | Yelp-Chi | Pokec |
| --- | --- | --- | --- |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 10s (1s per epoch) | 439s (44s per epoch) |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 10s (0.9s per epoch) | 340.9s (38.1s per epoch) |
| Graph Convolution (GCN) | 4th Gen AMD EPYC 9754 CPU | 150s (6s per epoch) | 600s (24s per epoch) |
| Graph Convolution (GCN) | 4th Gen AMD EPYC 9654 CPU | 150s (6s per epoch) | 537.5s (21.5s per epoch) |
| Graph Attention (GAT) | 4th Gen AMD EPYC 9754 CPU | 200s (20s per epoch) | 1340s (134s per epoch) |
| Graph Attention (GAT) | 4th Gen AMD EPYC 9654 CPU | 230s (23s per epoch) | 1004s (110s per epoch) |
| Graph Convolution (GCN) | NVIDIA A100 | 4.51s (0.009s per epoch) | 22.86s (0.048s per epoch) |
| Graph Attention (GAT) | NVIDIA A100 | 22.5s (0.05s per epoch) | 59.3s (0.12s per epoch) |
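For readers unfamiliar with the GCN baseline, its core operation is feature propagation over a symmetrically normalized adjacency matrix followed by a learned projection. The following is an illustrative NumPy sketch of a single GCN layer on a toy graph; it is not ThirdAI’s BOLT implementation, and all sizes and values are invented for the example.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization A_hat = D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, features, weight):
    """One graph-convolution layer: aggregate neighbor features, project, ReLU."""
    return np.maximum(a_norm @ features @ weight, 0.0)

# Toy 4-node path graph (0-1-2-3), 2 input features, 3 hidden units.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))   # node feature matrix
w = rng.normal(size=(2, 3))   # layer weight matrix
h = gcn_layer(normalize_adjacency(adj), x, w)
```

Stacking two such layers gives each node a view of its two-hop neighborhood, which is the standard GCN configuration used for node classification.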
Experiment 2: LSTM Benchmarks:
In our second experiment, we conduct a benchmark of sequence-to-sequence (Seq2Seq) modeling
tasks, which involve generating an output sequence, such as text, in response to a given input
sequence. In particular, we focus on the tasks of translation and transliteration, which consist of
generating text in a target language or script given an input in a different source language. We compare
ThirdAI’s BOLT model to a standard LSTM architecture on 4th Gen AMD EPYC CPUs and report the results
on two datasets below.
| Model | Hardware | Exact Sequence Accuracy | Training Time per Epoch | Total Training Time | Inference Time per Query |
| --- | --- | --- | --- | --- | --- |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 27% | 50 mins | 150 mins | 0.38 ms |
| LSTM Seq2Seq | 4th Gen AMD EPYC 9654 CPU | 9.3% | 300 mins | 900 mins | 2.28 ms |

| Model | Hardware | Accuracy | Training Time per Epoch | Total Training Time | Inference Time per Query |
| --- | --- | --- | --- | --- | --- |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 39% | 65 s | 65 s | 10 ms |
| LSTM Seq2Seq | 4th Gen AMD EPYC 9654 CPU | 20.31% | 586 s | 586 s | 29.3 ms |
| LSTM Seq2Seq | NVIDIA A100 | 20.31% | 16 s | 16 s | 12 ms |
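As background, the LSTM baseline processes a sequence one step at a time, using gated updates to a hidden state and a cell state. Below is a minimal NumPy sketch of a single LSTM step, a simplified stand-in for the benchmarked Seq2Seq encoder; the dimensions and random weights are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,).
    Gates are stacked in the order: input, forget, cell candidate, output."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    g = np.tanh(z[2 * H:3 * H])  # candidate cell state
    o = sigmoid(z[3 * H:4 * H])  # output gate
    c_new = f * c + i * g        # gated cell-state update
    h_new = o * np.tanh(c_new)   # new hidden state
    return h_new, c_new

# Encode a toy input sequence of 5 steps, feature dim 3, hidden dim 4.
rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
```

A Seq2Seq model runs one such recurrence over the input (the encoder) and a second over the output (the decoder), which is why per-step recurrence dominates LSTM training time.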
Experiment 3: Tabular Dataset Benchmarks
Tabular data remains one of the most prevalent data formats in business applications, and one that can typically be processed on CPUs through popular decision-tree-based machine learning frameworks such as XGBoost. In this experiment, we compare ThirdAI’s BOLT to XGBoost on 4th Gen AMD EPYC CPUs on two large-scale tabular datasets where deep learning provides a decisive advantage due to the scale of the data.
| Model | Hardware | Test Accuracy | Training Time per Epoch | Total Training Time | Inference Time per Query |
| --- | --- | --- | --- | --- | --- |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 98.45% | 65 s | 65 s | 3.05 ms |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 96.65% | 73 s | 73 s | 7 ms |
| XGBoost | 4th Gen AMD EPYC 9754 CPU | 70.49% | — | 909 s | 13.3 ms |
| XGBoost | 4th Gen AMD EPYC 9654 CPU | 70.49% | — | 622 s | 12 ms |
| Model | Hardware | Test Accuracy | Training Time per Epoch | Total Training Time | Inference Time per Query |
| --- | --- | --- | --- | --- | --- |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 59.85% | 3.32 s | 3.32 s | 0.55 ms |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 60.01% | 1.48 s | 1.48 s | 1 ms |
| XGBoost | 4th Gen AMD EPYC 9754 CPU | 58.97% | — | 4.52 s | 3.38 ms |
| XGBoost | 4th Gen AMD EPYC 9654 CPU | 58.97% | — | 2.25 s | 5.6 ms |
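XGBoost belongs to the family of gradient-boosted decision trees, in which each new tree is fit to the residual errors of the ensemble built so far. The following toy NumPy sketch illustrates that idea with depth-1 trees (decision stumps) on synthetic one-dimensional regression data; it is purely illustrative and omits XGBoost’s second-order gradients, regularization, and histogram-based split finding.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on 1-D inputs under squared loss."""
    best = None
    for thr in np.unique(x):
        left, right = residual[x <= thr], residual[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= thr, left.mean(), right.mean())
        loss = ((residual - pred) ** 2).sum()
        if best is None or loss < best[0]:
            best = (loss, thr, left.mean(), right.mean())
    return best[1:]  # (threshold, left value, right value)

def boost(x, y, rounds=20, lr=0.3):
    """Gradient boosting for regression: each stump fits the current residuals."""
    pred = np.full_like(y, y.mean())
    stumps = []
    for _ in range(rounds):
        thr, lv, rv = fit_stump(x, y - pred)
        pred = pred + lr * np.where(x <= thr, lv, rv)
        stumps.append((thr, lv, rv))
    return pred, stumps

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)
pred, stumps = boost(x, y)
mse = float(((y - pred) ** 2).mean())  # should fall well below the variance of y
```

Each boosting round is a sequential pass over the data, which is one reason tree ensembles can struggle to exploit very large datasets compared to minibatch-trained deep networks.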
Experiment 4: Text Classification
Text classification, the task of predicting a label for a given input text, is another fundamental machine learning task in business settings, with applications ranging from sentiment analysis to intent prediction. In this experiment, we evaluate BOLT against a state-of-the-art pre-trained RoBERTa model that requires access to a GPU for efficient training. We find that BOLT achieves close to state-of-the-art accuracy on two representative datasets while training on 4th Gen AMD EPYC CPUs in a fraction of the time required by RoBERTa.
Yelp Polarity
| Model | Test Accuracy |
| --- | --- |
| ThirdAI’s BOLT | 92.3% (not pre-trained) |
| RoBERTa | 94.5% (pre-trained and fine-tuned) |

| Model | Hardware | Training Time per Epoch | Total Training Time | Inference Latency |
| --- | --- | --- | --- | --- |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 23s (full training) | 230s | <1 ms |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 13s (full training) | 130s | <1 ms |
| RoBERTa | 4th Gen AMD EPYC 9654 CPU | 3 hrs (fine-tuning) | 9.1 hrs | 40 ms |
| RoBERTa | NVIDIA A100 | 0.59 hrs (fine-tuning) | 1.77 hrs | — |
Amazon Polarity
| Model | Test Accuracy |
| --- | --- |
| ThirdAI’s BOLT | 89% (not pre-trained) |
| RoBERTa | 93% (pre-trained and fine-tuned) |

| Model | Hardware | Training Time per Epoch | Total Training Time | Inference Latency |
| --- | --- | --- | --- | --- |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 117s (full training) | 348s | <1 ms |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 60s (full training) | 186s | <1 ms |
| RoBERTa | 4th Gen AMD EPYC 9654 CPU | 22.6 hrs (fine-tuning) | 67.8 hrs | 40 ms |
| RoBERTa | NVIDIA A100 | 3.3 hrs (fine-tuning) | 10 hrs | — |
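To make the task concrete, here is a minimal bag-of-words logistic-regression classifier in NumPy. It is a deliberately simple stand-in for both BOLT and RoBERTa that shows what a sentiment classifier consumes and produces; the vocabulary, documents, and hyperparameters are all invented for the example.

```python
import numpy as np

def build_vocab(texts):
    """Assign each unique whitespace-separated token an index."""
    vocab = {}
    for t in texts:
        for tok in t.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def featurize(text, vocab):
    """Bag-of-words count vector; tokens outside the vocabulary are ignored."""
    v = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

def train_logreg(texts, labels, vocab, epochs=200, lr=0.5):
    """Binary logistic regression trained with full-batch gradient descent."""
    X = np.stack([featurize(t, vocab) for t in texts])
    y = np.array(labels, dtype=float)
    w = np.zeros(len(vocab))
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted P(label = 1)
        w -= lr * X.T @ (p - y) / len(y)    # gradient of the log loss
    return w

def predict(w, vocab, text):
    return int(featurize(text, vocab) @ w > 0.0)

texts = ["great food loved it", "terrible service never again",
         "loved the staff great place", "awful food terrible place"]
labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative
vocab = build_vocab(texts)
w = train_logreg(texts, labels, vocab)
```

Production models like BOLT and RoBERTa differ from this sketch mainly in their feature representations (learned embeddings rather than raw counts) and model capacity, not in the basic input/output contract.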
Experiment 5: Criteo 46MM DLRM
For our final experiment, we turn our attention to the Deep Learning Recommendation Model (DLRM) architecture for personalization and recommendation. This architecture is at the core of all major industrial recommendation systems and is responsible for billions of dollars in revenue each year. Given the importance of DLRM in commercial applications, it is also included in the official MLPerf benchmarking competition. In this experiment, we find that ThirdAI’s highly efficient CPU-based DLRM implementation on a 4th Gen AMD EPYC 9654 CPU outperforms the official NVIDIA benchmark on an A100 GPU by a factor of 4x in training time with negligible impact on model quality.
| Model | Hardware | Test AUC | Training Time per Epoch | Total Training Time | Inference Throughput |
| --- | --- | --- | --- | --- | --- |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9754 CPU | 80.2% | 12.75 mins | 13 mins | 418K/sec |
| ThirdAI’s BOLT | 4th Gen AMD EPYC 9654 CPU | 80.28% | 13.78 mins | 13.78 mins | 381K/sec |
| DLRM Official | NVIDIA A100 | 80.3% | 60 mins | 82.2 mins | 170K/sec |
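At a high level, DLRM passes dense features through a bottom MLP, looks up categorical features in embedding tables, combines all resulting vectors via pairwise dot-product interactions, and feeds the result to a top MLP that outputs a click probability. The sketch below is a toy NumPy forward pass of that structure, with made-up sizes and random weights rather than the benchmarked Criteo configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy DLRM-style model: 3 dense features, 2 categorical features.
EMB_DIM = 4
tables = [rng.normal(scale=0.1, size=(10, EMB_DIM)),   # categorical feature 1, vocab 10
          rng.normal(scale=0.1, size=(20, EMB_DIM))]   # categorical feature 2, vocab 20

def mlp(x, weights):
    """Simple MLP: ReLU between layers, linear final layer."""
    for i, (w, b) in enumerate(weights):
        x = x @ w + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

bottom = [(rng.normal(scale=0.1, size=(3, EMB_DIM)), np.zeros(EMB_DIM))]
top = [(rng.normal(scale=0.1, size=(EMB_DIM + 3, 8)), np.zeros(8)),
       (rng.normal(scale=0.1, size=(8, 1)), np.zeros(1))]

def dlrm_forward(dense, cat_ids):
    """Bottom MLP embeds dense features; categorical ids index embedding
    tables; pairwise dot products of all vectors feed the top MLP."""
    vecs = [mlp(dense, bottom)] + [tables[i][c] for i, c in enumerate(cat_ids)]
    # Pairwise dot-product feature interactions (upper triangle of 3 vectors).
    inter = [float(vecs[i] @ vecs[j]) for i in range(3) for j in range(i + 1, 3)]
    z = np.concatenate([vecs[0], np.array(inter)])
    return 1.0 / (1.0 + np.exp(-mlp(z, top)))  # predicted click probability

p = dlrm_forward(np.array([0.5, -1.0, 2.0]), [3, 17])
```

In production, the embedding tables dominate the memory footprint, which is why sparse lookups rather than dense matrix multiplications are the main performance bottleneck for DLRM on any hardware.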