Two roads to Large Language Models (LLMs): CPUs with “dynamic sparsity” OR specialized hardware.
A Popular Task: Leveraging Domain-Specific Unstructured Text for AI
At ThirdAI, we have worked with several customers trying to leverage AI over millions of domain-specific unstructured texts to automate key decision-making processes. The first step in this direction is to generate semantically meaningful embeddings, which requires access to large language models for state-of-the-art performance. Embeddings are vector representations of entities, such as text documents, with the property that similarity between entities is reflected in the similarity of their vectors. Embeddings allow us to map raw, unstructured data into compact vector representations that we can leverage for a variety of computational tasks such as search, recommendation, explainability, or as inputs to predictive AI models. This article provides a nice intuitive explanation of embeddings.
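To make the use of embeddings concrete, here is a minimal sketch of similarity search over embedding vectors. The embedding step is a hypothetical stand-in for any embedding model (OpenAI's, ThirdAI's, or otherwise); only the vector-similarity logic is shown.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embedding vectors: 1.0 means identical direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec: np.ndarray, doc_vecs: list, k: int = 5) -> list:
    """Return indices of the k documents whose embeddings best match the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# doc_vecs would come from embedding each document; random vectors stand in here.
rng = np.random.default_rng(0)
doc_vecs = [rng.standard_normal(512) for _ in range(1_000)]
print(search(rng.standard_normal(512), doc_vecs))
```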
However, the cost of generating embeddings has traditionally remained prohibitively high, dominating the Total Cost of Ownership (TCO) of AI/NLP systems. In this post, we compare the capabilities, cost, and carbon footprint of OpenAI, the most popular hosted cloud solution for embedding generation, and ThirdAI, which offers both hosted cloud and on-premise solutions for embeddings.
Comparisons
Motivating Scenario: Our target company has about 100 million documents, averaging 5,000 tokens per document. The company wants to fine-tune its large language model on these documents every month, both because it keeps generating new documents and information and because it wants to correct mistakes made by the previous month's model. In a given month, the company also serves about 100 million queries from its clients with the fine-tuned model.
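For scale, a quick back-of-the-envelope calculation of the monthly token volume implied by this scenario (the per-query token count is our assumption, since the scenario does not specify it):

```python
num_docs = 100_000_000        # documents fine-tuned on each month
tokens_per_doc = 5_000        # average tokens per document
num_queries = 100_000_000     # queries served per month
tokens_per_query = 50         # ASSUMPTION: not specified in the scenario

corpus_tokens = num_docs * tokens_per_doc      # 500 billion tokens/month
query_tokens = num_queries * tokens_per_query  # 5 billion tokens/month
print(f"fine-tuning corpus: {corpus_tokens:.2e} tokens/month")
print(f"query traffic:      {query_tokens:.2e} tokens/month")
```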
OpenAI: OpenAI is an AI research and deployment company. OpenAI uses a significant number of GPUs to train its AI models, and it also hosts those models on GPU platforms.
ThirdAI: ThirdAI is a startup focused on the democratization of AI. It builds a new software stack that leverages smarter algorithms and “dynamic sparsity” rather than dense matrix multiplications and specialized hardware accelerators. ThirdAI runs all heavyweight AI computations, including the training of large language models (LLMs), on commodity CPUs, and it is the only company that enables training large neural models on CPU hardware. Building models on CPUs allows ThirdAI to train and refresh models right where the data is generated, which also reduces data transfer costs. ThirdAI also supports distributed data-parallel training on Ray clusters, where we take advantage of the larger memory capacity of CPU machines to train large distributed networks without costly model-parallel splitting.
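To illustrate the data-parallel pattern on a Ray cluster (this shows the general pattern only, not ThirdAI's actual API, which is linked in the Code section below; the `Worker` gradient logic is a hypothetical placeholder), each worker computes gradients on its shard of the data and the driver averages them:

```python
import numpy as np
import ray

ray.init()  # connect to, or locally start, a Ray cluster

@ray.remote
class Worker:
    """One data-parallel worker holding a shard of the training data."""
    def __init__(self, shard):
        self.shard = shard

    def compute_gradients(self, params):
        # Hypothetical placeholder: a real worker runs the model's
        # forward/backward pass over its shard and returns the gradient.
        return np.zeros_like(params)

params = np.zeros(1024)
shards = np.array_split(np.random.rand(10_000, 32), 4)
workers = [Worker.remote(s) for s in shards]

for step in range(10):
    grads = ray.get([w.compute_gradients.remote(params) for w in workers])
    params -= 0.01 * np.mean(grads, axis=0)  # averaged-gradient step
```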
Monthly Cost of Ownership:

We use the CHEAPEST option, Ada, from OpenAI, as described on their pricing page. The costliest one, DaVinci, is 50 times more expensive.
ThirdAI offers several pricing options, including a flat software subscription cost, which we exclude from the above estimate. Instead, we show the cost associated with renting machines on AWS to complete an embedding generation job via ThirdAI’s CPU-based deep learning software stack.
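The two cost models reduce to simple arithmetic: per-token pricing for the hosted API versus instance-hours for rented CPU machines. A sketch with placeholder prices (all rates below are assumptions; consult the respective pricing pages for current numbers):

```python
# Hosted API: pay per token processed.
price_per_1k_tokens = 0.004      # USD; ASSUMPTION, see OpenAI's pricing page
monthly_tokens = 500e9 + 5e9     # corpus + query tokens, from the sketch above
api_cost = monthly_tokens / 1_000 * price_per_1k_tokens

# CPU rental: pay per instance-hour on AWS.
hourly_rate = 1.50               # USD per instance-hour; ASSUMPTION
instances, hours = 20, 100       # ASSUMPTION: size of the monthly job
cpu_cost = instances * hours * hourly_rate

print(f"hosted API:  ${api_cost:,.0f}/month")
print(f"CPU rental:  ${cpu_cost:,.0f}/month")
```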
Comment: OpenAI seems quite expensive, and we are not the only ones to say so. Another study also found their pricing to be outrageously high.
Capabilities Comparison:

Carbon Footprint Comparison:

We estimate the carbon footprint for model inference only and assume that the fine-tuning footprint scales proportionally with the forward-pass cost of the neural network. To calculate the carbon footprint, we used a standard online tool, which takes as input the run time, the computational hardware, and the memory footprint. For ThirdAI, these inputs are straightforward, as we use standard Xeon CPUs.
For OpenAI, we do not know precisely which hardware is in use, but we would need at least 8 A100 GPUs to run equivalent inference with the OPT models released by Meta. The carbon footprint estimation tool we use does not yet provide numbers for A100 GPUs, so we assume 8 V100 GPUs as the required hardware, which makes our estimate conservative. We also assumed a typical 60 ms inference latency on that hardware, again likely an underestimate. We kept all other options in the tool exactly the same for both companies.
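The arithmetic underlying such tools is roughly: energy = power draw × run time, and emissions = energy × grid carbon intensity. A minimal sketch (the power draws and carbon intensity below are representative assumptions, not the tool's exact figures):

```python
def co2_kg(power_draw_watts: float, hours: float,
           kg_co2_per_kwh: float = 0.475) -> float:
    """Emissions = energy (kWh) times grid carbon intensity.
    0.475 kg CO2e/kWh is a rough global-average intensity (ASSUMPTION)."""
    energy_kwh = power_draw_watts / 1_000 * hours
    return energy_kwh * kg_co2_per_kwh

# Representative power draws (ASSUMPTIONS): ~300 W per V100, ~200 W per Xeon socket.
gpu_emissions = co2_kg(power_draw_watts=8 * 300, hours=100)  # 8 V100 GPUs
cpu_emissions = co2_kg(power_draw_watts=2 * 200, hours=100)  # dual-socket Xeon
print(f"GPU: {gpu_emissions:.0f} kg CO2e, CPU: {cpu_emissions:.0f} kg CO2e")
```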
With this methodology, we see that ThirdAI lowers the carbon footprint of large-scale AI model serving by an order of magnitude. These findings align with our earlier study of the carbon footprint of our models.
Technological Leap: Beyond Tensors to Hash-Based Deep Learning on CPUs: The good news is that algorithmic breakthroughs have brought remarkable progress. Recently, hash-based, sparsity-inducing algorithms running on CPUs have been shown to train large neural models orders of magnitude faster than state-of-the-art GPUs. These advances are likely to change the economics of AI infrastructure for training large neural models.
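The core idea behind these algorithms (as in SLIDE-style training) is to use locality-sensitive hashing to select, for each input, only the small set of neurons likely to have large activations, and to compute just those. Below is a minimal sketch of the selection step using random-hyperplane (SimHash) hashing; the single hash table and the selection policy are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_neurons, n_bits = 128, 10_000, 12

W = rng.standard_normal((n_neurons, d))    # layer weights, one row per neuron
planes = rng.standard_normal((n_bits, d))  # random hyperplanes for SimHash

def simhash(v: np.ndarray) -> int:
    """n_bits-bit signature: the sign pattern of v against the hyperplanes."""
    return int((planes @ v > 0) @ (1 << np.arange(n_bits)))

# Pre-hash every neuron's weight vector into a bucket (done once, refreshed rarely).
buckets: dict = {}
for i in range(n_neurons):
    buckets.setdefault(simhash(W[i]), []).append(i)

def sparse_forward(x: np.ndarray) -> dict:
    """Compute activations only for neurons whose hash collides with the input's.
    Colliding neurons are the ones likely to have large inner products with x."""
    active = buckets.get(simhash(x), [])
    return {i: float(W[i] @ x) for i in active}

x = rng.standard_normal(d)
print(f"computed {len(sparse_forward(x))} of {n_neurons} neuron activations")
```

Real systems use many hash tables and re-hash periodically as the weights drift; the key point is that each forward and backward pass touches only a small, input-dependent fraction of the layer, which is what makes CPUs competitive.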
Code:
OpenAI: Here is a link to OpenAI’s documentation.
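For reference, a minimal sketch of generating an embedding with the `openai` Python package as of this writing (the engine name and response shape follow OpenAI's documentation at the time; consult the linked docs for the current interface):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Request an embedding for one document from the Ada family,
# the cheapest option discussed above.
response = openai.Embedding.create(
    input="An example domain-specific document.",
    engine="text-similarity-ada-001",
)
embedding = response["data"][0]["embedding"]  # a list of floats
print(len(embedding))
```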
ThirdAI: Here is a link to ThirdAI’s simple API for training and generating embeddings. Check out a nice visualization of the 110,000-product Amazon catalog from Kaggle here. The training notebook is also available. We can train a 512-dimensional neural embedding model on the complete 110,000-product catalog on a 16-core CPU box in about 3 hours; we can even train it on Google Colab. A demo of a search engine built from the unsupervised, raw-text catalog can be found here.
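For flavor, here is a hypothetical sketch of what such a train-and-embed workflow looks like. Every name below (`thirdai_embedding`, `train`, `embed`, the file path) is an illustrative placeholder, not ThirdAI's actual API; see the linked notebook for the real interface.

```python
# All names in this sketch are hypothetical placeholders, not ThirdAI's real API.
import thirdai_embedding  # hypothetical module

# Train a 512-dimensional embedding model on the raw-text product catalog.
model = thirdai_embedding.train(
    corpus="amazon_110k_catalog.csv",  # hypothetical path to the Kaggle catalog
    embedding_dim=512,
    epochs=5,
)

# Embed a query for downstream vector-similarity search over the catalog.
query_vec = model.embed("wireless noise-cancelling headphones")
```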