
GPT vs. Domain-Specialized LLMs: Jack of All Trades vs. Master of Few

If you are not “pre-training” LLMs on your local “domain-specific” text corpus, you are likely seeing sub-optimal results for your application.

LLMs and Neural Scaling Laws

Large Language Models (LLMs), ChatGPT being a famous illustration, are now a must-have strategy in most enterprises and industry verticals. LLMs are massively large neural networks for language, typically comprising over a billion parameters. They are an ideal illustration of the power of Neural Scaling Laws, which imply that outrageously large models, when given a large amount of data, automatically start exhibiting emergent intelligence and reasoning. This phenomenon cannot be observed at a small scale with small models trained on limited data. A good analogy is Einstein’s theory of relativity, whose effects are only measurable when a particle nears the speed of light and are otherwise unnoticeable in the everyday world around us, where Newton’s laws seem perfectly accurate.

LLM Jargon: Pre-training, Fine-tuning, and Embeddings

It is common knowledge that LLMs are first pre-trained over a large amount of unsupervised text from various domains. We wouldn’t need pre-training if we had “enough” supervised data. However, unsupervised information in the form of natural text is often freely available, while labeled supervised data is rare. Natural text carries enough information to provide self-supervision for training these massive LLMs. For example, if we randomly pick a sentence from the internet – “We need to peel bananas before eating” – we can create a supervised task of filling in the blanks, just like in a kindergarten textbook. For the input text “We need to ____ bananas before eating”, the answer is the word “peel.” It is easy to see that every word of every sentence on the internet provides one input-output example, creating enough self-supervised data to train the LLMs. This training on raw text by generating self-supervised tasks is called “pre-training”.
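As a toy illustration (not any particular model’s actual tokenizer or training pipeline), here is how a single raw sentence can be turned into fill-in-the-blank training pairs:

```python
# Turn one raw sentence into self-supervised (input, target) examples by
# masking out one word at a time. Purely illustrative: real LLM pre-training
# uses subword tokenizers and internet-scale corpora.
sentence = "We need to peel bananas before eating"
tokens = sentence.split()

examples = []
for i, word in enumerate(tokens):
    masked = tokens[:i] + ["____"] + tokens[i + 1:]
    examples.append((" ".join(masked), word))

for masked_text, target in examples:
    print(f'input: "{masked_text}"  ->  target: "{target}"')
```

Every sentence in the corpus yields as many such examples as it has words, which is where the “enough self-supervised data” comes from.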
After the pre-training process, LLMs are expected to understand language. They have learned a “semantic vector representation”, also called an embedding, of any given text. The text can be words, sentences, paragraphs, conversations, documents, etc. In formal terms, after pre-training, the vector representations of the words “Apple” and “Orange” will be closer in vector distance than those of “Apple” and “House”. Similarly, the vector representations of the texts “What city has the Empire State Building?” and “New York City” will be closer to each other than to the representation of the text “Houston”.
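For intuition, here is a minimal sketch of how “closeness” between embeddings is usually measured. The 4-dimensional vectors below are made up for illustration only; real embeddings are learned by the model and have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings, invented for illustration -- not produced by any real model.
embeddings = {
    "Apple":  [0.9, 0.8, 0.1, 0.0],
    "Orange": [0.8, 0.9, 0.2, 0.1],
    "House":  [0.1, 0.0, 0.9, 0.8],
}

print(cosine_similarity(embeddings["Apple"], embeddings["Orange"]))  # high: related concepts
print(cosine_similarity(embeddings["Apple"], embeddings["House"]))   # low: unrelated concepts
```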
Since LLMs are generally pre-trained on public information, the widely accepted wisdom is to fine-tune them with supervised, labeled data to specialize them for a domain-dependent task. It is taken for granted that LLMs will do their best on whatever supervised data is available.
However, in many domains, minimal supervised information is associated with the text. Worse, in many fields, we don’t have any at all. Most enterprises sit on a pile of unstructured, domain-specific text in the form of internal documentation, customer interactions, and much more.

General LLMs (like GPT) vs. Domain-Specific LLMs:

There is another possibility here: pre-training the LLMs on the self-supervised task associated with the domain-specific text. This pre-training is likely more targeted, as the model learns the language of the domain, the jargon, the correlations between different terms, etc.
To illustrate the difference, let’s compare representations of text obtained from the “generic” GPT-3 (text-ada-002) model from OpenAI with a “specialized” billion-parameter FoodUDT-1B model from ThirdAI that is pre-trained on 500K recipes. Our FoodUDT-1B model specializes in recipe search. We compare the internal representations of these two models using simple word representations. Let’s start with the words “apple,” “iPhone,” and “apple pie.” According to GPT-3’s vector representations, “apple” is closer to “iPhone” than to “apple pie.” On the other hand, FoodUDT-1B thinks that the vector representations of “apple” and “apple pie” are significantly closer than those of “apple” and “iPhone.” It is easy to see the reason: FoodUDT-1B only has mastery of food recipes, while GPT-3 has mastery of the internet, where “apple” is a company brand that produces the “iPhone”; hence, the two words are very related.
Let’s now consider “meat,” “sugar,” and “dessert.” Here, GPT-3 thinks all three words are roughly equally related, with “meat” being slightly closer to “sugar” than “dessert” is (hard to tell why). For ThirdAI’s FoodUDT-1B, on the other hand, “meat” and “sugar” are less similar than “dessert” and “sugar.” It again makes sense: for generic GPT-3, all three are generic food items in the global context of the whole world, while for the specialized FoodUDT-1B, dessert recipes always have sugar, whereas meat recipes often do not.
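A rough sketch of how such a comparison can be reproduced: rank candidate words by embedding similarity to an anchor word under each model. The tiny hand-made vectors below only mimic the qualitative behavior described above; they are not outputs of GPT-3 or FoodUDT-1B.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(embed, anchor, candidates):
    """Sort candidate words from most to least similar to the anchor word."""
    return sorted(candidates, key=lambda c: cosine(embed(anchor), embed(c)), reverse=True)

# Placeholder "models": lookup tables of 3-d vectors chosen by hand to mimic
# the behavior described in the text, not real GPT-3 or FoodUDT-1B embeddings.
generic_vecs = {"apple": [0.9, 0.7, 0.1], "iPhone": [0.8, 0.8, 0.0], "apple pie": [0.5, 0.1, 0.8]}
food_vecs    = {"apple": [0.1, 0.9, 0.7], "iPhone": [0.9, 0.1, 0.0], "apple pie": [0.1, 0.8, 0.8]}

print(rank_by_similarity(generic_vecs.get, "apple", ["iPhone", "apple pie"]))  # 'iPhone' ranks first
print(rank_by_similarity(food_vecs.get,    "apple", ["iPhone", "apple pie"]))  # 'apple pie' ranks first
```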
When it comes to food recipes, GPT-3 knows about them only at a high level, with a basic, common knowledge of food. FoodUDT-1B, at the same time, is specialized in food and has an internal understanding at much deeper levels.
Please note that FoodUDT-1B is trained only on raw recipes and has no access to any supervised data mapping queries to a recipe.

Check out the Performance Difference in 5 minutes with ThirdAI’s three lines of Python code:

It is relatively easy to see the difference between GPT and domain-specialized LLMs on a benchmark dataset called SciFact in a few minutes. ThirdAI’s simple script (3 function calls of generic Python code) gets better accuracy than SOTA models. The critical difference is that ThirdAI’s model was pre-trained, from scratch, just on the SciFact text from the BEIR benchmark (which literally takes a few minutes on a desktop), and there is no other pre-training.
Yet another comparison to play with: here is OpenAI’s embedding script for getting embeddings and searching with their pre-trained GPT models on the AG News text corpus. Here is ThirdAI’s script on the same dataset and task, except that the model was pre-trained from scratch just on the AG News text corpus (no other pre-training). The training takes less than 5 minutes on any reasonable desktop CPU (no GPUs needed).
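For reference, the generic-embedding route looks roughly like the sketch below: embed a corpus, embed a query, and rank documents by cosine similarity. It assumes the pre-1.0 `openai` Python SDK and the `text-embedding-ada-002` model, and uses a three-document toy corpus instead of the actual AG News data; the linked scripts remain the definitive versions.

```python
import numpy as np
import openai  # assumes the pre-1.0 SDK; newer SDKs use openai.OpenAI().embeddings.create

openai.api_key = "YOUR_API_KEY"

# Toy stand-in corpus; the real comparison uses the AG News text corpus.
docs = [
    "Stocks rallied after the central bank held interest rates steady.",
    "The national team clinched the championship in overtime.",
    "Researchers unveiled a new battery chemistry for electric cars.",
]

def embed(texts, model="text-embedding-ada-002"):
    """Return an (n, d) array of embeddings for a list of texts."""
    resp = openai.Embedding.create(input=texts, model=model)
    return np.array([item["embedding"] for item in resp["data"]])

doc_vecs = embed(docs)
query_vec = embed(["Who won the game last night?"])[0]

# Cosine-similarity search: normalize, then rank by dot product.
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)
print(docs[int(np.argmax(doc_vecs @ query_vec))])  # expected: the sports headline
```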

Fundamental Tradeoffs between Domain Expertise and General Expertise:

There is a deeper, fundamental trade-off here. We can get an expert food model, like FoodUDT-1B, that understands food at a finer level, where the world of concepts is just food. Or we can get a generic model that views food in light of all the concepts around us. We cannot get one model that does both, because the model can only have one representation for a given text; after all, it is a program, or a function. A generic GPT model trained on diverse datasets views “meat” and “sugar” as high-level entities that belong to food commodities and related industries and, hence, are similar. An expert network like FoodUDT-1B, trained only on recipes, views “meat” and “sugar” as non-complementary flavors: meat and sugar usually don’t go together in a recipe. For a food expert, everything is about the pairing, the taste, the enhancement, etc.
If you are looking for a formal argument, it goes as follows. A GPT model is a function that optimizes its internal representation for the average loss over the union of all the information. FoodUDT-1B, at the same time, is a function that optimizes its representation for the average loss over information related only to recipes. The minimizer of a sum of losses over several concepts is not the same as the minimizer of the loss over an individual concept, because the minimum is not a linear operator.
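In symbols (our notation, not taken from any formal reference), writing L_d(θ) for the average pre-training loss on domain d:

```latex
\[
\theta^{*}_{\text{generic}} \;=\; \operatorname*{arg\,min}_{\theta} \sum_{d=1}^{D} L_d(\theta)
\qquad\text{whereas}\qquad
\theta^{*}_{\text{food}} \;=\; \operatorname*{arg\,min}_{\theta} L_{\text{food}}(\theta).
\]
% In general these two minimizers differ: the parameters that minimize the sum
% of losses over all domains need not minimize the loss of any single domain.
```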
The space of LLMs is no different from the world we live in. It is almost impossible to find a chef who can also do the best stock market analysis, knows all the news, can teach math and chess, and makes great art. We can find someone who knows all the right keywords and appears brilliant at first glance, in short, a jack of all trades. It is, however, unlikely that a model that knows language and all the right keywords will be a true expert. The tradeoffs are far more fundamental.