A case study: retrieval over 8 million MS MARCO passages with under-100ms response times on a commodity CPU.
What lies at the heart of new-generation services such as ed-tech platforms, enterprise search, and automated customer service solutions (like chatbots)? A fundamental piece of technology called ‘Question Answering’. This facet of Natural Language Understanding is one of the most challenging problems that modern Deep Learning is trying to solve.
Given an input question such as “When did Beyoncé start becoming popular?”, the first prerequisite of a question answering system is to return a small paragraph of text, out of several million possibilities, containing the information that can answer the input question.
In our current case, the relevant paragraph is the first paragraph from Wikipedia: “Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say; born September 4, 1981) is an American singer, songwriter, and actress. Born and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. She rose to fame in the late 1990s as the lead singer of Destiny’s Child, one of the best-selling girl groups of all time. Their hiatus saw the release of her debut album Dangerously in Love (2003), which featured the US Billboard Hot 100 number-one singles “Crazy in Love” and “Baby Boy”.”
Once the relevant paragraph is found, the subsequent task is a relatively standard language model inference step: highlighting the sentence, or phrase, that contains the answer.
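To make the extraction step concrete, here is a deliberately simplified sketch that scores each sentence by token overlap with the question. This is only a caricature: a real system runs a neural reader model over the passage to score answer spans, not a word-overlap heuristic.

```python
import re

def tokens(text):
    # Crude normalization: lowercase and keep alphanumeric tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def best_sentence(question, paragraph):
    # Score each sentence by shared tokens with the question.
    # A real reader (e.g. a fine-tuned transformer) scores spans instead.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    return max(sentences, key=lambda s: len(tokens(s) & tokens(question)))

para = ("Beyonce was born in Houston. "
        "She rose to fame in the late 1990s as the lead singer of Destiny's Child.")
print(best_sentence("When did Beyonce rise to fame?", para))
# -> "She rose to fame in the late 1990s as the lead singer of Destiny's Child."
```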
The hardest part of the question answering system is to find the relevant paragraph out of tens to hundreds of millions of possible documents.
Why Elasticsearch and current NLP models like ColBERT need a breakthrough to be feasible
Finding the matching documents requires us to go beyond the simple keyword matching popularly offered by Elasticsearch. For instance, in the current example, the only keyword that the query and paragraph have in common is “beyonce” (after text normalization). Keyword matching won’t figure out that “fame” and “popular” are related. As expected, Elasticsearch alone leads to poor accuracy.
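This failure mode is easy to reproduce with a toy token-overlap check, the kind of lexical signal BM25-style engines build on (the normalization here is just lowercasing and punctuation stripping, an assumption for illustration):

```python
import re

def tokens(text):
    # Crude normalization: lowercase and keep alphanumeric tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

query = "When did Beyonce start becoming popular?"
passage = ("Beyonce performed in various singing and dancing competitions "
           "as a child. She rose to fame in the late 1990s.")

overlap = tokens(query) & tokens(passage)
print(overlap)  # -> {'beyonce'}: "popular" never matches "fame"
```

The overlap contains only “beyonce”; no lexical scorer can reward the passage for containing “fame” when the query says “popular”.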
Deep Learning has made remarkable strides in developing state-of-the-art models in this space, such as ColBERT. ColBERT and similar models rely on three fundamental steps: 1) obtain a pre-trained language model and fine-tune it on your data; 2) index all your documents/passages using the learned model; 3) develop an inference pipeline that takes a question/query in real time and highlights an answer in the most relevant passage. They achieve significantly higher accuracy than lexical-only Elasticsearch.
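The three steps above can be sketched end to end with a stand-in embedding function. Everything here is illustrative: a real system would use a fine-tuned transformer for step 1 and store dense learned vectors, whereas the `embed` below just builds normalized term-count vectors.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Step 2 stand-in: index all passages offline. A real system stores
# dense embeddings from the fine-tuned model, not term counts.
passages = [
    "Beyonce rose to fame in the late 1990s as the lead singer of Destiny's Child.",
    "The Amazon rainforest produces a large share of the world's oxygen.",
]
vocab = sorted({t for p in passages for t in tokenize(p)})

def embed(text):
    # Step 1 stand-in: map text to a fixed-size unit vector.
    counts = Counter(tokenize(text))
    vec = [float(counts[t]) for t in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

index = [(p, embed(p)) for p in passages]

# Step 3: embed the query in real time and rank passages by cosine similarity.
def search(query, k=1):
    q = embed(query)
    scored = sorted(index, key=lambda pe: -sum(a * b for a, b in zip(q, pe[1])))
    return [p for p, _ in scored[:k]]

print(search("When did Beyonce become popular?"))
```

Note that this lexical stand-in still cannot connect “popular” to “fame”; the fine-tuned model in step 1 is precisely what closes that gap.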
ColBERT requires expensive GPU boxes and several days to train and index. Furthermore, new documents arrive every day (or every hour if you’re a large entity), and the learned models keep drifting from reality. This necessitates constant re-training and re-indexing, which is simply infeasible or too expensive for many companies.
And the woes don’t stop there; question answering is also latency-critical. Any system expecting users to wait more than 100ms per query will likely see poor adoption. Systems like ColBERT are too slow to achieve low latency on standard hardware. GPUs are not ideal for low-latency inference either: batching queries improves throughput but does not reduce individual latencies, and batching is not always feasible under variable traffic.
The ThirdAI difference
ThirdAI delivers accurate and performant AI on commodity CPUs. Our technology is based upon scientifically proven hash-based processing algorithms, which unlock game-changing accuracy by training significantly bigger models with ease. As a result, commodity CPUs are sufficient to capitalize on the scaling laws of neural networks.
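To give a flavor of what “hash-based processing” means, here is a generic locality-sensitive hashing sketch in the SimHash style: sign bits of dot products with hyperplanes, so that similar vectors land in the same bucket and dissimilar ones usually do not. This is a textbook LSH illustration, not a description of ThirdAI’s proprietary BOLT internals; the hyperplanes are hand-picked (normally random Gaussian) so the example is reproducible.

```python
# Fixed hyperplanes for reproducibility; real LSH draws them at random.
planes = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, -1.0, -1.0],
]

def simhash(vec):
    # Bit i is the sign of the dot product with hyperplane i.
    bits = 0
    for i, plane in enumerate(planes):
        if sum(a * b for a, b in zip(vec, plane)) >= 0:
            bits |= 1 << i
    return bits

a = [1.0, 0.8, 0.1, 0.0]    # some embedding
b = [0.9, 0.9, 0.0, 0.1]    # near-duplicate of a
c = [-0.8, 0.1, 0.9, 0.7]   # unrelated vector

print(simhash(a), simhash(b), simhash(c))  # -> 7 7 2
```

Because lookups touch only the colliding bucket instead of every candidate, this family of techniques trades exhaustive dense scoring for near-constant-time retrieval, which is what makes large models tractable on CPUs.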
We benchmark our engine on MS MARCO, one of the most popular information retrieval benchmarks. The result is summarized in the table below:
Benchmark: document retrieval on MS MARCO (8M paragraphs).**

Methods compared (all with 16 vCPUs, k = 1000):

- ColBERT v2 (SOTA), CPU
- ThirdAI BOLT Small
- ThirdAI BOLT Big

** P50 is the latency at the 50th percentile, meaning 50% of queries see latency below P50. MRR is Mean Reciprocal Rank. R@k is the recall of the best answer within the top-k retrieved passages.
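As a concrete illustration of the P50 metric, here is a nearest-rank percentile over per-query timings. The latency numbers are made up for the example; they are not benchmark results.

```python
def percentile(samples, p):
    # Nearest-rank percentile: value below which roughly p% of samples fall.
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[rank]

latencies_ms = [42, 38, 95, 51, 47, 60, 44, 39, 120, 48]  # toy per-query timings
p50 = percentile(latencies_ms, 50)
print(p50)
```

P50 is reported instead of the mean because a few slow outliers (the 95ms and 120ms queries above) inflate an average while leaving the typical user experience unchanged.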
We can thus obtain the speed and scalability of lexical search with the accuracy of the best NLP models.