Question Answering

Doc Search Demo

We demonstrate state-of-the-art retrieval accuracy with sub-100 ms latency for document search on a modest CPU, 25x faster than ColBERT inference on CPU.

A case study: retrieval over 8 million MS MARCO passages with less than 100 ms response time on a commodity CPU.

[Figure: Question Answering System using Natural Language Processing — user questions flow into BOLT, which returns answers.]
What lies at the heart of new-gen services like the ed-tech industry, the enterprise search industry, and automated customer service solutions (like chatbots)? A fundamental piece of technology called ‘Question Answering’. This facet of Natural Language Understanding is one of the most challenging problems that modern Deep Learning is trying to solve.

Given an input question such as “When did Beyoncé start becoming popular?”, the first prerequisite of a question answering system is to return a small paragraph of text, out of several million possibilities, that contains the information needed to answer the question.

In our current case, the relevant paragraph is the first paragraph from Wikipedia: “Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say; born September 4, 1981) is an American singer, songwriter, and actress. Born and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. She rose to fame in the late 1990s as the lead singer of Destiny’s Child, one of the best-selling girl groups of all time. Their hiatus saw the release of her debut album Dangerously in Love (2003), which featured the US Billboard Hot 100 number-one singles ‘Crazy in Love’ and ‘Baby Boy’.”

Once the relevant paragraph is found, the subsequent task is a relatively standard language model inference to highlight the sentence or phrase containing the answer.
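
To make this second stage concrete, here is a minimal sketch of extractive answer highlighting using the open-source Hugging Face transformers library (an illustration of the standard approach, not the system described in this post; the checkpoint named below is just one common choice):

```python
# Minimal sketch of the extractive-QA stage using the open-source
# Hugging Face `transformers` library. This illustrates the standard
# approach, not the specific system described in this post.
from transformers import pipeline

# Any extractive-QA checkpoint works; this one is fine-tuned on SQuAD.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

question = "When did Beyoncé start becoming popular?"
paragraph = (
    "Beyoncé Giselle Knowles-Carter (born September 4, 1981) is an American "
    "singer, songwriter, and actress. She rose to fame in the late 1990s as "
    "the lead singer of Destiny's Child, one of the best-selling girl groups "
    "of all time."
)

result = qa(question=question, context=paragraph)
print(result["answer"])  # e.g. "the late 1990s"
```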

The hardest part of the question answering system is to find the relevant paragraph out of tens to hundreds of millions of possible documents.

Why do Elasticsearch and current NLP models like ColBERT need a breakthrough to be feasible?

Finding the matching documents requires us to go beyond the simple keyword matching popularly offered by Elasticsearch. For instance, in the current example, the only keyword that the query and the paragraph have in common is “Beyonce” (after text normalization). Keyword matching won’t figure out that “fame” and “popular” are related. As expected, Elasticsearch leads to poor accuracy.
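
The failure mode is easy to see with a toy normalization-and-overlap check (a self-contained illustration; the tokenizer and stop-word list below are simplifications, not Elasticsearch’s actual analysis chain):

```python
# Toy illustration of why pure keyword matching struggles here: after
# simple normalization, the query and the passage share only "beyonce".
# (Simplified tokenizer and stop-word list, not Elasticsearch's actual
# analysis pipeline.)
import re

STOPWORDS = {"when", "did", "start", "the", "of", "in", "as", "to", "a"}

def keywords(text):
    # Lowercase, strip the accent naively, keep alphabetic tokens.
    text = text.lower().replace("é", "e")
    return {t for t in re.findall(r"[a-z]+", text)
            if len(t) > 1 and t not in STOPWORDS}

query = "When did Beyoncé start becoming popular?"
passage = ("Beyoncé rose to fame in the late 1990s as the lead singer "
           "of Destiny's Child.")

print(keywords(query) & keywords(passage))  # {'beyonce'}
```

Nothing ties “popular” in the query to “fame” in the passage, so the passage scores no better than millions of others that merely mention Beyoncé.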

Deep Learning has made remarkable strides in developing state-of-the-art models in this space like ColBERT. ColBERT and similar models rely on three fundamental steps: 1) obtain a pre-trained language model and fine-tune it on your data; 2) index all your documents/passages using the learned model; 3) build an inference pipeline that takes a question/query in real time and highlights an answer in the most relevant passage. They achieve significantly higher accuracy than lexical-only Elasticsearch.
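
At the core of ColBERT’s retrieval step is its “late interaction” (MaxSim) scoring rule: embed every token of the query and of the document, then sum, over query tokens, the maximum similarity to any document token. A minimal NumPy sketch of just this scoring step (random vectors stand in for the fine-tuned BERT token embeddings):

```python
# Sketch of ColBERT-style "late interaction" (MaxSim) scoring. Real
# ColBERT obtains the token embeddings from a fine-tuned BERT; random
# vectors stand in here just to show the scoring math.
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """score(q, d) = sum over query tokens of max cosine sim to any doc token."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                 # (num_query_tokens, num_doc_tokens)
    return sim.max(axis=1).sum()  # best doc token per query token, summed

rng = np.random.default_rng(0)
query = rng.normal(size=(6, 128))   # 6 query tokens, 128-dim embeddings
doc_a = rng.normal(size=(40, 128))  # two candidate passages
doc_b = rng.normal(size=(55, 128))
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```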

ColBERT requires expensive GPU boxes and several days to train and index. Furthermore, new documents arrive every day (or every hour if you’re a large entity), and the learned models keep drifting from reality. This necessitates constant re-training and re-indexing, which is simply infeasible or too expensive for many companies.

And the woes don’t stop there: question answering is also latency-critical. Any system expecting users to wait more than 100 ms per query will likely see poor adoption. Systems like ColBERT are too slow to achieve low latency on standard hardware, and GPUs are not ideal for low-latency inference because batching queries improves throughput but does not reduce individual latencies. Furthermore, batching is not always feasible under variable traffic.

The ThirdAI difference

ThirdAI delivers accurate, performant AI on commodity CPUs. Our technology is based on scientifically proven hash-based processing algorithms, which unlock game-changing accuracy by training significantly bigger models with ease. As a result, commodity CPUs are sufficient to capitalize on the scaling laws of neural networks.
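
ThirdAI’s actual algorithms are proprietary, but the general intuition behind hash-based processing can be illustrated with classic locality-sensitive hashing: similar vectors land in the same hash bucket, so only a small bucket of candidates needs exact scoring instead of the full collection. A generic random-hyperplane (SimHash) sketch, purely for intuition:

```python
# Generic illustration of the hash-based idea via random-hyperplane LSH
# (SimHash). This is NOT ThirdAI's proprietary algorithm -- only an
# intuition for how hashing can replace exhaustive scoring.
import numpy as np

rng = np.random.default_rng(42)
dim, n_planes = 128, 16
planes = rng.normal(size=(n_planes, dim))  # random hyperplanes

def lsh_bucket(v):
    # The sign pattern against each hyperplane gives a 16-bit bucket id.
    bits = (planes @ v > 0).astype(int)
    return int("".join(map(str, bits)), 2)

# Index: hash every document vector into its bucket.
docs = rng.normal(size=(10_000, dim))
buckets = {}
for i, d in enumerate(docs):
    buckets.setdefault(lsh_bucket(d), []).append(i)

# Query: only the vectors in the query's bucket are scored exactly.
query = docs[123] + 0.01 * rng.normal(size=dim)  # near-duplicate of doc 123
candidates = buckets.get(lsh_bucket(query), [])
print(len(candidates))    # a small candidate set, not 10,000
print(123 in candidates)  # the near-duplicate usually shares the bucket
```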

We benchmark our engine on MS MARCO, one of the most popular information retrieval benchmarks. The results are summarized in the table below:

Benchmark Document Retrieval on MS MARCO (8M paragraphs).**

| Method (all with 16 vCPUs, k = 1000) | P50 ms/Q | P95 ms/Q | P99 ms/Q | MRR@10 | R1@50 | R1@1000 |
|---|---|---|---|---|---|---|
| ColBERT v2 (SOTA), CPU | 721 | 873 | 949 | 0.395 | 0.856 | 0.965 |
| ThirdAI BOLT Small | 71 | 93 | 101 | 0.355 | 0.821 | 0.954 |
| ThirdAI BOLT Big | 100 | 129 | 140 | 0.386 | 0.850 | 0.962 |
| Elasticsearch | 53 | 98 | 136 | 0.174 | 0.550 | 0.809 |

** P50 is the 50th-percentile latency per query (ms/Q): 50% of queries see latency below P50, and similarly for P95 and P99. MRR is Mean Reciprocal Rank. R1@k is the recall of the best answer within the top-k retrieved passages.
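
For readers unfamiliar with these metrics, here is a short sketch of how they are computed from per-query measurements (the numbers below are made up for illustration, not benchmark data):

```python
# How the table's metrics are computed, on made-up per-query data
# (illustrative only, not the benchmark measurements).
import numpy as np

# Latency percentiles over per-query latencies in milliseconds.
latencies = np.array([71.0, 85.2, 93.1, 64.7, 101.3, 78.9])
p50, p95, p99 = np.percentile(latencies, [50, 95, 99])

# ranks[i] = rank of the first relevant passage for query i (None = miss).
ranks = [1, 3, None, 2, 12]

# MRR@10: mean of 1/rank, counting 0 when the rank exceeds 10.
mrr_at_10 = np.mean([1.0 / r if r is not None and r <= 10 else 0.0
                     for r in ranks])

# Recall@k: fraction of queries whose best answer appears in the top k.
def recall_at_k(ranks, k):
    return np.mean([r is not None and r <= k for r in ranks])

print(p50, p95, p99)
print(mrr_at_10, recall_at_k(ranks, 50), recall_at_k(ranks, 1000))
```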

We can thus obtain the speed and scalability of lexical search with the accuracy of the best NLP models.

Unlock the complete power of AI with ThirdAI’s BOLT

Tier 1 (Base): Core Deep Learning Recommendation Engine
- Reduced cost
- State-of-the-art AI/NLP accuracy
- < 1 ms inference latency on CPUs
- Privacy compliant

Tier 2 (Add-on): Sequential and Personalization
- Captures temporal patterns in user behavior and product choices
- Personalized search and recommendation based on contextual information

Tier 3 (Add-on): Graph Neural Network and Explainability
- Ingests relationships between users and products
- Identifies the most relevant features

Tier 4 (Add-on): Continual Learning
- Model accuracy improves over time with usage
- Automatically identifies and rectifies mistakes

ThirdAI’s BOLT unlocks the power of AI and NLP (Natural Language Processing) in any product search engine. Our push-button solution integrates with existing search engines without requiring any effort from engineers or data scientists. BOLT consumes historical data of past customer interactions and automatically builds and deploys the AI model with the highest observed accuracy, at 100x lower AI cost.

BOLT gives you a complete handle on training, retraining, and deployment in any environment or infrastructure of your choice. Our scalable AI can be easily extended to provide personalization. In addition, BOLT can leverage and mine known relationships between entities in the form of a graph. Furthermore, BOLT provides a continual-learning solution, which means the model evolves automatically with more usage.

In addition to autotuning all the hyperparameters, BOLT also autotunes for the available resources and latency budget. Therefore, there is no need to spend on expensive data scientists and ML engineers. Overall, BOLT takes care of the complete AI cycle on any infrastructure.