
Understanding the Fundamental Limitations of Vector-Based Retrieval for Building LLM-Powered Chatbots (Part 1/3)

This blog is the first in a series of posts explaining why the mainstream pipeline for deploying domain-specialized chatbots with large language models (LLMs) is too costly and inefficient. In this first post, we discuss why vector databases, despite the recent surge in popularity, are fundamentally limited when deployed in real production pipelines. In our following posts, we illustrate how our latest product releases at ThirdAI address these shortcomings and deliver on the vision of deploying LLM-powered retrieval in production at low cost.

Motivation

Domain-specialized chatbots are among the most popular enterprise applications of ChatGPT. An automated Q&A capability over a specified corpus of knowledge can make any employer’s workforce more productive and save employees precious time. For illustration, if you are an employee interacting with a client, having all historical interactions with that client at your fingertips is very handy. If you want to contribute to a large code base, you become far more productive when you can quickly locate existing functionality at a fine-grained level. The list goes on.
ChatGPT is a great conversational tool, and it is trained on a huge amount of textual information found on the internet. If you ask ChatGPT about general knowledge from the internet, it can answer quite well. However, it has some significant limitations. ChatGPT cannot answer questions whose answers are not part of its training data. So if you ask ChatGPT, “Who won the soccer world cup 2022?” it won’t be able to answer as it is not trained on any information after September 2021. Enterprises sit on piles of very specialized, proprietary, and constantly updating corpora of information, and ChatGPT out of the box won’t be a query assistant for that knowledge base. To make things worse, it is now well known that queries to ChatGPT without proper guardrails are likely to result in made-up answers.
Fortunately, there is significant activity around addressing the above two shortcomings using prompting.

What is Prompting?

Prompting is a new term for giving a conversational agent all the specific information it needs to answer a question, and then relying on the agent’s conversational capabilities to produce a polished answer. If you want ChatGPT to answer a specific question that is not part of its training set, you essentially have to make ChatGPT aware of all the information it needs in fewer than 4096 tokens (roughly 3200 words; the limit rises to 25,000 words for GPT-4) and then ask it the same question with the given “context.”
However silly it may sound, prompting is still a valuable capability. Automating human-like conversation is a rare feat that we have only recently achieved through remarkable advances in Generative AI. Effectively, building a query assistant boils down to the classical problem of “retrieving information” relevant to the query and then generating conversational answers grounded in the retrieved information using the capabilities of ChatGPT. This automatically puts guardrails around hallucinations, because the conversational agent is forced to ground its answers in the retrieved text, which is a subset of the knowledge base.
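As a rough illustration of what prompting with context amounts to, the sketch below packs retrieved text into a prompt until a word budget runs out. It rests on several simplifying assumptions: words are used as a crude proxy for tokens (a real system would count tokens with the model's own tokenizer), the 3200-word budget is simply the figure quoted above, and the function name `build_prompt` and the instruction wording are hypothetical.

```python
def build_prompt(question, chunks, max_words=3200):
    """Pack context chunks into a prompt without exceeding a word budget.

    `chunks` is assumed to be ordered from most to least relevant, so we
    stop at the first chunk that no longer fits.
    """
    header = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        "Context:\n"
    )
    footer = f"\n\nQuestion: {question}\nAnswer:"

    # Words are a crude stand-in for tokens (assumption, see lead-in).
    budget = max_words - len(header.split()) - len(footer.split())

    kept = []
    for chunk in chunks:
        size = len(chunk.split())
        if size > budget:
            break
        kept.append(chunk)
        budget -= size

    return header + "\n\n".join(kept) + footer


context = [
    "Argentina won the 2022 FIFA World Cup, beating France on penalties in the final.",
]
print(build_prompt("Who won the soccer world cup 2022?", context))
```

The model never needs to have seen the answer during training; it only has to rephrase what the context already says, which is why the quality of the retrieved chunks matters so much.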

The hardest part is always finding the needle in the haystack!

Embedding and the Vector Database Ecosystem: Building a Grounded Query Assistant with ChatGPT on Any Given Corpus

There has been a flurry of chatbot applications built using LangChain, where you can bring in any text corpus and interact with it using ChatGPT. All those applications are built on a standard embedding-based information retrieval process.
The process has two main phases. The first phase is a pre-processing step that generates embeddings and builds a vector index for near-neighbor search. Once the index is built, the second phase is querying. We briefly go over the two phases.
1. Pre-processing Step: This step takes all the raw text and builds an index that can be searched efficiently. The process is described in the figure below.

The Preprocessing Step Overview: You need to store both the text and its vector embedding in the database, with the vector being the KEY. The process requires an LLM to convert text chunks to vectors, and the same LLM must be used for querying. Caution: Any change or update to the LLM requires re-indexing everything in the Vector DB. You need the exact same LLM for querying. Changing dimensions is not allowed. Privacy Risk: All the text needs to go to both the embedding model and the vector database. Costly: Every token in the complete text corpus goes to both the LLM and the Vector DB.

Let’s say we have a corpus of text documents to prepare for Q&A. The first step is breaking the corpus into small blocks of text, which we call chunks (the process is also called chunking). Each chunk is then fed to a trained language model like BERT or GPT to generate a vector representation, also known as an embedding. Each text-embedding pair is then stored in a vector database or store, with the vector embedding as the KEY and the text chunk as the VALUE. The distinguishing feature of a vector database is its ability to efficiently perform approximate near-neighbor (ANN) search over vectors for KEY matching, instead of the exact KEY matching of a traditional database.
  • Caution: Any changes or updates to the LLM require re-indexing everything in the Vector DB. You need the exact same LLM for querying. Changing dimensions is not allowed.
  • Privacy Risk: All text must go to the embedding models and the vector database. If both are different managed services, you create two copies of your COMPLETE data at two different places.
  • Be Cost Aware: Every token in the complete text corpus goes to the LLM and the Vector DB. In the future, if you update your LLM by fine-tuning, upgrading the model, or even increasing your dimensionality, you need to re-index and pay the full cost again.
  • Cost Estimate with Managed Services: Consider a modest example: building a healthcare Q&A chatbot on a knowledge base of all the PubMed abstracts. PubMed has about 35M abstracts, which break into roughly 100M chunks requiring about 100M embeddings. Assuming 250 tokens per chunk, that is about 25B tokens. Even if we use a modest vector DB plan (Performance) from Pinecone and a cheaper embedding model (Babbage V1) from OpenAI, we are looking at approximately $7000–8000 per month for the vector DB, excluding any storage fees. In addition, there is a one-time cost of $125,000 for embedding generation, based on the number of tokens, and we pay that $125,000 again every time we change the embedding model. If we serve 100M queries a month, we pay an additional recurring expense of at least $250,000 per month for the query embedding service and response generation with OpenAI. It is worth noting that PubMed is one of the smaller public retrieval datasets. Enterprises are likely sitting on corpora 10–100x larger.
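The arithmetic behind the estimate above can be reproduced in a few lines. This is only a sketch: the rate of $0.005 per 1K tokens is back-computed from the figures in this post ($125,000 for 25B tokens) rather than quoted from a current price list, the recurring total uses the lower bounds given above, and all names are illustrative.

```python
def embedding_cost(num_chunks, tokens_per_chunk, usd_per_1k_tokens):
    """Return (total tokens, one-time embedding cost in USD)."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens, total_tokens / 1000 * usd_per_1k_tokens


NUM_CHUNKS = 100_000_000       # ~100M chunks from ~35M PubMed abstracts
TOKENS_PER_CHUNK = 250         # assumption stated in the post
USD_PER_1K_TOKENS = 0.005      # assumption: implied by $125,000 / 25B tokens

total_tokens, one_time_usd = embedding_cost(
    NUM_CHUNKS, TOKENS_PER_CHUNK, USD_PER_1K_TOKENS
)

# Recurring costs, using the lower bounds quoted in the post.
VECTOR_DB_USD_PER_MONTH = 7_000
QUERY_USD_PER_MONTH = 250_000
monthly_usd = VECTOR_DB_USD_PER_MONTH + QUERY_USD_PER_MONTH

print(f"tokens to embed:          {total_tokens:,}")
print(f"one-time embedding cost:  ${one_time_usd:,.0f}")
print(f"recurring cost per month: at least ${monthly_usd:,}")
```

Because the one-time cost scales linearly with the number of tokens, a corpus 10x larger costs 10x more to embed, and the whole amount is paid again on every re-index.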
2. Query Phase: Embed and ANN Search followed by Generation via a Prompt
This step takes a user-typed question, searches the vector database for the text content “most relevant” to the question, and then solicits a response from a generative AI model based on that information. The steps are summarized in the figure below.

The Q&A Phase: You need the exact same LLM for the question embedding that was used while indexing the text chunks. The LLM cannot be modified after indexing. Any training or tuning will make the search process unusable, because the ANN search over KEYs may no longer be consistent. If you want to update or change the LLM, you need to re-index. Caution: The query latency is the sum of the embedding latency, the Vector DB query latency, and the generative model’s text generation latency.

For the Q&A phase, the process is straightforward. We first generate the vector embedding of the query using the same LLM that was used for indexing the Vector DB. This embedding serves as the query KEY, and an approximate near-neighbor (ANN) search is performed to find the few vectors in the DB closest to the query embedding. The measure of closeness is pre-defined and fixed, and is usually cosine similarity. Once the close vectors are identified, their corresponding text chunks serve as the information relevant to the question. The relevant information and the question are then fed to a generative AI model like ChatGPT via a prompt to generate a response.
  • Caution: The query latency is the sum of three latencies: the latency of embedding the question text, the Vector DB retrieval latency, and the generative model’s response generation latency. If you use several managed services and different micro-services, be prepared to wait at least hundreds of milliseconds before getting an answer. Clearly, this is prohibitively slow for search engines, e-commerce, and other latency-critical applications where more than 100ms of latency leads to a poor user experience. Here is an Amazon blog on how every 100ms of delay costs 1% in sales.
  • Cost: As discussed in the previous section, the cost of querying can be significant, and you are locked in once your data is on external managed services.
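The two phases can be sketched end to end as follows. This is a toy under loudly stated assumptions, not how any particular vector database or LLM works: the "embedding model" is a normalized bag-of-words over a vocabulary fixed at indexing time (a stand-in for an LLM embedding), the search is an exact cosine-similarity scan where a vector DB would use ANN, and every function name is hypothetical.

```python
import math
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(text, words_per_chunk=50):
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]


def fit_vocab(chunks):
    """Toy 'model': a word-to-dimension map fixed at indexing time."""
    words = sorted({w for c in chunks for w in tokenize(c)})
    return {w: i for i, w in enumerate(words)}


def embed(text, vocab):
    """Toy embedding: L2-normalized word counts over the fixed vocabulary."""
    vec = [0.0] * len(vocab)
    for w in tokenize(text):
        if w in vocab:
            vec[vocab[w]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents):
    """Phase 1: store (KEY = embedding, VALUE = text chunk) pairs."""
    chunks = [c for doc in documents for c in chunk(doc)]
    vocab = fit_vocab(chunks)
    return vocab, [(embed(c, vocab), c) for c in chunks]


def search(question, vocab, index, k=1):
    """Phase 2: embed the question with the SAME model, rank by cosine."""
    q = embed(question, vocab)
    ranked = sorted(index,
                    key=lambda kv: sum(a * b for a, b in zip(q, kv[0])),
                    reverse=True)
    return [text for _, text in ranked[:k]]


docs = [
    "The deploy script lives in the tools directory.",
    "The invoice for Acme was paid in March.",
]
question = "Where is the deploy script?"

vocab, index = build_index(docs)
top = search(question, vocab, index, k=1)
prompt = f"Context:\n{top[0]}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```

Even in this toy, the vocabulary plays the role of the embedding model: refitting it changes every KEY, so the stored vectors no longer line up with new query vectors and the whole index has to be rebuilt.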

The Known Fundamental Limits of Embed and Vector Search: Why Modern Information Retrieval Wisdom Advocates for Learning to Index

Along with the issues of latency, cost, inflexibility in updating models, and privacy mentioned above, there is a fundamental shortcoming in disconnecting the embedding process (KEY generation) from cosine-similarity-based ANN search (text retrieval).
A Far-Fetched Assumption and Andrej Karpathy’s Recent Experiments: The implicit assumption behind the whole ecosystem is that the cosine similarity between vector embeddings will retrieve relevant text. It is known that better options may exist. These LLMs are not fine-tuned for cosine similarity retrieval, and other similarity functions are likely to work better. Here is Andrej Karpathy’s post and his notebook on how he found SVM-based similarity to be better.
The deep learning revolution has taught us that a jointly optimized retrieval system is always better than a disconnected process of embedding followed by ANN search, where the ANN step is entirely oblivious to the embedding step and vice versa.
Thus, if the ultimate aim of the vector search ecosystem is to retrieve “relevant text” for a question asked, why have two disconnected processes? Why not have a unified learned system that, given a question text, returns the “most relevant” text? No wonder Andrej found that a learned SVM is better than simple dot product retrieval. The information retrieval community has been building such jointly optimized embedding and retrieval systems for almost half a decade.
The most potent form of Neural Information Retrieval is Learning to Index. In Part 2/3 of this blog, we will review learning to index and discuss learned systems previously deployed in industry. We will walk through Neural Database, an end-to-end learning-to-index system that completely bypasses expensive and cumbersome high-dimensional near-neighbor search over vectors.

In the final part (Part 3/3), we will discuss ThirdAI’s production-ready Neural Database APIs and their integration with LangChain and ChatGPT. Our solution completely sidesteps the embedding process, along with the expense, slowness, and rigidity of vector database retrieval. We can’t wait to share what we’ve built!