Embedding and the Vector Database Ecosystem: Build a grounded query assistant with ChatGPT on any given corpus.
The Preprocessing Step Overview: You need to store both the text and its vector embedding in the database, with the vector serving as the KEY. The process requires an LLM (embedding model) to convert each text chunk into a vector, and the same model must be used at query time.
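A minimal sketch of this preprocessing step, assuming a placeholder `embed` function and a plain dict in place of a real vector DB with ANN search (both are illustrative stand-ins, not any vendor's API):

```python
# Sketch: chunk the corpus, embed each chunk, store (vector KEY -> text) pairs.
# `embed` is a toy placeholder; a real system would call an embedding model.

def embed(text: str) -> tuple:
    # Hash characters into a tiny fixed-size vector, purely for illustration.
    vec = [0.0] * 4
    for i, ch in enumerate(text):
        vec[i % 4] += ord(ch)
    return tuple(vec)

def index_corpus(chunks):
    # The vector is the KEY; the original text chunk is the payload.
    return {embed(chunk): chunk for chunk in chunks}

store = index_corpus(["aspirin reduces fever", "insulin regulates glucose"])

# At query time you MUST embed with the exact same model, or lookups break:
query_key = embed("aspirin reduces fever")
assert store[query_key] == "aspirin reduces fever"
```

This also makes the caveats below concrete: swap `embed` for a different model (or a different dimensionality) and every KEY in the store is invalidated, forcing a full re-index.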
- Caution: Any changes or updates to the LLM require re-indexing everything in the Vector DB. You need the exact same LLM for querying. Changing the embedding dimensionality is not allowed.
- Privacy Risk: All text must go to the embedding models and the vector database. If both are different managed services, you create two copies of your COMPLETE data at two different places.
- Be Cost Aware: Every token in the complete text corpus goes to both the LLM and the Vector DB. In the future, if you update your LLM by fine-tuning, upgrading the model, or even increasing your dimensionality, you need to re-index and pay the full cost again.
- Cost Estimate with Managed Services: Consider a modest estimate for a healthcare Q&A chatbot built on the knowledge base of all PubMed abstracts. PubMed has about 35M abstracts, which translate to roughly 100M chunks needing about 100M embeddings. Assuming 250 tokens per chunk, that is about 25B tokens. Even with a modest vector DB plan (Performance) from Pinecone and a cheaper embedding model (Babbage V1) from OpenAI, we are looking at approximately $7,000–8,000 per month for the vector DB, excluding any storage fees. In addition, there is a one-time cost of $125,000 for embedding generation based on the number of tokens, and we pay that $125,000 again every time we change the embedding model. If we serve 100M queries a month, we pay an additional recurring expense of at least $250,000 per month for the query embedding service and response generation with OpenAI. It is worth noting that PubMed is one of the smaller public retrieval datasets; enterprises are likely sitting on corpora 10–100x larger.
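The arithmetic behind this estimate can be reproduced in a few lines. The per-token and per-query prices below are the assumptions used in this estimate (circa-2023 list prices), not current vendor quotes:

```python
# Back-of-the-envelope cost model for the PubMed example above.
CHUNKS = 100_000_000                 # ~35M abstracts -> ~100M chunks
TOKENS_PER_CHUNK = 250
EMBED_PRICE_PER_1K_TOKENS = 0.005    # assumed Babbage-V1-era embedding price

total_tokens = CHUNKS * TOKENS_PER_CHUNK                       # 25B tokens
one_time_embedding_cost = total_tokens / 1000 * EMBED_PRICE_PER_1K_TOKENS

QUERIES_PER_MONTH = 100_000_000
COST_PER_QUERY = 0.0025              # assumed floor: query embedding + generation

monthly_query_cost = QUERIES_PER_MONTH * COST_PER_QUERY

print(f"one-time embedding cost: ${one_time_embedding_cost:,.0f}")  # $125,000
print(f"monthly query cost:      ${monthly_query_cost:,.0f}")       # $250,000
```

Note that the one-time embedding cost recurs in full with every model change, which is why the re-indexing caveat above is a cost problem and not just an engineering one.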
The Q&A Phase: You need the exact same LLM for the question embedding that was used while indexing the text chunks. The LLM cannot be modified after indexing: any training or tuning will make the search process unusable, because the ANN index over the KEYs may no longer be consistent. If you want to update or change the LLM, you need to re-index.
- Caution: The query latency is the sum of three latencies: the latency of embedding the question text, the Vector DB retrieval latency, and the GenAI text response generation latency. If you use several managed services and different micro-services, be prepared to wait at least hundreds of milliseconds before getting an answer. Clearly, this is prohibitively slow for search engines, e-commerce, and other latency-critical applications where more than 100ms of latency leads to poor user experience. Here is an Amazon blog on how every 100ms of delay costs 1% in sales.
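The additive nature of these stages can be simulated directly. The stage durations below are illustrative assumptions, not measurements of any vendor's service:

```python
# Sketch: sequential service calls mean latencies add, they don't overlap.
import time

def call_service(latency_ms):
    # Stand-in for a blocking round trip to a managed service (assumed latency).
    time.sleep(latency_ms / 1000)

stages = {
    "embed the question":      40,   # assumed ms
    "vector DB ANN retrieval": 30,   # assumed ms
    "LLM answer generation":   60,   # assumed ms
}

t0 = time.perf_counter()
for latency_ms in stages.values():
    call_service(latency_ms)
total_ms = (time.perf_counter() - t0) * 1000

print(f"end-to-end: ~{total_ms:.0f} ms")  # at least 130 ms, before network overhead
```

Even with these optimistic per-stage numbers, the floor is the sum of all three; real deployments add network hops and queueing on top of each stage.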
- Cost: As discussed in the previous section, the cost per query can be significant and is locked in once your data lives on external managed services.
The Known Fundamental Limits of Embed and Vector Search: Why Modern Information Retrieval Wisdom Advocates for Learning to Index?
In the final part (Part 3/3), we will discuss ThirdAI’s production-ready Neural Database APIs and their integration with Langchain and ChatGPT. Our solution sidesteps the embedding process completely, along with the expensive, slow, and rigid limitations of vector database retrieval, and we can’t wait to share what we’ve built!