
Neural Databases: A Next Generation Context Retrieval System for Building Specialized AI-Agents with ChatGPT — (Part 2/3)

This post is a continuation of the blog “Understanding the Fundamental Limitations of Vector-Based Retrieval for Building LLM-Powered AI-Agents— (Part 1/3),” where we highlighted the shortcomings of a decoupled architecture of (1) embedding generation followed by (2) approximate nearest-neighbor (ANN) vector search. We discussed how the cosine similarity between vector embeddings produced by Generative AI models may not be the right metric for retrieving relevant content for prompting. We also highlighted that storing, updating, and maintaining embeddings via vector databases is very expensive in production settings.
In this post, we will discuss how modern neural databases using learning to index offer a significant upgrade over vector databases in mitigating most of the issues associated with embedding and search. We will conclude by offering a glimpse into the neural database technology we are building to solve these problems at ThirdAI, which we will dive into more in the next post.

The Pain of Maintaining, Storing, and Searching Embeddings

To illustrate the engineering challenges, let’s consider the example of building an AI-Agent with the Pubmed 35M dataset, a small repository by industry standards. This dataset consists of approximately 35 million abstracts, translating to around 100 million chunks requiring 100 million embeddings. Assuming an average of 250 tokens per chunk, we make the following observations:
  1. Embeddings are heavyweight objects: Simpler OpenAI embedding models like Ada-002 generate 1536-dimensional embeddings for each text chunk. A text chunk is about 250 tokens (averaging 4 characters per token). Storing 100 million Pubmed chunks would require roughly 600GB just for the embeddings, whereas the complete uncompressed raw text is only 200GB. More accurate LLM models have embedding dimensions exceeding 12,000, which would entail approximately 5.5 TB of storage solely for the embedding vectors (see the back-of-envelope sketch after this list).
  2. Approximate Near-Neighbor (ANN) Search with high-dimensional embeddings is either slow or inaccurate: It has been recognized for over three decades that high-dimensional near-neighbor search, even in approximate form, is fundamentally difficult. Most ANN algorithms, including the popular graph-based HNSW, require heavyweight data structure management to ensure reliable high-speed search. Any ANN expert knows that a search’s relevance and latency depend heavily on the distribution of the vector embeddings, which makes performance quite unpredictable. Moreover, as embedding dimensionality increases, maintaining the ANN index, its search relevance, and its latency only becomes harder.
  3. Updates and Deletions are Problematic with an ANN Index: Most modern vector databases and ANN systems are built on HNSW or other graph traversal algorithms, where the embedding vectors are nodes. Due to the nature of how these graph indexes are constructed, updating nodes based on changes in the document content can be a very slow operation because it requires updating the edges of the graph. Deleting documents can also be slow for the same reason. The dynamic nature of updates to the embeddings can even affect the overall accuracy of retrieval. Thus, incremental updates to the database are very fragile. And rebuilding from scratch is typically too costly.
  4. Retrieval failures are hard to evaluate and fix: When a given text query fails to retrieve relevant grounded context and instead provides unrelated or garbage text, there can be three reasons for this failure: a) The relevant text chunk does not exist in the database, b) The embeddings are of poor quality and hence couldn’t match two relevant texts using cosine similarity, c) The embeddings were fine, but due to the distribution of embeddings, the approximate near-neighbor algorithm couldn’t retrieve the correct embedding. While reason (a) is acceptable because the question seems irrelevant to the dataset, distinguishing between reasons (b) and (c) can be a tedious debugging process. Furthermore, we have no control over ANN search, and refining the embeddings may not resolve the issue. Thus, even after identifying the problem, we may be unable to fix it.
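To make the storage arithmetic in point (1) concrete, here is a minimal back-of-envelope sketch. The chunk count and float32 precision are the assumptions stated above, and 12,288 dimensions is just one illustrative value above 12,000:

```python
# Back-of-envelope storage estimate for dense embeddings (illustrative only).
NUM_CHUNKS = 100_000_000   # ~100M text chunks from Pubmed 35M
BYTES_PER_FLOAT = 4        # float32

def embedding_storage_gb(dim: int) -> float:
    """GB needed to store one dense `dim`-dimensional embedding per chunk."""
    return NUM_CHUNKS * dim * BYTES_PER_FLOAT / 1e9

print(f"1536-dim embeddings : ~{embedding_storage_gb(1536):,.0f} GB")   # ~614 GB
print(f"12288-dim embeddings: ~{embedding_storage_gb(12288):,.0f} GB")  # ~4,915 GB, i.e. several TB
```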

The Infamous Curse of Dimensionality: ANN search over a large number of high-dimensional vectors is fundamentally hard and unpredictable. Avoid the whole process if you can.

Continually Adaptive Domain-Specific Retrieval Systems: Embedding-Free Neural Databases

It turns out that there is a simple AI system that can be trained end-to-end without the need for expensive, heavy, and complex high-dimensional embeddings. The key idea is to bypass the embedding process entirely and treat retrieval as a neural prediction problem that can be learned end-to-end: a neural network directly maps a given query text to the relevant text chunks, with only simple data structures (hash tables) needed for efficiency. Numerous papers exploring these ideas are published each year at conferences such as ICML, NeurIPS, and ICLR. Our design is a simplified version of a 2019 NeurIPS paper, with follow-up research presented at ICLR and KDD.
A neural database also involves two phases, which are described below.
Training and Insertion (or Indexing) Stage: The forward workflow of the system is shown in the following Figure.

The system uses a large neural network to map text to discrete keys, which serve as memory locations. These predicted keys act as buckets for insertion and later retrieval of relevant text chunks. Essentially, it’s a good old hash map where the hash function is a large neural network trained to predict the pointers. To train the network, we need “semantically relevant” pairs of texts and a standard cross-entropy loss. For more detailed information, please refer to the theoretical and experimental comparisons presented in the 2019 NeurIPS paper and the subsequent KDD 2022 paper. Mathematically, it can be shown that the size of the model scales logarithmically with the number of text chunks, resulting in an exponential improvement in both running time and memory. No embedding management is required in this approach.
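For intuition, here is a minimal PyTorch sketch of this training-and-insertion workflow. The architecture, bucket count, featurization, and function names are illustrative assumptions, not ThirdAI’s actual implementation or API:

```python
from collections import defaultdict

import torch
import torch.nn as nn

NUM_BUCKETS = 50_000  # number of discrete keys (hash buckets); an assumed value

class LearnedHashFunction(nn.Module):
    """A neural network that maps tokenized text to a distribution over buckets."""

    def __init__(self, vocab_size: int, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.EmbeddingBag(vocab_size, hidden_dim)  # bag-of-tokens encoder
        self.head = nn.Linear(hidden_dim, NUM_BUCKETS)

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(token_ids, offsets))  # logits over buckets

def train_step(model, optimizer, query_tokens, query_offsets, target_buckets):
    """One step on 'semantically relevant' pairs: the label for a query is the
    bucket of its relevant chunk, optimized with standard cross-entropy."""
    logits = model(query_tokens, query_offsets)
    loss = nn.functional.cross_entropy(logits, target_buckets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Insertion: the index itself is just a hash map from bucket id -> ChunkIDs.
index: dict[int, list[int]] = defaultdict(list)

def insert_chunk(model, chunk_id: int, chunk_tokens, chunk_offsets) -> None:
    """Predict the chunk's bucket with the trained network and store its id there."""
    with torch.no_grad():
        bucket = int(model(chunk_tokens, chunk_offsets).argmax(dim=-1))
    index[bucket].append(chunk_id)
```

The only state that has to be stored is the network’s parameters and the bucket-to-ChunkID tables; no per-chunk embedding vectors are kept.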

The Query (or Retrieval) Stage: The retrieval phase is equally simple and is illustrated in the next Figure.
Given a question, we use the trained neural network classifier to compute the probabilities of the top few buckets. We then gather all the ChunkIDs stored in those top buckets, aggregate their relevance scores with respect to the question, and sort them to return a small ranked list of candidate text chunks. These text chunks are then used as prompt context for the Generative AI model to produce the final grounded response.
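Continuing the same illustrative sketch (the trained model and the bucket-to-ChunkID index from the indexing stage above), retrieval is a single forward pass followed by hash-table lookups and a small sort:

```python
from collections import defaultdict

import torch

def retrieve(model, index, query_tokens, query_offsets,
             top_buckets: int = 10, top_chunks: int = 5) -> list[int]:
    """Return a small ranked list of candidate ChunkIDs for the query."""
    with torch.no_grad():
        probs = torch.softmax(model(query_tokens, query_offsets), dim=-1).squeeze(0)
    bucket_scores, bucket_ids = probs.topk(top_buckets)

    # Accumulate a relevance score for every ChunkID found in the top buckets.
    chunk_scores: dict[int, float] = defaultdict(float)
    for score, bucket in zip(bucket_scores.tolist(), bucket_ids.tolist()):
        for chunk_id in index.get(bucket, []):
            chunk_scores[chunk_id] += score

    # Sort the small candidate set and return the top-ranked ChunkIDs; their text
    # is then placed into the LLM prompt as grounded context.
    ranked = sorted(chunk_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_chunks]]
```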

Major Advantages of Neural Databases over Embeddings and ANN

We illustrate the advantages of neural databases through the same Pubmed 35M AI-Agent application.
  • No Embeddings, Leading to Exponential Compression: The only additional memory our approach needs is for storing the parameters of the neural network. We found that a 2.5 billion parameter neural network is sufficient to train and index the complete Pubmed 35M dataset. The training was purely self-supervised, i.e., we did not need any labeled samples. Even with all the overheads, the complete index requires less than 20GB of storage. In comparison, storing 1536-dimensional embeddings in a vector database required at least 600GB. This is no surprise: with embedding models, compute and memory scale linearly with the number of chunks, whereas our neural database scales only logarithmically with the number of chunks, as proved in our NeurIPS paper.
  • Manage Insertions and Deletions like a Traditional DB: Unlike a graph-based near-neighbor index, a neural database uses simple KEY-VALUE hash tables, where insertion, deletion, parallelization, sharding, etc., are straightforward and very well understood (see the sketch after this list).
  • Ultra-fast Inference and Significant Reduction in Cost: Retrieval latency consists only of a neural network inference followed by a hash table lookup, and finally a simple weighted aggregation and sort over a handful of candidate chunks. You will likely see 10–100x faster retrieval compared to embeddings and vector databases. Furthermore, with ThirdAI’s groundbreaking sparse neural network training algorithms, we can train and deploy these models on ordinary CPUs.
  • Continual Learning with Incrementally Teachable Indexing: The neural index can be trained using any pair of texts that are similar in semantic meaning. This means the retrieval system can be continually trained to specialize for any desirable task or domain. Getting text pairs for training is not hard: they can be generated easily in a self-supervised manner, and they arise naturally in any production system with user interaction.
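To make the update and continual-learning points above concrete, here is the last piece of the same illustrative sketch; again, the function names and training details are assumptions, not ThirdAI’s API:

```python
import torch.nn as nn

def delete_chunk(index: dict[int, list[int]], chunk_id: int) -> None:
    """Deletion is a plain hash-table operation: drop the id wherever it appears.
    (A production system would keep a reverse ChunkID -> bucket map for O(1) deletes.)"""
    for bucket_chunks in index.values():
        if chunk_id in bucket_chunks:
            bucket_chunks.remove(chunk_id)

def update_chunk(model, index, chunk_id: int, new_tokens, new_offsets) -> None:
    """Update = delete the stale id, then re-insert it under the new text's bucket."""
    delete_chunk(index, chunk_id)
    insert_chunk(model, chunk_id, new_tokens, new_offsets)  # from the indexing sketch

def continual_train(model, optimizer, pairs):
    """Fine-tune on semantically related text pairs (e.g., user queries and the
    chunks they clicked). The label for each query is the bucket where its related
    chunk is currently indexed."""
    for query_tokens, query_offsets, target_bucket in pairs:
        loss = nn.functional.cross_entropy(model(query_tokens, query_offsets), target_bucket)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```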

The ThirdAI Difference

In the next and final blog post in this series (Part 3/3), we will discuss ThirdAI’s neural database ecosystem and how, with “dynamic sparsity,” we can tame the elephant, the LLMs, to run in any data processing system, whether on-cloud or on-premises. We will also walk through our set of simple auto-tuned Python APIs, which enable you to harness the power of next-generation learning-to-index on your device. Additionally, we will explain how you can create a grounded Pubmed Q&A AI-Agent using ordinary CPUs and just a few lines of Python code, all while maintaining privacy in an air-gapped environment (no Internet required). As demonstrated in the previous post, building such an AI-Agent with the standard OpenAI embedding and vector database ecosystem would typically cost hundreds of thousands of dollars. You can get all of this for essentially no cost on your personal device with ThirdAI.