The Pain of Maintaining, Storing, and Searching Embeddings
- Embeddings are heavy objects: Even simpler OpenAI models such as Ada-002 generate roughly 1,500-dimensional embeddings for each text chunk. A typical text chunk is about 250 tokens (averaging 4 characters per token). Storing 100 million PubMed chunks would require roughly 600GB for the embeddings alone. In comparison, the complete uncompressed raw text is only 200GB. More accurate LLMs have embedding dimensions exceeding 12,000, which would entail approximately 5.5 TB of storage solely for the embedding vectors.
- Approximate Near-Neighbor (ANN) search with high-dimensional embeddings is either slow or inaccurate: It has been recognized for over three decades that high-dimensional near-neighbor search, even in approximate form, is fundamentally difficult. Most ANN algorithms, including the popular graph-based HNSW, require heavyweight data structure management to ensure reliable high-speed search. Any ANN expert knows that a search’s relevance and performance depend heavily on the distribution of the vector embeddings, which makes both quite unpredictable. Moreover, as embedding dimensions increase, maintaining the ANN index while preserving search relevance and latency will likely become significantly harder.
- Updates and Deletions are Problematic with an ANN Index: Most modern vector databases and ANN systems are built on HNSW or other graph traversal algorithms, where the embedding vectors are nodes. Because of how these graph indexes are constructed, updating a node after a document’s content changes can be very slow, since the edges of the graph must be updated as well. Deleting documents can be slow for the same reason. A stream of such updates can even degrade the overall accuracy of retrieval. Thus, incremental updates to the database are very fragile, and rebuilding from scratch is typically too costly.
- Retrieval failures are hard to evaluate and fix: When a given text query fails to retrieve relevant grounded context and instead returns unrelated or garbage text, there can be three reasons for the failure: a) The relevant text chunk does not exist in the database, b) The embeddings are of poor quality and hence couldn’t match two relevant texts using cosine similarity, c) The embeddings were fine, but due to the distribution of embeddings, the approximate near-neighbor algorithm couldn’t retrieve the correct embedding. While reason (a) is acceptable because the question is simply outside the scope of the dataset, distinguishing between reasons (b) and (c) can be a tedious debugging process. Furthermore, we have no control over ANN search, and refining the embeddings may not resolve the issue. Thus, even after identifying the problem, we may be unable to fix it.
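The storage figures in the first bullet above can be checked with back-of-the-envelope arithmetic. The sketch below assumes 4-byte float32 values and counts only the raw vectors, with no index overhead; the 12,288-dimension value is an assumed example of a model "exceeding 12,000" dimensions, and the function name is made up for illustration. The results land in the same range as the figures quoted above; exact totals depend on the precise dimension and on index overhead.

```python
def embedding_storage_bytes(num_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense embedding matrix, excluding any ANN index overhead."""
    return num_chunks * dims * bytes_per_value


NUM_CHUNKS = 100_000_000  # 100 million chunks, as in the text

# text-embedding-ada-002 produces 1536-dimensional vectors.
ada_bytes = embedding_storage_bytes(NUM_CHUNKS, 1536)

# Assumed illustrative dimension for a larger model (not taken from the text).
large_bytes = embedding_storage_bytes(NUM_CHUNKS, 12_288)

print(f"~1,500-dim embeddings: {ada_bytes / 1e9:.1f} GB")
print(f">12,000-dim embeddings: {large_bytes / 1e12:.2f} TB")
```

Halving the precision to float16 would halve both numbers, but storage would still grow linearly with the number of chunks.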
Continually Adaptive Domain-Specific Retrieval Systems: Embedding-Free Neural Databases
The system uses large neural networks to map text to discrete keys that act as memory locations. These predicted keys serve as buckets for insertion and later retrieval of relevant text chunks. Essentially, it’s a good old hash map where the hash function is a large neural network trained to predict the pointers. To train the network, we need “semantically relevant” pairs of texts and a standard cross-entropy loss. For more detailed information, please refer to the theoretical and experimental comparisons presented in the 2019 NeurIPS paper and the subsequent KDD 2022 paper. Mathematically, it can be shown that the size of the model scales logarithmically with the number of text chunks, resulting in an exponential improvement in both running time and memory. No embedding management is required in this approach.
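The hash-map view described above can be sketched as follows. This is a minimal illustration under stated assumptions, not ThirdAI's implementation: `predict_buckets` is a hypothetical stand-in that hashes words instead of running a trained network, and `NUM_BUCKETS`, `TOP_K`, and `LearnedHashIndex` are names invented for the example.

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 1024  # hypothetical number of discrete keys
TOP_K = 8           # hypothetical number of buckets predicted per text


def predict_buckets(text: str, top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Stand-in for the trained network: returns (bucket_id, score) pairs.

    A real system would run a neural classifier over NUM_BUCKETS classes.
    Here each word deterministically votes for one bucket so the sketch
    runs without a model.
    """
    votes = defaultdict(float)
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        votes[int.from_bytes(digest[:4], "big") % NUM_BUCKETS] += 1.0
    total = sum(votes.values()) or 1.0
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return [(bucket, score / total) for bucket, score in ranked]


class LearnedHashIndex:
    """Plain key-value buckets addressed by predicted keys."""

    def __init__(self) -> None:
        self.buckets = defaultdict(set)  # bucket_id -> set of chunk ids
        self.assignments = {}            # chunk_id -> bucket ids it was stored under

    def insert(self, chunk_id: str, text: str) -> None:
        keys = [bucket for bucket, _ in predict_buckets(text)]
        self.assignments[chunk_id] = keys
        for bucket in keys:
            self.buckets[bucket].add(chunk_id)

    def delete(self, chunk_id: str) -> None:
        # Deletion only touches the buckets recorded at insertion time.
        for bucket in self.assignments.pop(chunk_id, []):
            self.buckets[bucket].discard(chunk_id)

    def query(self, text: str, limit: int = 3) -> list[tuple[str, float]]:
        # Weighted aggregation: each candidate sums the scores of the
        # predicted buckets in which it is found.
        scores = defaultdict(float)
        for bucket, weight in predict_buckets(text):
            for chunk_id in self.buckets.get(bucket, ()):
                scores[chunk_id] += weight
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


index = LearnedHashIndex()
index.insert("c0", "aspirin reduces fever and pain")
index.insert("c1", "insulin regulates blood glucose")
print(index.query("aspirin fever"))
```

In this toy version, a query finds a chunk only when they share words, because the stand-in predictor is lexical. The point of training the predictor on semantically relevant pairs is that related texts are pushed toward the same buckets even when they share no words.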
Major Advantages of Neural Databases over Embeddings and ANN
- No Embeddings Leading to Exponential Compression: The only additional memory our approach needs is for storing the parameters of the neural network. We found that a 2.5 billion parameter neural network is sufficient to train and index the complete PubMed 35M dataset. The training was purely self-supervised; that is, we didn’t need any labeled samples. Even with all the overheads, the complete index takes less than 20GB of storage. In comparison, storing 1500-dimensional embeddings in a vector database took at least 600GB. There is no surprise here because with embedding models, compute and memory scale linearly with the number of chunks. In contrast, our neural database scales only logarithmically with the number of chunks, as proved in our NeurIPS paper.
- Manage Insertions and Deletions like a Traditional DB: Unlike a graph-based near-neighbor index, a neural database uses simple KEY-VALUE hash tables, where insertion, deletion, parallelization, sharding, etc., are straightforward and very well understood.
- Ultra-fast Inference and Significant Reduction in Cost: The inference latency consists only of running neural network inference, followed by a hash table lookup. After the lookup, only a handful of candidate chunks remain, and they need nothing more than a simple weighted aggregation and sort. You will likely see 10–100x faster retrieval compared to embeddings and vector databases. Furthermore, with ThirdAI’s groundbreaking sparse neural network training algorithms, we can train and deploy these models on ordinary CPUs.
- Continual Learning with Incrementally Teachable Indexing: The neural index can be trained using any pair of texts that are similar in semantic meaning. This implies that the retrieval system can be continually trained to specialize for any desirable task or domain. Getting text pairs for training is not hard. First, they can be easily generated in a self-supervised manner. Second, they are naturally available in any production system with user interaction.
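The two sources of training pairs mentioned in the last bullet could look like this in practice. The specific recipe (sampling a short span from a chunk as a pseudo-query) and the click-log format are assumptions made for illustration; the text does not specify how the pairs are actually generated, and all function and field names here are hypothetical.

```python
import random


def self_supervised_pairs(chunks: dict[str, str], span_words: int = 4, seed: int = 0):
    """Pair a random contiguous span of each chunk (a pseudo-query) with the chunk.

    Assumed recipe: a span taken from a chunk is treated as semantically
    relevant to that chunk, so no human labels are needed.
    """
    rng = random.Random(seed)
    pairs = []
    for text in chunks.values():
        words = text.split()
        if len(words) <= span_words:
            continue  # too short to yield a span that differs from the chunk
        start = rng.randrange(len(words) - span_words + 1)
        pairs.append((" ".join(words[start:start + span_words]), text))
    return pairs


def interaction_pairs(click_log: list[dict], chunks: dict[str, str]):
    """Turn logged user interactions into (query, clicked chunk text) pairs.

    Assumes each log row has 'query', 'chunk_id', and a boolean 'clicked'.
    """
    return [
        (row["query"], chunks[row["chunk_id"]])
        for row in click_log
        if row["clicked"] and row["chunk_id"] in chunks
    ]


chunks = {
    "c0": "aspirin reduces fever and mild pain in adults",
    "c1": "insulin regulates blood glucose levels after meals",
}
log = [
    {"query": "what lowers a temperature", "chunk_id": "c0", "clicked": True},
    {"query": "what lowers a temperature", "chunk_id": "c1", "clicked": False},
]

training_pairs = self_supervised_pairs(chunks) + interaction_pairs(log, chunks)
for query_text, target_text in training_pairs:
    print(repr(query_text), "->", repr(target_text))
```

Either kind of pair can be fed to the same cross-entropy training step, which is what would let an index keep specializing as new interactions arrive.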