From the course: Fundamentals of AI Engineering: Principles and Practical Applications
Scaling strategies (caching)
- [Narrator] Hi, everyone. Welcome back, and welcome to our session on caching in vector databases. To get started, open chapter_5 and open the file called 05_05.ipynb. As always, in the upper right-hand corner, make sure that the virtual environment you've selected is the .venv virtual environment. We've discussed caching a few times in this course, and that's for good reason. As you build production AI systems, the importance of caching is paramount. Caching is an optimization technique that lets us significantly reduce latency and computational load on our critical systems by storing frequently accessed results.

Now, why should we actually do this? There are three concrete, immediate reasons that come to mind. First, again, reduced latency: cached results can be returned instantly, without computing embeddings or searching the vector space. When you're building AI applications and the user experience is paramount, minimizing latency is pivotal. Second, lower computational costs: with caching, fewer embedding calculations are required, which means lower GPU and CPU utilization, which means lower costs for you. Third, better scalability: we can handle more queries with the same resources, assuming we have a cache that prevents us from recomputing things we already have.

Today, we're going to implement a simple LRU cache and measure its performance impact in our ecosystem. To get started, let's install the necessary libraries. In my case, these are already installed, but for you, they may need to be pulled, and that can take a while. Next, let's import our necessary libraries and start looking at a basic implementation of an LRU cache. Now, you've already seen these. LRU stands for Least Recently Used. This type of cache keeps track of which queries are used most frequently and evicts the least recently used entries when the cache is full.

Let's look at my implementation pretty quickly. First, it's a class named LRUCache that has a default capacity of 100 if not overridden. The cache itself is a dictionary, and there's a usage order that keeps track of what was accessed and when. The get method checks whether the key is in the cache; if so, it updates the usage order and returns the value, and if not, it reports a miss. The put method either updates an existing entry or adds a new one, evicting the least recently used item if necessary. Clear clears out the cache.

Great. Now, let's actually get to using this cache. We're going to create a Chroma client, an in-memory client in this case, and a collection named cache_test. Next, we're going to take 1,000 documents and insert all of them into the collection. Each generated document reads, "This is a sample document [index number] with various content for testing caching." Great. We've added these 1,000 documents to the collection, and now we can look at the effect of our cache on this collection. Let's first create an LRU cache with a capacity of 50. We're then also going to have a function to query with caching. Now, there's some interesting usage here, thinking about when we do and don't want to use the cache, and collecting information about cache misses and everything in that realm. But I'm going to just run this for now to show you what happens when we use this cache.
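To make this concrete, here is a minimal sketch of the kind of LRU cache and cached-query helper being described. It is not the exact code from the notebook; the helper name query_with_cache, the hit/miss counters, and returning None on a miss are assumptions on my part. It does capture the structure described in the video: a dictionary plus a usage-order list, an in-memory Chroma collection of 1,000 sample documents, and a query function that checks the cache before touching the collection.

```python
import chromadb

class LRUCache:
    """A minimal least-recently-used cache: a dict plus a usage-order list."""
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.cache = {}
        self.usage_order = []  # least recently used key sits at the front

    def get(self, key):
        if key not in self.cache:
            return None  # cache miss
        self.usage_order.remove(key)   # mark the key as most recently used
        self.usage_order.append(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.usage_order.remove(key)
        elif len(self.cache) >= self.capacity:
            oldest = self.usage_order.pop(0)  # evict the least recently used entry
            del self.cache[oldest]
        self.cache[key] = value
        self.usage_order.append(key)

    def clear(self):
        self.cache.clear()
        self.usage_order.clear()

# In-memory Chroma client and a test collection with 1,000 sample documents.
client = chromadb.Client()
collection = client.create_collection("cache_test")
collection.add(
    ids=[str(i) for i in range(1000)],
    documents=[f"This is a sample document {i} with various content for testing caching"
               for i in range(1000)],
)

query_cache = LRUCache(capacity=50)
cache_hits = 0
cache_misses = 0

def query_with_cache(query_text, n_results=3):
    """Return cached results when available; otherwise query Chroma and cache the answer."""
    global cache_hits, cache_misses
    key = (query_text, n_results)
    cached = query_cache.get(key)
    if cached is not None:
        cache_hits += 1
        return cached
    cache_misses += 1
    results = collection.query(query_texts=[query_text], n_results=n_results)
    query_cache.put(key, results)
    return results
```

Using a (query text, result count) tuple as the cache key means the same question asked twice only hits the vector index once.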
Now, to simulate a realistic workload, we're going to create a list of common (meaning frequently repeated) and unique queries. First, we have queries that say things like "document with content," "testing caching," "sample document," "various content," and so on. I'm going to run this cell and get the results of my actual testing. First, let's measure the difference in performance between running queries without a cache and with a cache. So in this situation, we're going to run all of our queries without a cache. Great. Now we're going to run our queries with a cache, and we're going to report our results. Before I even run this, intuitively, you would assume that using the cache saves resources and is faster. And by running this, we can see that that's absolutely the case. Without the cache, it took 2.13 seconds. With the cache, it took 0.5 seconds. That saved us about 74%, or 1.5 seconds, and we had a 73.9% cache hit rate with a cache size of 47. What this should tell you is that we were able to more than double our speed by using something like caching in a production setting. Of course, Codespaces is not production, but it gives you an understanding of, and exposure to, what this may look like and the results you may experience when working in production settings.

Now, let's talk a little bit about advanced scaling strategies. For many of you, this may be the first time you're working in AI ecosystems and AI systems as a whole. For those of you with a bit more experience, I would be remiss if I didn't mention some of these. First, as is the case with most databases, it's very likely that your vector database will need to grow beyond the capacity of a single machine. When that happens, you'll need to implement horizontal scaling strategies. Here are some common approaches. First, sharding: you can partition your vector space across multiple instances, either by ID range or by vector clustering. Second, replication: you can create copies of your data across multiple instances. These can be read replicas, and if you've played with technologies like Cassandra, you know replication becomes pretty critical. Third, you can use hybrid approaches, combining sharding and replication. In this case, we can create a ChromaDB cluster with data sharded across nodes and each node replicated.

As we continue going to production and thinking about this space, resource management becomes paramount. First, memory optimization will become incredibly important as you're building and exploring in this space. You don't have to know how to use techniques like quantization just yet, but I want you to be exposed to the word quantization. Quantization reduces vector precision, generally from 32-bit or 64-bit floats down to 8-bit integers, which saves a great deal of memory; a small sketch of the idea follows below. Second, we can implement disk-based storage for the least frequently accessed vectors. Not every vector needs to be stored in memory or in a dictionary like we had; we can store these vectors on disk. As we're managing resources, CPU utilization becomes critical. Batching similar operations and using asynchronous processing where possible should be top of mind; we never want to block the user waiting for an operation to return, especially a computationally expensive one. Third, network efficiency: we should do our best to minimize data transfer between components when possible. This involves strategies like compression and ensuring that related data is co-located.
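To give the word quantization a bit of shape, here is a minimal sketch of scalar quantization, assuming NumPy is available. The function names and the per-vector scaling scheme are illustrative only; they are not part of the course code or of ChromaDB's API.

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Scalar-quantize float32 vectors to int8, returning the codes and a per-vector scale."""
    # Scale each vector so its largest absolute component maps to 127.
    scales = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero vectors
    codes = np.round(vectors / scales).astype(np.int8)
    return codes, scales

def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Approximate reconstruction of the original float32 vectors."""
    return codes.astype(np.float32) * scales

# Example: 1,000 random 384-dimensional embeddings.
embeddings = np.random.rand(1000, 384).astype(np.float32)
codes, scales = quantize_int8(embeddings)
print(embeddings.nbytes, codes.nbytes)  # 1,536,000 bytes vs 384,000 bytes
```

The int8 codes take roughly a quarter of the memory of the float32 originals, at the cost of a small loss in precision.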
To wrap things up, I want to talk about some real-world implementation considerations. So far, we've built a lot in Codespaces, but done fairly little in terms of monitoring and observability. First, as you're working in this space, it's important to track metrics like latency, throughput, and error rates, and to set up alerts, especially for performance degradation. Second, you should also be monitoring failures: how do you manage, handle, and notify on them? Third, update strategies become very important. Submitting updates in batches to reduce how often the index has to be rebuilt is a critical piece of this process, and it's an approach being adopted by more and more systems; you'll find a rough sketch of it at the end of this section. Finally, we have hybrid approaches: very soon, as you'll see, we'll combine vector search with keyword search for better results.

So, so far, we were able to use techniques like caching to achieve significant latency savings, and we were able to show that caching is most effective when query patterns show temporal locality. There are a number of advanced scaling strategies as well that we can get into in subsequent videos or subsequent courses. But on the whole, the importance of caching should be clear as you begin your investigations. Thank you so much.
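As mentioned above, here is a rough sketch of the batch-update idea, assuming the same ChromaDB setup used earlier in the notebook. The helper name add_documents_in_batches and the batch size of 200 are illustrative assumptions, not anything prescribed by the course or by Chroma.

```python
import chromadb

def add_documents_in_batches(collection, ids, documents, batch_size=200):
    """Submit updates in batches so the index is touched once per batch,
    not once per document."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(ids=ids[start:end], documents=documents[start:end])

# Usage: 1,000 new documents go in as 5 calls instead of 1,000.
client = chromadb.Client()
collection = client.get_or_create_collection("cache_test")
new_ids = [f"new_{i}" for i in range(1000)]
new_docs = [f"Updated document {i} for batch testing" for i in range(1000)]
add_documents_in_batches(collection, new_ids, new_docs)
```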