Imagine that once an LLM has processed a document, its "knowledge" could be instantly shared with other LLM instances. Today, however, LLMs must read the same long document multiple times, which causes a significant slowdown. We introduce a new Knowledge Delivery Network that enables LLMs to efficiently share their digested knowledge, in the form of KV caches, so that only one LLM instance needs to process (prefill) each document. The key challenge is how to store the KV caches cheaply and serve them quickly. Instead of keeping the KV caches of all reusable chunks in GPU or CPU memory, we show that, with careful design and implementation, storing them on cheaper devices is not only economically superior but also significantly reduces LLM serving delay, especially the time to first token.
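As a rough illustration of the sharing idea, the sketch below keys each document chunk by a content hash and runs prefill only on a cache miss, so a second instance asking for the same document reuses the stored result. The names `KVCacheStore`, `get_or_prefill`, and `fake_prefill` are hypothetical, and an in-memory dict stands in for the cheaper storage tier; this is not the system's actual interface.

```python
import hashlib
from typing import Callable, Dict


class KVCacheStore:
    """Toy content-addressed store for KV caches (hypothetical API)."""

    def __init__(self) -> None:
        # Assumption: a plain dict stands in for the cheaper storage device.
        self._blobs: Dict[str, bytes] = {}
        self.prefill_calls = 0

    @staticmethod
    def chunk_key(text: str) -> str:
        # Identical text always maps to the same key, whichever instance asks.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_prefill(self, text: str, prefill: Callable[[str], bytes]) -> bytes:
        key = self.chunk_key(text)
        if key not in self._blobs:
            # Cache miss: pay the prefill cost once and store the result.
            self.prefill_calls += 1
            self._blobs[key] = prefill(text)
        return self._blobs[key]


def fake_prefill(text: str) -> bytes:
    # Placeholder for a real prefill pass that would produce KV tensors.
    return text.upper().encode("utf-8")


store = KVCacheStore()
doc = "a long shared document"
kv_a = store.get_or_prefill(doc, fake_prefill)  # first instance: runs prefill
kv_b = store.get_or_prefill(doc, fake_prefill)  # second instance: cache hit
```

A real deployment would also have to handle serialization, eviction, and transfer speed, which this sketch deliberately leaves out.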
ChameleonAPI provides application developers with a parser that automatically analyzes the application to produce an abstract of its decision process, which is then used to devise an application-specific loss function that penalizes only the API output errors critical to the application. ChameleonAPI uses the loss function to efficiently train a neural network model customized for each application and deploys it to serve API invocations from the respective application via the existing interface. Compared to a baseline that selects the best-of-all commercial ML API, ChameleonAPI reduces incorrect application decisions by 43%.
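To make the idea of a decision-aware loss concrete, here is a minimal sketch under stated assumptions. The application is assumed to branch only on whether any predicted label falls into one of a few label groups, and the loss is a simple 0/1 penalty rather than the differentiable loss that actual training would need. `BRANCHES`, `app_decision`, and `decision_loss` are hypothetical names, not ChameleonAPI's interface.

```python
from typing import Dict, Set

# Assumption: the application's decision process, reduced to label groups.
BRANCHES: Dict[str, Set[str]] = {
    "recycle": {"bottle", "can", "paper"},
    "compost": {"banana", "apple"},
}


def app_decision(labels: Set[str]) -> str:
    """Return the branch the application would take for these labels."""
    for branch, group in BRANCHES.items():
        if labels & group:
            return branch
    return "other"


def decision_loss(predicted: Set[str], truth: Set[str]) -> float:
    """Penalize a prediction only if it changes the application's decision."""
    return 0.0 if app_decision(predicted) == app_decision(truth) else 1.0
```

Under this loss, predicting "can" for a "bottle" costs nothing because both lead to the same branch, while predicting "banana" for a "bottle" is penalized because the application would act differently.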
The Sky Computing Lab represents the next chapter of data-intensive systems research at Berkeley. Recent years have seen the explosion of cloud computing. Applications are moving their data and computation to the cloud; on-premise services are dying. In doing so, companies have to make difficult choices among the myriad of cloud providers, each with different services or hardware. Lock-in, whether through artificial migration costs, legal constraints, or engineering baggage, is real. In the Sky Computing Lab, we will leverage distributed systems, programming languages, security, and machine learning to decouple the services that a company wants to implement from the choice of a specific cloud.
Congratulations, Kuntai!