Latest News | Junchen's Lab

Hanchen presented the CacheGen paper at SIGCOMM'24.

CacheGen is an LLM KV cache compression and streaming system that reduces storage size and transfer time. By using delta compression and information theory, CacheGen further compresses the KV cache by up to 4.3x on top of previous machine learning techniques, with the option to incur no additional loss.
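To give a feel for the delta-compression idea, here is a minimal sketch. It is not CacheGen's actual codec. It assumes that values for nearby tokens are similar, so storing differences from an anchor yields small numbers that an entropy coder could pack into fewer bits. The function names, the group size, and the sample values are all hypothetical.

```python
def delta_encode(values, group_size):
    """Split values into groups; keep each group's first value as an
    anchor and store the remaining values as differences from it."""
    encoded = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        anchor = group[0]
        encoded.append((anchor, [v - anchor for v in group[1:]]))
    return encoded


def delta_decode(encoded):
    """Invert delta_encode by adding each delta back to its anchor."""
    values = []
    for anchor, deltas in encoded:
        values.append(anchor)
        values.extend(anchor + d for d in deltas)
    return values


# Hypothetical quantized KV values for one channel over 8 consecutive tokens.
channel = [100, 101, 99, 102, 40, 41, 39, 42]

encoded = delta_encode(channel, group_size=4)
decoded = delta_decode(encoded)

# The deltas span a much smaller range than the raw values.
max_delta = max(abs(d) for _, deltas in encoded for d in deltas)
```

Because the sample values are integers, the round trip is exact; a real system would pair a step like this with quantization and entropy coding.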

Link to a pre-recorded version:

Aug 4, 2024 1 min read

Junchen's talk on Knowledge-Delivery Networks at ByteDance and APNet'24.

Imagine if, once an LLM has processed a document, its "knowledge" could be instantly shared with other LLM instances. Unfortunately, today, LLMs must read the same long document multiple times, causing a significant slowdown. We introduce a new Knowledge Delivery Network that enables LLMs to efficiently share their digested knowledge, in the form of KV caches, so only one LLM instance needs to process (prefill) each document. The key challenge is how to store the KV caches cheaply and serve them quickly. Instead of keeping the KV caches of all reusable chunks in GPU or CPU memory, we show that with careful design and implementation, storing them on cheaper devices is not only economically superior but also delivers significant reductions in LLM serving delay, especially the time to the first token.
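The sharing idea can be sketched as a store keyed by document content, so that prefill runs once per document and later requests reuse the result. This is only an illustration under stated assumptions: the `KVCacheStore` class, the `fake_prefill` stand-in, and the in-memory dictionary are hypothetical and do not reflect the talk's actual design, which targets cheaper storage devices.

```python
import hashlib


class KVCacheStore:
    """Hypothetical shared store mapping a document hash to its KV cache."""

    def __init__(self):
        self._store = {}
        self.prefill_calls = 0  # counts how often prefill actually ran

    @staticmethod
    def _key(text):
        # Identify a document by the hash of its content.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_prefill(self, text, prefill_fn):
        """Return the cached KV data, running prefill only on a miss."""
        key = self._key(text)
        if key not in self._store:
            self._store[key] = prefill_fn(text)
            self.prefill_calls += 1
        return self._store[key]


def fake_prefill(text):
    # Placeholder for a real model's prefill; returns dummy per-word data.
    return [len(word) for word in text.split()]


store = KVCacheStore()
doc = "a long shared document"

kv_a = store.get_or_prefill(doc, fake_prefill)  # first instance: prefill runs
kv_b = store.get_or_prefill(doc, fake_prefill)  # second instance: cache hit
```

In this sketch the second lookup skips prefill entirely, which is the source of the time-to-first-token savings the talk describes.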

Slides link : Download

Aug 4, 2024 1 min read

Yuhan Liu presented ChameleonAPI at OSDI'24.

ChameleonAPI provides application developers with a parser that automatically analyzes the application to produce an abstract of its decision process, which is then used to devise an application-specific loss function that only penalizes API output errors critical to the application. ChameleonAPI uses the loss function to efficiently train a neural network model customized for each application and deploys it to serve API invocations from the respective application via the existing interface. Compared to a baseline that selects the best-of-all commercial ML API, ChameleonAPI reduces incorrect application decisions by 43%.
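As a rough illustration of an application-specific loss, the sketch below penalizes a prediction only when its error changes the application's decision. It is a simplified, non-differentiable stand-in and not ChameleonAPI's actual loss. The decision rule, the label sets, and the function names are hypothetical.

```python
def app_decision(labels, critical):
    """Hypothetical application logic: act only if a critical label appears."""
    return any(label in critical for label in labels)


def app_specific_loss(predicted, truth, critical):
    """Return 1 if the prediction flips the application's decision, else 0."""
    return int(app_decision(predicted, critical) != app_decision(truth, critical))


critical = {"dog", "cat"}   # labels this application branches on
truth = {"dog", "grass"}    # ground-truth API output

# Wrong about a label the application ignores: no penalty.
harmless = app_specific_loss({"dog", "tree"}, truth, critical)

# Misses the critical label, so the application decides differently: penalized.
harmful = app_specific_loss({"tree", "grass"}, truth, critical)
```

Training a neural network would need a differentiable surrogate for this 0/1 signal; the sketch only shows which errors count.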

Check out the paper for more details : ChameleonAPI

Jul 10, 2024 1 min read

Zhengxu Xia successfully completes his PhD candidacy exam at UChicago.

Congratulations Zhengxu!

Jun 10, 2024 1 min read

Yihua Cheng starts his internship at Conviva.

Congratulations Yihua!

Jun 4, 2024 1 min read

CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion is out on arXiv.

Check out the paper for more details : Paper

Jun 3, 2024 1 min read

Yuhan Liu starts her internship at Microsoft Research.

Congratulations Yuhan!

May 28, 2024 1 min read

Yuhan Liu successfully completes her MS exam at UChicago.

Congratulations Yuhan!

May 28, 2024 1 min read

Junchen's talk on the LLM Caching Layer at Databricks and Alluxio’s AI/ML Infra meetup.

Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. While better scheduling can mitigate prefill’s impact, it would be fundamentally better to avoid (most of) prefill. This talk introduces our preliminary effort towards drastically reducing prefill delay for LLM inputs that naturally reuse text chunks, such as in retrieval-augmented generation. While keeping the KV cache of all text chunks in memory is difficult, we show that it is possible to store them on cheaper yet slower storage. By improving the loading process of the reused KV caches, we can still significantly reduce prefill delay while maintaining the same generation quality.

Talk link :

Slides link :

AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG

May 22, 2024 1 min read

Kuntai Du starts his internship at Sky Computing Lab at Berkeley.

The Sky Computing Lab represents the next chapter of data-intensive systems research at Berkeley. Recent years have seen the explosion of cloud computing. Applications are moving their data and computation to the cloud; on-premise services are dying. In doing so, companies have to make difficult choices between the myriad of cloud providers, each with different services or hardware. Lock-in, whether through artificial migration costs, legal constraints, or engineering baggage, is real. In the Sky Computing Lab, we will leverage distributed systems, programming languages, security, and machine learning to decouple the services that a company wants to implement from the choice of a specific cloud.

Congratulations Kuntai!

May 10, 2024 1 min read