Junchen's talk on the LLM caching layer at the Databricks and Alluxio AI/ML Infra Meetup.

Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. While better scheduling can mitigate prefill's impact, it would be fundamentally better to avoid (most of) prefill. This talk introduces our preliminary effort towards drastically reducing prefill delay for LLM inputs that naturally reuse text chunks, such as in retrieval-augmented generation. While keeping the KV caches of all text chunks in memory is difficult, we show that it is possible to store them on cheaper yet slower storage. By improving the loading process of the reused KV caches, we can still significantly reduce prefill delay while maintaining the same generation quality.
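
To make the core idea concrete, below is a minimal, hypothetical sketch of reusing KV caches for repeated text chunks: each chunk is content-addressed by a hash, its KV tensors are persisted to cheaper storage (local disk here, standing in for slower storage), and on reuse they are loaded instead of recomputing prefill. This is not the system described in the talk; names such as `CHUNK_CACHE_DIR`, `compute_prefill_kv`, and `get_chunk_kv` are illustrative assumptions.

```python
# Minimal sketch (not the talk's actual system): cache KV tensors of reusable
# text chunks on disk, keyed by a content hash, and load them on reuse
# instead of re-running the expensive prefill forward pass.
import hashlib
from pathlib import Path
from typing import Dict

import torch

# Hypothetical cache location; stands in for cheaper, slower storage tiers.
CHUNK_CACHE_DIR = Path("/tmp/kv_chunk_cache")
CHUNK_CACHE_DIR.mkdir(parents=True, exist_ok=True)


def chunk_key(text: str) -> str:
    """Content-addressed key so identical chunks map to the same cached KV."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def compute_prefill_kv(text: str) -> Dict[str, torch.Tensor]:
    """Placeholder for the expensive prefill pass over `text`.
    A real system would return per-layer key/value tensors from the model."""
    torch.manual_seed(len(text))  # deterministic dummy tensors for the sketch
    return {"keys": torch.randn(32, len(text), 128),
            "values": torch.randn(32, len(text), 128)}


def get_chunk_kv(text: str) -> Dict[str, torch.Tensor]:
    """Return the KV cache for a chunk, loading it from storage if it exists,
    otherwise computing it once and persisting it for later reuse."""
    path = CHUNK_CACHE_DIR / f"{chunk_key(text)}.pt"
    if path.exists():
        return torch.load(path)       # reuse: load cached KV, skip prefill
    kv = compute_prefill_kv(text)     # first time: pay the prefill cost
    torch.save(kv, path)
    return kv


if __name__ == "__main__":
    doc = "A retrieved passage that appears in many RAG prompts."
    kv_first = get_chunk_kv(doc)   # computed and written to disk
    kv_again = get_chunk_kv(doc)   # loaded from disk, prefill skipped
    assert torch.equal(kv_first["keys"], kv_again["keys"])
```

In this sketch the loading step is a plain `torch.load`; the talk's focus is on making that loading path fast enough that fetching reused KV caches from slower storage still beats recomputing prefill.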

Talk link:

Slides link:

AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG