Ongoing Projects

Using Cached Knowledge for Large Language Model Serving with CacheFuse

Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary context. To speed up the prefill of long LLM inputs, one can pre-compute the KV cache of a text and reuse it when that text appears as the prefix of another LLM input.

May 21, 2024

Resource allocation for Multi-Tenant Retrieval Augmented Generation Systems

Retrieval Augmented Generation (RAG) is a recent state-of-the-art paradigm that lets Large Language Models (LLMs) generalise to text generation tasks in new domains for which they have not been trained.

May 1, 2024

Knowledge Streaming from LLMs to Environments

Nowadays, not only can we chat with ChatGPT, but we can also let it write and execute code to do more complex tasks (e.g., letting ChatGPT plot the boundary of America using Python).

Apr 1, 2024

CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge or user-specific information. Yet using long contexts poses a challenge for responsive LLM systems, as nothing can be generated until the whole context is processed by the LLM.

Jan 31, 2024

Earth+: on-board satellite imagery compression leveraging historical earth observations

Satellite imagery is useful for a wide range of applications, from automatic road detection to forest monitoring. But did you know only 2% of satellite images can actually be downloaded to the ground?

Jan 31, 2024

GRACE: Loss-Resilient Real-Time Video through Neural Codecs

In real-time video communication, retransmitting lost packets over high-latency networks is not viable due to strict latency requirements. To counter packet losses without retransmission, two primary strategies are employed: encoder-based forward error correction (FEC) and decoder-based error concealment.

Jan 1, 2024