April 2 |
Kexin Pei, Andrew Chien
UChicago
|
How AI transforms your research fields Meeting summary
Andrew’s slides
|
April 9 |
Sanjay Krishnan
UChicago
|
The Devil Has A Long Tail: Part 2
Abstract: When I was in graduate school, a prominent computer systems professor gave a talk about emerging distributed cloud services titled “The Devil Has A Long Tail”. The basic premise was that the emergence of distributed, eventually consistent services was “a deal with the devil” – a promise of instant scalability at the cost of debugging the long tail of concurrency, performance, and programmability bugs. I argue that the application of LLMs in enterprise software today creates a similar dilemma: a promise of instant generality at the cost of a long tail of debugging both systems and machine learning problems. I show examples from recent enterprise LLM deployments that expose: (1) complex and rare failure modes, (2) deployment considerations that complicate debugging, and (3) the endless growth of scaffolding code that checks/corrects LLM outputs.
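The “scaffolding code that checks/corrects LLM outputs” the abstract mentions can be sketched as a validate-and-retry wrapper around a model call. This is a minimal sketch: the JSON schema, field names, and stub model below are invented for illustration and are not from the talk.

```python
import json

def validate_order(output: str):
    """Check that a model reply parses as JSON with the fields we expect."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "item" not in data or "qty" not in data:
        return None
    if not isinstance(data["qty"], int) or data["qty"] <= 0:
        return None
    return data

def call_with_scaffolding(llm, prompt: str, max_retries: int = 3):
    """Retry loop: re-prompt the model until its output passes validation."""
    for _ in range(max_retries):
        reply = llm(prompt)
        parsed = validate_order(reply)
        if parsed is not None:
            return parsed
        # Append a corrective instruction and try again; checks like this
        # are exactly the scaffolding that tends to accumulate in practice.
        prompt += "\nReturn ONLY valid JSON with keys 'item' and 'qty'."
    raise ValueError("model never produced valid output")

# Stub model that fails once, then succeeds, standing in for a real LLM call.
replies = iter(['not json', '{"item": "widget", "qty": 2}'])
result = call_with_scaffolding(lambda p: next(replies), "Extract the order.")
```

Each new failure mode discovered in deployment typically adds another branch to a validator like this, which is the “endless growth” the abstract refers to.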
|
Rescheduled |
Ce Zhang
UChicago/Together.ai
|
Building an Ecosystem for Open Foundation Models, Together
Abstract: In this talk, I hope to share insights and experiences from our collaboration with the community to enhance open source foundation model ecosystems. A primary opportunity (and challenge) lies in balancing and jointly optimizing data quality, model architecture, and infrastructure. This includes managing the vast scale and cost of GPU clusters, optimizing their use, and reasoning about data quality in a principled manner to enhance model quality. To this end, we have focused our efforts on several technical problems, such as developing the RedPajama dataset, which tries to provide a modular perspective on data quality; communication optimization algorithms to accelerate learning across disaggregated infrastructures; and optimized inference infrastructure through the deep co-design of systems and model architecture. In this talk, I will describe our learnings from some of these projects and hope to receive feedback from everyone on how we can collectively advance the open source foundation model ecosystem. Bio: Ce is currently the CTO of Together.ai and the incoming Neubauer Associate Professor of Data Science at the University of Chicago. He was previously an Associate Professor in Computer Science at ETH Zurich. The mission of his research is to make machine learning techniques widely accessible—while being cost-efficient and trustworthy—to everyone who wants to use them to make our world a better place. He believes in a systems approach to enabling this goal, and his current research focuses on building next-generation machine learning platforms and systems that are data-centric, human-centric, and declaratively scalable. Before joining ETH, Ce completed his PhD at the University of Wisconsin-Madison and spent another year as a postdoctoral researcher at Stanford, both under the guidance of Christopher Ré.
His work has received recognition including the SIGMOD Best Paper Award, the SIGMOD Research Highlight Award, a Google Focused Research Award, and an ERC Starting Grant.
|
April 23 |
Hao Zhang
UCSD
|
Dynamic speculative decoding Meeting summary Paper link
Abstract: Speculative decoding is a pivotal technique for reducing large language model inference latency, yet it faces two major challenges in practice: (1) the complexity of identifying and deploying an appropriate draft model, and (2) its potential to slow down inference due to the nature of speculation. This talk covers two new techniques to fill these gaps. I’ll first discuss lookahead decoding, a new speculative decoding method that can operate without a draft model and trade log(FLOPs) for a linear reduction in decoding steps. I’ll then show the practical challenges of applying speculative decoding in a real LLM serving environment (with continuous batching and batch size > 1). Contrary to expectations, under slightly higher request rates, speculative decoding often exacerbates latency issues rather than mitigating them. To address this challenge, I’ll present Dynamic Speculative Decoding (DSD), a new scheduling technique that dynamically modulates the proposal length based on an objective that characterizes the observed system load. I’ll show that DSD not only guarantees a reduction in average request latency, but also adapts to various styles of speculative decoding methods (draft model, self-speculation, tree verification, etc.). Bio: Hao Zhang is an Assistant Professor in the Halıcıoğlu Data Science Institute and the Department of Computer Science and Engineering at UC San Diego. Before joining UCSD, Hao was a postdoctoral researcher at UC Berkeley (2021 - 2023). Hao completed his Ph.D. in Computer Science at Carnegie Mellon University (2014 - 2020). During his PhD, Hao took a leave of absence to work at the ML platform startup Petuum Inc. (2016 - 2021). Hao’s research interests lie at the intersection of machine learning and systems. Hao’s past work includes Vicuna, FastChat, Alpa, vLLM, Poseidon, and Petuum. Hao’s research has been recognized with the Jay Lepreau Best Paper Award at OSDI ’21 and an NVIDIA Pioneer Research Award at NeurIPS ’17.
Hao also cofounded the company LMNet.ai (2023), which joined Snowflake in November 2023, and the nonprofit LMSYS Org (2023), which maintains many popular open models, evaluations, and systems.
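The scheduling intuition behind DSD, shrinking the proposal length as system load rises, can be sketched with a toy rule. The linear formula and the capacity and length constants below are illustrative stand-ins for DSD’s actual load objective, not the paper’s algorithm:

```python
def propose_length(batch_size: int, capacity: int = 16, max_len: int = 8) -> int:
    """Shrink the number of speculated tokens as system load rises.

    At low load, long speculation hides decode latency; near saturation,
    wasted speculative work makes latency worse, so the scheduler should
    fall back toward plain autoregressive decoding (proposal length 1).
    """
    load = min(batch_size / capacity, 1.0)  # crude load signal in [0, 1]
    return max(1, round(max_len * (1.0 - load)))
```

For example, an idle server speculates the full 8 tokens, a half-loaded one speculates 4, and a saturated one effectively disables speculation.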
|
April 30 |
Ted Shaowang
UChicago
|
StreamServe: Low-Latency Model Serving over Distributed Data Streams Meeting summary Project link
Abstract: The relevant features for a machine learning task may arrive as one or more continuous streams of data. Serving machine learning models over streams of data creates a number of interesting systems challenges in managing data routing, rate control, and data-model co-optimization. This talk presents StreamServe, a distributed streaming system that can serve predictions from machine learning models in real time. As an example, I will walk through a high-bandwidth network traffic analysis task, where our system dynamically selects the number of packets to collect and the models to use for individual network flows. Bio: Ted Shaowang is a 5th-year PhD student in the Department of Computer Science at the University of Chicago, advised by Sanjay Krishnan. Ted’s research interests lie at the intersection of streaming systems and machine learning model serving.
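The per-flow decision the abstract describes, choosing how many packets to collect and which model to use, can be sketched as a confidence-based escalation rule. The thresholds, packet budget, and model names below are invented for illustration and are not StreamServe’s actual policy:

```python
def pick_model(confidence: float, packets_seen: int, budget: int = 32):
    """Decide the next action for one network flow.

    Stop early with the cheap model when it is already confident; collect
    more packets while the budget allows; otherwise escalate to a heavier
    model. Returns (action, packet target).
    """
    if confidence >= 0.9:
        return "small-model", packets_seen       # confident: stop early
    if packets_seen < budget:
        return "collect-more", min(budget, packets_seen * 2)
    return "large-model", packets_seen           # out of budget: escalate
```

A real system would co-optimize these thresholds with data routing and rate control, which is where the systems challenges in the abstract come from.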
|
May 14 |
Bo Li
UChicago
|
Risk Assessment, Safety Alignment, and Guardrails for Generative Models
Abstract: Large language models (LLMs) have garnered widespread attention due to their impressive performance across a range of applications. However, our understanding of the trustworthiness and risks of these models remains limited. The temptation to deploy proficient foundation models in sensitive domains like healthcare and finance, where errors carry significant consequences, underscores the need for rigorous safety evaluations, enhancement, and guarantees. Recognizing the urgent need for developing safe and beneficial AI, our recent research seeks to design a unified platform to evaluate the safety of LLMs from diverse perspectives such as toxicity, stereotype bias, adversarial robustness, OOD robustness, ethics, privacy, and fairness; enhance LLM safety through knowledge integration; and provide safety guardrails and certifications. In this talk, I will first outline our foundational principles for safety evaluation, detail our red-teaming tactics, and share insights gleaned from applying our DecodingTrust platform to different models, such as proprietary and open-source models, as well as compressed models. Further, I will delve into our methods for enhancing model safety, such as hallucination mitigation. I will also explain how knowledge integration helps align models and show that the RAG framework achieves provably lower conformal generation risk than vanilla LLMs. Finally, I will briefly discuss our robust guardrail framework for risk mitigation in practice. Bio: Dr. Bo Li is the Neubauer Associate Professor in the Department of Computer Science at the University of Chicago. She is the recipient of the IJCAI Computers and Thought Award, an Alfred P. Sloan Research Fellowship, IEEE AI’s 10 to Watch, an NSF CAREER Award, the MIT Technology Review TR-35 Award, the Dean’s Award for Excellence in Research, the C.W. Gear Outstanding Faculty Award, an Intel Rising Star Award, a Symantec Research Labs Fellowship, a Rising Star Award, research awards from tech companies such as Amazon, Meta, Google, Intel, IBM, and eBay, and best paper awards at several top machine learning and security conferences. Her research focuses on both theoretical and practical aspects of trustworthy machine learning, which is at the intersection of machine learning, security, privacy, and game theory. She has designed several scalable frameworks for certifiably robust learning and privacy-preserving data publishing. Her work has been featured by several major publications and media outlets, including Nature, Wired, Fortune, and The New York Times. Her website is http://boli.cs.illinois.edu/
|
Oct 15 |
Kuntai Du
UChicago
|
Optimizing Communication for Distributed LLM Inference
Abstract: Previous work has identified GPU memory capacity as the primary bottleneck in LLM inference, necessitating effective KV cache management strategies. However, based on a long-term collaboration with vLLM, the most popular open-source serving engine, we observe that the landscape is shifting toward distributed LLM inference due to newly emerging trends such as long-context KV cache reuse, disaggregated prefilling, and multi-modal LLM inference. The central challenge thus evolves into designing efficient KV cache communication mechanisms. This talk explores potential solutions for optimizing communication in distributed systems and argues that effective communication requires two new roles: an orchestrator and a KV cache store. These roles must collaborate closely to meet stringent service-level objectives, paving the way for scalable and efficient distributed LLM-serving systems. Bio: Kuntai Du is a 6th-year PhD student at UChicago. His research focuses on data transfer for distributed inference systems, including analytics-aware video streaming in distributed video analytics settings and effective KV cache transfer for distributed LLM inference. He is the recipient of a Siebel Scholarship.
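One of the two roles the abstract proposes, a KV cache store, can be sketched as a map from a token-prefix hash to its cached KV tensors, with lookups returning the longest cached prefix (the piece of the context whose prefill can be skipped). The class names, hashing scheme, and lookup policy below are invented for illustration and are not vLLM’s API:

```python
import hashlib

class KVCacheStore:
    """Toy KV cache store keyed by token-prefix hash.

    In a real disaggregated setup an orchestrator decides which node holds
    the cache and the tensors stream over the network; a dict stands in
    for that store here, and a string stands in for the KV tensors.
    """
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(token_ids):
        return hashlib.sha256(str(list(token_ids)).encode()).hexdigest()

    def put(self, token_ids, kv):
        self._store[self._key(token_ids)] = kv

    def longest_prefix(self, token_ids):
        """Return (hit length, kv) for the longest cached prefix, else (0, None)."""
        for n in range(len(token_ids), 0, -1):
            kv = self._store.get(self._key(token_ids[:n]))
            if kv is not None:
                return n, kv
        return 0, None

store = KVCacheStore()
store.put([1, 2, 3], "kv-for-123")
hit_len, kv = store.longest_prefix([1, 2, 3, 4, 5])  # reuse 3 cached tokens
```

The communication problem the talk targets is precisely what this toy hides: moving the matched KV tensors to the right GPU fast enough to beat recomputing the prefill.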
|
Oct 22 |
Qizheng Zhang
Stanford
|
Practical Online Learning of In-Network ML Models via a Generative Labeling Agent
Abstract: Recent work on in-network machine learning (ML) assumes that offline-trained models will perform well in modern networks. However, these models often struggle with fluctuating traffic and network conditions, necessitating frequent online validation and updates. In this talk, I will present Caravan, a practical online learning system for in-network ML. Caravan addresses two key challenges: (a) automatic labeling of evolving traffic and (b) efficient model quality monitoring. By repurposing existing systems like heuristics and foundation models, Caravan generates high-quality labeled data and introduces an “accuracy proxy” metric to track and mitigate model drift. As a case study, I will demonstrate how foundation models (e.g., GPT-4) can be adapted for near-real-time labeling of high-volume network traffic. Bio: Qizheng Zhang is a 3rd-year PhD student at Stanford University. His recent research focuses on scaling AI models, including large language model (LLM) serving, high-speed traffic analysis, and video analytics, while addressing key system challenges such as performance optimization, resource management, and reliability. He is also exploring the integration of LLM agents into large-scale systems.
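The two ingredients the abstract names, automatic labeling and an agreement-based “accuracy proxy”, can be sketched together. The rule-based labeler below stands in for a foundation-model labeler, and the drift threshold and flow fields are invented for illustration, not Caravan’s actual design:

```python
def heuristic_label(flow):
    """Cheap auto-labeler standing in for a GPT-4-style labeling agent."""
    return "attack" if flow["pkts_per_sec"] > 1000 else "benign"

def accuracy_proxy(model_preds, proxy_labels):
    """Agreement rate with auto-generated labels.

    No ground truth is available online, so agreement with the labeler
    serves as a proxy signal for model drift.
    """
    agree = sum(p == q for p, q in zip(model_preds, proxy_labels))
    return agree / len(proxy_labels)

flows = [{"pkts_per_sec": 50}, {"pkts_per_sec": 5000}, {"pkts_per_sec": 20}]
proxy = [heuristic_label(f) for f in flows]
preds = ["benign", "attack", "attack"]         # in-network model's outputs
drifting = accuracy_proxy(preds, proxy) < 0.9  # low agreement: trigger update
```

When the proxy drops below the threshold, the freshly labeled window of traffic becomes the retraining set, closing the online learning loop.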
|