Research

Research Mission

The main research approach that we have taken is to gain deep understanding of the target systems and applications’ needs and to rethink systems’ core abstractions based on the understanding; we iterate between these two activities. We have been motivated by real problems experienced by developers and operators in practice. With these first-hand experiences, we perform comprehensive measurements to pinpoint system bottlenecks, characterize application behaviors and their needs, and understand practitioners’ pain points.

A systems solution often requires designing a new abstraction or adapting an existing abstraction in a novel way (e.g., in a new scenario, under new settings or assumptions). We take different routes when designing new systems solutions. On the one hand is the freedom of doing blue sky research, where we openly question the fundamentals underlying distributed systems in face of rapidly-emerging needs and challenges. For example, we rethink the design of modern cloud storage systems and build the first, cost-effective cloud caching system that exploits serverless functions as a novel storage medium. On the other hand, we seek practical and easily-deployable solutions for otherwise sophisticated systems problems. Distributed systems often involve complex cross-component interactions in order to support cross-cutting tasks such as scheduling. This leads us to gravitate towards simple and general solutions that solve not a specific problem, but which satisfy a general set of applications. For instance, we design a new function scheduler, which bridges the divide between user-space scheduling and kernel scheduling while being transparent to any serverless platforms.

Research Projects

We make our research and its artifacts, including software and datasets, accessible to support use and development both in academia and industry.

The research artifacts are publicly available at github.com/ds2-lab.

Serverless AI

Interactive ML workloads need instant access to elastic GPU resources. λScale and MorphServe enable scalable and fast horizontal scaling and vertical scaling for bursty LLM inference workloads. ZenFlow speeds up LLM fine-tuning by decoupling GPU-CPU updates to reduce stalls. NotebookOS enables on-demand GPUs for Jupyter-based interactive training.

λScale: arXiv
NotebookOS: arXiv
ZenFlow: arXiv
MorphServe: arXiv

Storage Systems for AI

This line of research rethinks storage system designs to sustain the exponential AI data explosion. zLLM and BitX are new lossless compressing algorithms that reduce massive LLM storage footprint by 50%. ELF and ELVES near-losslessly compress ML models to achieve effective model storage reduction. SHADE and FedCaSe intelligently cache most important training samples without losing training quality.

zLLM: arXiv
ELF: pdf code
FedCaSe: pdf code
SHADE: pdf code

Redesigning FaaS Platforms

Custom FaaS container support is gaining traction as it enables better control over OSes, versioning, and tooling for modernizing FaaS applications. Our research looks to build scalable FaaS container platform that 1) offers fast container provisioning and 2) minimizes cold starts under concurrent workloads.

FaaSNet: pdf code
CIDRE: pdf code

Serverless Storage

We argue that the emerging serverless computing paradigm provides a well-suited, cost-effective platform to fundamentally achieve elastic data caching and data storage. Check out our serverless storage project series:

InfiniCache: pdf project
InfiniStore: pdf code
λFS: pdf code

Serverless Analytics

Running complex data analytics jobs on FaaS platforms is appealing but also poses challenges for serverless execution frameworks, which will need to rapidly scale and schedule tasks. Our research pioneers to innovate serverless data analytics to make life easier for data scientists.

Wukong: pdf project

Serverless OS Scheduling

The execution time of serverless functions is typically short and thus is sensitive to resource contention. The CPU schedulers of today's main stream operating systems are simply not designed for short-job-dominant FaaS workloads. Our research proposes new scheduling techniques to address this mismatch.

SFS: pdf code
ALPS: pdf code

Research Sponsors

We are grateful for the generous support from our sponsors, including National Science Foundation, Adobe Research, CloudBank, 4-VA 4 Initiatives, Amazon Web Services, Meta Research, Samsung, Google Cloud, and IBM Cloud.

SPX: Collaborative Research: Cross-stack Memory Optimizations for Boosting I/O Performance of Deep Learning HPC Applications

Award Info: National Science Foundation Award CCF-2318628
PI: Yue Cheng
Funding Amount: $320,603

OAC Core: SMALL: DeepJIMU: Model-Parallelism Infrastructure for Large-scale Deep Learning by Gradient-Free Optimization

Award Info: National Science Foundation Award OAC-2007976
PIs: Liang Zhao (Emory), Yue Cheng
Funding Amount: $498,609

MRI: Acquisition of an Adaptive Computing Infrastructure to Support Compute- and Data-Intensive Multidisciplinary Research

Award Info: National Science Foundation Award MRI-2018631
PIs: Elise Miller-Hooks (GMU), Shobita Satyapal (GMU), Maria Emelianenko (GMU), Yue Cheng, Jayshree Sarma (GMU)
Funding Amount: $750,000

CAREER: Harnessing Serverless Functions to Build Highly Elastic Cloud Storage Infrastructure

Award Info: National Science Foundation Award CNS-2322860
PI: Yue Cheng
Funding Amount: $572,897 + $16,000 REU

FMSG: Cyber: Federated Deep Learning for Future Ubiquitous Distributed Additive Manufacturing

Award Info: National Science Foundation Award CMMI-2134689
PIs: Jia Liu (Auburn), Nima Shamsaei (Auburn), Yue Cheng
Funding Amount: $498,762

Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction 🆕

Award Info: National Science Foundation Award OAC-2403313
PI: Yue Cheng
Funding Amount: $299,973

Elements: A Sustainable, Resource-Efficient Cyberinfrastructure for Notebook Interactive ML Training Workloads 🆕

Award Info: National Science Foundation Award OAC-2411009
PIs: Yue Cheng, Geoffrey Fox
Funding Amount: $600,000

REU Site: THE DATA JUSTICE ACADEMY 🆕

Award Info: National Science Foundation Award SMA-2349503
PIs: Claudia Scholz, Yue Cheng
Funding Amount: $481,232

Serverless Storage Management for Large-scale Analytics Workloads

Award Info: Adobe Research Gift
PI: Yue Cheng
Funding Amount: $95,000

Serverless and Scalable GNN Training with Disaggregated Compute and Storage

Award Info: Meta Research Award
PIs: Yue Cheng, Liang Zhao (Emory)
Funding Amount: $50,000

ML Workload Acceleration

Award Info: 4-VA Collaborative Grant
PIs: Huaicheng Li (VT), Yue Cheng
Funding Amount: $5,000

Highly Efficient Pre-Trained LLM Storage with Near-Storage Compression and CXL Memory Integration 🆕

Award Info: Samsung GRO 2023 Award
PIs: Yue Cheng, Ali Anwar (UMN)
Funding Amount: $250,000