Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference. Yue Zhu, Hao Yu, et al. CLOUD 2025 (short paper).
Scalable and Efficient LLM Serving with the vLLM Production Stack. Junchen Jiang, Yue Zhu. OSSNA 2025 (talk).
How Low Can LoRA Go: System-Level Throughput, Energy, and Model Quality Tradeoffs when Fine-Tuning Adapters. Connor Espenshade, Umesh Deshpande, et al. ISCA 2025 (workshop paper).
Optimizing GPU Multiplexing for Efficient and Cost-Effective Access to Diverse Large Language Models in GPU Clusters. Yue Zhu, Chen Wang, et al. MASCOTS 2024 (conference paper).
GPU Optimizations for Efficient and Cost-Effective Access to Diverse Large Language Models in Research Cluster. Chen Wang, Yue Zhu, et al. MLSys 2024 (poster).
Towards Pareto Optimal Throughput in Small Language Model Serving. Pol G. Recasens, Yue Zhu, et al. EuroSys 2024 (workshop paper).