Power-aware Deep Learning Model Serving with µ-Serve
Haoran Qiu, Weichao Mao, et al.
USENIX ATC 2024
In the rapidly evolving landscape of Large Language Models (LLMs), overcoming low GPU cluster utilization (as low as 20-30% in traditional setups) is crucial for efficiently serving these models in Kubernetes. This talk will share insights from deploying and serving LLMs using MIG partitions and dynamic resource allocation (DRA). Our experiments discovered that the optimal GPU MIG partition size depends on the specific LLM model and its load, highlighting the necessity and feasibility of using Dynamic Resource Allocation (DRA) for dynamically scaling model-serving instances vertically.
We'll showcase deploying the open-source vLLM framework in Kubernetes, focusing on scaling vLLM instances for increased loads while maximizing GPU utilization. Attendees will gain practical knowledge on selecting effective MIG partitions for different models and using DRA to optimize their model-serving systems.
Haoran Qiu, Weichao Mao, et al.
USENIX ATC 2024
Oleg Kolosov, Gala Yadgar, et al.
ICDCS 2023
Shiqiang Wang, Mingyue Ji
NeurIPS 2022
Olivier Tardieu, David Grove, et al.
PLDI 2023