Haoran Qiu, Weichao Mao, et al.
ASPLOS 2024
Distributed LLM serving systems optimize per-request latency and throughput. However, under long-context workloads, inference accuracy becomes more variable. When incorrect responses trigger retries, accuracy directly translates into cumulative user-visible delay that is not captured by single-shot latency metrics.
In this work, we argue that under long-context serving, accuracy becomes a latency concern through retry dynamics. We introduce TTCA (time-to-correct-answer), a metric that measures the wall-clock time required to obtain the first correct response. Our measurement study shows that prompt characteristics such as length and language amplify accuracy variance, which inflates TTCA. We demonstrate a capability-based routing design that reduces TTCA. Our results suggest that in long-context distributed serving, accuracy should be treated as a first-class systems objective.
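The TTCA metric described above can be sketched in a few lines: cumulative wall-clock time across retries up to and including the first correct response. This is a minimal illustration, not the paper's implementation; the `ttca` helper and the tuple representation of attempts are assumptions for exposition.

```python
from math import inf

def ttca(attempts):
    """Time-to-correct-answer (illustrative sketch, not the paper's code).

    `attempts` is a list of (latency_seconds, is_correct) tuples in the
    order the serving system produced them, retries included. Returns the
    cumulative latency up to and including the first correct response, or
    infinity if no attempt was correct.
    """
    elapsed = 0.0
    for latency, correct in attempts:
        elapsed += latency
        if correct:
            return elapsed
    return inf  # no correct response ever observed

# One incorrect attempt triggers a retry; TTCA counts both attempts.
print(ttca([(2.0, False), (1.5, True)]))  # -> 3.5
```

Note how a single retry doubles the user-visible delay relative to the per-request latency that standard serving metrics report, which is the gap the abstract highlights.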
Jiaxi Li, Yue Zhu, et al.
EuroSys 2026
Timothy Chainer, Liz Hulihan, et al.
ARPA-E COOLERCHIPS Kickoff Meeting 2023
Jose Manuel Bernabe' Murcia, Eduardo Canovas Martinez, et al.
MobiSec 2024