Haoran Qiu, Weichao Mao, et al.
ASPLOS 2024
Distributed LLM serving systems optimize per-request latency and throughput. However, under long-context workloads, inference accuracy becomes more variable. When incorrect responses trigger retries, accuracy directly translates into cumulative user-visible delay that is not captured by single-shot latency metrics.
In this work, we argue that under long-context serving, accuracy becomes a latency concern through retry dynamics. We introduce TTCA (time-to-correct-answer), a metric that measures the wall-clock time required to obtain the first correct response. Our measurement study shows that prompt characteristics such as length and language amplify accuracy variance, which inflates TTCA. We demonstrate a capability-based routing design that reduces TTCA. Our results suggest that in long-context distributed serving, accuracy should be treated as a first-class systems objective.
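The TTCA metric described above can be sketched in a few lines: cumulative wall-clock time across retries up to and including the first correct response. This is a minimal illustration, not the paper's implementation; the `ttca` helper and the tuple representation of attempts are assumptions for exposition.

```python
from math import inf

def ttca(attempts):
    """Time-to-correct-answer (illustrative sketch, not the paper's code).

    `attempts` is a list of (latency_seconds, is_correct) tuples in the
    order the serving system produced them, retries included. Returns the
    cumulative latency up to and including the first correct response, or
    infinity if no attempt was correct.
    """
    elapsed = 0.0
    for latency, correct in attempts:
        elapsed += latency
        if correct:
            return elapsed
    return inf  # no correct response ever observed

# One incorrect attempt triggers a retry; TTCA counts both attempts.
print(ttca([(2.0, False), (1.5, True)]))  # -> 3.5
```

Note how a single retry doubles the user-visible delay relative to the per-request latency that standard serving metrics report, which is the gap the abstract highlights.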
Jiaxi Li, Yue Zhu, et al.
EuroSys 2026
Timothy Chainer, Liz Hulihan, et al.
ARPA-E COOLERCHIPS Kickoff Meeting 2023
Jose Manuel Bernabe' Murcia, Eduardo Canovas Martinez, et al.
MobiSec 2024