Workshop paper

Step-Tagging Early-Stopping: Toward controlling the generation of Language Reasoning Models through black-box step monitoring

Abstract

Language Reasoning Models (LRMs) have shown impressive performance on complex problems that require multiple reasoning steps. However, a growing body of work shows that LRMs remain inefficient, over-generating verification and self-reflection steps. To address this challenge, we introduce the Step-Tagging Early-Stopping (ST-ES) framework, which uses a lightweight sentence classifier to annotate, in real time, the type of each reasoning step an LRM generates. We show that limiting the count of specific step types, especially verification and self-reflection steps, yields a more accurate and token-efficient early-stopping criterion than a token-count baseline, and that each step type yields a different efficiency trade-off. Unlike prior dynamic early-stopping methods, ST-ES operates in a fully black-box setting and offers interpretable early-stopping criteria. We evaluate ST-ES on three mathematical reasoning benchmarks (MATH500, GSM8K, and AIME) and two knowledge and reasoning benchmarks (MMLU and GPQA). We achieve a 20% to 50% token reduction while maintaining accuracy comparable to standard generation.
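
To make the step-count criterion concrete, the following is a minimal sketch of how such a monitor could work, not the implementation evaluated in this paper. The keyword-based `tag_step` function is a hypothetical stand-in for the trained sentence classifier, and the label names, the `budgets` mapping, and the choice to drop the step that exceeds its budget are all illustrative assumptions.

```python
from collections import Counter
from typing import Callable, Iterable, List, Mapping, Tuple


def tag_step(sentence: str) -> str:
    """Toy keyword tagger (assumption): stands in for a trained sentence classifier."""
    s = sentence.lower()
    if any(k in s for k in ("verify", "double-check", "check")):
        return "verification"
    if any(k in s for k in ("wait", "hmm", "alternatively")):
        return "self-reflection"
    return "other"


def early_stop(
    sentences: Iterable[str],
    budgets: Mapping[str, int],
    tagger: Callable[[str], str] = tag_step,
) -> Tuple[List[str], bool]:
    """Consume generated sentences one at a time and stop once any monitored
    step type would exceed its budget. Only the generated text is inspected,
    so no access to model internals is needed.

    Returns the sentences kept and a flag telling whether generation was cut.
    """
    counts: Counter = Counter()
    kept: List[str] = []
    for sentence in sentences:
        label = tagger(sentence)
        counts[label] += 1
        # Stop before keeping a step whose type is already over budget.
        if label in budgets and counts[label] > budgets[label]:
            return kept, True
        kept.append(sentence)
    return kept, False


# Illustrative trace (invented for this sketch, not benchmark data).
trace = [
    "Let x be the number of apples.",
    "So x = 12.",
    "Let me verify this result.",
    "Wait, maybe I misread the problem.",
    "Let me double-check the arithmetic again.",
    "The answer is 12.",
]
budgets = {"verification": 1, "self-reflection": 2}
kept, stopped = early_stop(trace, budgets)
```

In this sketch, generation is cut at the second verification step, so `kept` holds the first four sentences and `stopped` is true. Because each budget is tied to a named step type, the reason for stopping can be read directly from the counts, unlike a raw token limit.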