Soft-Masked Diffusion Language Models
Michael Hersche, Samuel Moor, et al.
ICLR 2026
Hybrid Mamba-Transformer models have emerged as promising alternatives to Transformers, offering efficiency and competitive performance. However, they struggle to generalize beyond their training context windows, collapsing on long-context tasks. We provide the first systematic analysis of this failure, showing that it arises from uncontrolled state growth and uneven receptive field contributions across the hybrid architecture. Guided by this understanding, we introduce Universal Position Interpolation (UPI), a lightweight, training-free scaling method that unifies Mamba’s cumulative decay with Transformer rotary frequency scaling. UPI selectively stabilizes unstable Mamba dynamics while rescaling Transformer encodings, controlling state growth and enabling reliable long-context generalization, with only a few auxiliary forward passes. Evaluation shows that UPI extends multiple state-of-the-art hybrid and pure Mamba models from 4K to up to 64K tokens on PG-19 perplexity, LongBench and RULER benchmarks, without sacrificing short-context accuracy. These findings establish the first principled bridge between context length extension on Transformers and state-space models and open a new direction for training-free context extension methods for emerging hybrid models.
Robert Farrell, Rajarshi Das, et al.
AAAI-SS 2010
Davis Wertheimer, Aozhong Zhang, et al.
ICLR 2026
Chen-chia Chang, Wan-hsuan Lin, et al.
ICML 2025