Gosia Lazuka, Andreea Simona Anghel, et al.
SC 2024
Transformer-based language models (LMs) notably struggle to reliably capture distant contextual information. This work introduces a novel approach using meta-tokens -- special tokens injected during pre-training -- paired with a dedicated meta-attention mechanism that guides LMs to use these tokens. We pre-train a language model equipped with meta-attention in addition to causal multi-head attention on <100B tokens, achieving strong performance on a suite of synthetic tasks. Our method facilitates length generalization up to 2× the context window after extension with YaRN. We provide an information-theoretic analysis revealing that meta-tokens “sharpen” the positional encoding, allowing them to operate as content-based anchors that compress the preceding context and “cache” it within the meta-token. We empirically confirm this by visualizing model internals to study the residual stream. Together, our findings demonstrate that meta-tokens and meta-attention provide a simple, data-efficient pre-training method, grounded by mechanistic insights into their role in enabling length generalization.
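A minimal sketch of what such a meta-attention pathway could look like, assuming meta-tokens are inserted at a fixed interval and that, in this pathway, each position attends causally only to meta-token keys. The names (META_TOKEN_ID, inject_meta_tokens, meta_attention, every) and the exact masking scheme are hypothetical illustrations under those assumptions, not the paper's verified implementation.

import torch

META_TOKEN_ID = 50257  # hypothetical: a single ID appended to the base vocabulary

def inject_meta_tokens(ids: torch.Tensor, every: int = 64) -> torch.Tensor:
    """Insert one meta-token after every `every` ordinary tokens (assumed scheme)."""
    meta = torch.full((ids.size(0), 1), META_TOKEN_ID, dtype=ids.dtype)
    pieces = []
    for chunk in ids.split(every, dim=-1):
        pieces += [chunk, meta]
    return torch.cat(pieces, dim=-1)

def meta_attention(q, k, v, is_meta):
    """Causal attention whose keys are restricted to meta-token positions.

    q, k, v: (batch, heads, seq, head_dim); is_meta: (batch, seq) boolean mask.
    """
    seq = q.size(-2)
    causal = torch.ones(seq, seq, dtype=torch.bool, device=q.device).tril()
    allowed = causal & is_meta[:, None, None, :]  # a visible key must be a meta-token
    scores = (q @ k.transpose(-2, -1)) / k.size(-1) ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))
    # Positions before the first meta-token see no valid key; softmax over an
    # all-(-inf) row yields NaN, so zero those rows out.
    return torch.softmax(scores, dim=-1).nan_to_num(0.0) @ v

In a full block, this pathway's output would be summed with the standard causal multi-head attention output in the residual stream, matching the abstract's description of meta-attention operating "in addition to causal multi-head attention"; how the two are combined is an assumption here.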
Yidi Wu, Thomas Bohnstingl, et al.
ICML 2025
Michael Hersche, Samuel Moor, et al.
ICLR 2026
Natalia Martinez Gil, Dhaval Patel, et al.
UAI 2024