
How an extra training step can unlock AI’s reasoning power

What happens between pre-training and post-training matters a lot more than people may realize, new IBM research shows.

For years, the basic recipe for building a capable large language model was straightforward: train a model on mountains of text, then teach it to respond in a helpful, humanlike way through reinforcement learning. At some point, an intermediate training phase was added in, with a heavy focus on math, code, and science, and the reasoning capabilities of LLMs seemed to take a giant leap.

This stage is now referred to as mid-training. It’s a routine, if mysterious, step in training today’s reasoning models to do things like rooting out mistakes in complex code bases, lengthy contracts, or financial statements. A new IBM study, the first large-scale, systematic look at mid-training in open-source LLMs, explains why the step is so effective.

Through more than 500 controlled experiments, IBM researchers found that mid-training boosted overall reasoning capabilities in models of varying sizes and architectures by 3 to 4 times, while preserving knowledge gained during pre-training. Models that skipped this extra step and trained on the same math and science knowledge via reinforcement learning (RL) during post-training saw only limited improvement.

“Mid-training and reinforcement learning are not interchangeable stages,” said the study’s lead author, Bharat Runwal, an IBM researcher who works on the team behind IBM’s Granite family of models. “They operate through fundamentally different mechanisms, and each does something the other cannot.”

Runwal and his colleagues compared open-source base models drawn from four model families — IBM Granite, Mistral, Meta’s LLaMA, and NVIDIA’s Nemotron-H — ranging from 3 billion to 24 billion parameters in size. They also tested a traditional transformer architecture and a hybrid design combining a transformer’s attention mechanism with newer recurrent-style processing. Across six reasoning benchmarks, models trained under an optimal mid-training pipeline scored an average of 29 to 42 points higher than models trained on the same data via RL.

Researchers have applied the mid-training recipe and pipeline outlined in the study to the next IBM Granite models out soon. IBM has also open-sourced the pipeline for the community to use, prompting several shout-outs on Twitter when the paper became public last month. “Tons of practical tricks for doing mid-training properly,” wrote Cameron Wolfe, a staff research scientist at Netflix and author of the Deep (Learning) Focus blog. “Great read for anyone interested in adapting OSS models to specialized use cases!”

Teaching models to reason, and not just answer

Mid-training as a concept goes back to 2024, though it wasn’t always called that. Some model developers inserted a ‘cool-down’ step at the end of pre-training to extend the model’s context length and working memory so it could process more information in a prompt. Others tacked on a data annealing step during post-training to integrate high-quality domain knowledge into the model.

The modern definition of mid-training includes both data annealing and context-length extension, and as its name suggests, it sits squarely between pre-training, when a model ingests billions to trillions of words and parts of words called tokens, and post-training, when its behavior is shaped by high-quality domain-specific data and human interaction.

The researchers drew their data for mid-training from math problems, coding challenges, and science reasoning datasets and kept to a budget of 27 billion tokens — small by the standard of pre-training, which can stretch to 15 trillion tokens or more. Their goal was to figure out the ideal data mixture, when to apply it, and whether mid-training would help or hinder the reinforcement learning step after. They found that data mixture matters, especially for mid-trained models. Switching the mid-training recipe from just math and code to math, code, and science increased overall reasoning performance by three to six points on average, while the same adjustment during reinforcement learning produced negligible gains.

The effect was even greater for scientific reasoning. Models mid-trained on science data unlocked 17 to 28 more points on the GPQA-Diamond benchmark than models fine-tuned on the same data. The team’s research suggests that scientific reasoning should be added during mid-training for it to be fully exploited later.

Mid-training also seems to change how models tackle difficult math problems. Pre-trained models provided terse answers to MATH500 problems, but after mid-training, they showed their work, step by step, in long responses. Not surprisingly, their accuracy scores shot up — Granite-3.3-8B went from 16.9% to 79.5% after mid-training plus RL. “Mid-training teaches models to reason, not just answer,” said Ashish Agrawal, a researcher who works on IBM Granite and contributed to the study.

Model           Stage         Pass rate   Response length
Granite-3.3-8B  Base          16.9%       120 tokens
                Mid-training  75.5%       2,254 tokens
                RL            79.5%       1,700 tokens
LLaMA-3.1-8B    Base          2.6%        158 tokens
                Mid-training  43.1%       1,052 tokens
                RL            64.6%       1,188 tokens
Nemotron-H-8B   Base          66.6%       452 tokens
                Mid-training  61.6%       1,928 tokens
                RL            83.0%       1,780 tokens

There is also evidence that mid-training can help a model push past its competency level during RL training. Granite-3.3-8B gradually learned to solve difficult math and code problems that stumped it at the start of RL training, suggesting that RL can unlock new capabilities in properly mid-trained models.

Mid-training is most effective, the researchers found, when applied after a model has been trained to process long sequences of text, rather than at earlier stages of pre-training. Since most open-source base models go through long-context extension before their release, mid-training is a natural next step for developers.

Mid-training and RL work at different scales

If the paper has one takeaway, it’s that you should not skip mid-training. RL training cannot take its place, but proper mid-training can amplify RL’s effects. “You need to get mid-training right if you want to build an effective reasoning model,” said Runwal.

To find out why, the researchers ran an ablation study, a kind of MRI for LLMs, exploring how mid-training and reinforcement learning change a model’s structure and internal representations. They found that the two stages operate through fundamentally different but complementary mechanisms: one improves the model with broad brushstrokes, while the other makes detailed adjustments.

Mid-training restructures more than 90% of a model's weights, with changes distributed broadly across a model’s layers and components. Reinforcement learning, by contrast, modifies only about 5% of parameters, and these changes are front-loaded during the first 200 to 400 training steps. RL applies nearly identical weight changes, regardless of whether mid-training preceded it.
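A back-of-the-envelope version of that kind of weight-change measurement can be sketched by diffing two checkpoints. The helper, threshold, and toy data below are assumptions for illustration, not the study’s exact methodology:

```python
import numpy as np

def fraction_changed(before, after, tol=1e-6):
    """Fraction of parameters that moved by more than `tol` between
    two checkpoints, given as dicts of name -> weight array."""
    moved = total = 0
    for name, w0 in before.items():
        w1 = after[name]
        moved += np.count_nonzero(np.abs(w1 - w0) > tol)
        total += w0.size
    return moved / total

# Toy checkpoints: a sparse, RL-style update touches few weights.
rng = np.random.default_rng(1)
base = {"layer0": rng.normal(size=(100, 100))}
rl = {"layer0": base["layer0"].copy()}
rl["layer0"][:5] += 0.1            # update only 5 of 100 rows
print(fraction_changed(base, rl))  # → 0.05
```

On a real model, the same loop over every tensor in two saved checkpoints would yield the kind of "about 5% of parameters" figure the study reports for RL.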

A similar story emerged when researchers looked at how similarly models represented information at each stage of the pipeline, using a technique called centered kernel alignment. They found that after RL, a model's internal representations closely resembled its mid-trained checkpoint. RL seems to work within the space mid-training creates, improving a model without altering the geometry that mid-training established.
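For readers curious about the mechanics, linear centered kernel alignment compares two sets of layer activations and is invariant to rotations of the feature space, which makes it a natural way to ask whether two checkpoints share the same representational geometry. A minimal sketch, with illustrative toy data not drawn from the study:

```python
import numpy as np

def linear_cka(x, y):
    """Linear centered kernel alignment between two activation
    matrices of shape (n_samples, n_features). Returns a value in
    [0, 1]; 1 means the representations share the same geometry."""
    # Center each feature so only geometry, not mean offset, counts.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # ||Y^T X||_F^2 measures cross-similarity between the spaces.
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    # Normalize by each representation's self-similarity.
    norm = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return cross / norm

# Toy check: a representation compared with a rotated copy of itself
# scores 1, since CKA ignores orthogonal transformations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 16))
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
print(round(linear_cka(acts, acts @ q), 4))  # → 1.0
```

A high CKA score between an RL-trained model’s activations and its mid-trained checkpoint, layer by layer, is the kind of evidence behind the claim that RL works within the space mid-training creates.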

Many of today’s LLMs have broken out of chat and into the world, where they can call APIs and carry out real-world tasks. The race is on to develop new ways to improve their reasoning capabilities further. But without a solid mid-training foundation, the study suggests, those techniques may have limited impact.

Notes

  1. The benchmarks included the notoriously difficult Google-Proof Question & Answer (GPQA) Diamond and the American Invitational Mathematics Examination (AIME), which test PhD-level proficiency in science and math.
