Teaching Thinking Models to Reason with Tools

A Full-Pipeline Recipe for Tool-Integrated Reasoning

Qianjia Cheng1,2* Yuchen Zhang2,3* Zhilin Wang2,6 Yuxin Zuo5 Shunkai Zhang2,3 Yuchen Fan7 Yu Qiao2 Bowen Zhou2,5 Ning Ding5† Yu Cheng2,4† Yun Luo2† Ganqu Cui2†
1Zhejiang University 2Shanghai AI Laboratory 3Peking University 4The Chinese University of Hong Kong 5Tsinghua University 6University of Science and Technology of China 7Shanghai Jiao Tong University

Our models and dataset are coming soon.

AIME 2025 Leaderboard

(Leaderboard figure: AIME 2025 accuracy, tool-integrated vs. text-only reasoning, at the <10B and ~30B scales.)

Abstract

Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when strong thinking models make almost no actual tool calls.

Case study comparing late verification with interleaved tool use
The same problem, two policies for invoking the tool. Grey boxes denote text-only reasoning; In[k] cells indicate tool calls, and Out[k] cells represent tool responses. Left: Qwen3-30B-Thinking-2507 treats the Python sandbox as a final-pass verifier. After a text-only Burnside derivation yields the inconsistent value 2420/32 = 75.625, it makes a single late-stage call that hard-codes the flawed assumptions (e.g., reflections_fixed = 16*C[8]). A silent integer floor masks the error, leading to the incorrect result 75. Right: TRICE-30B interleaves textual reasoning with code execution, feeding each intermediate result back into the same Burnside framework, and correctly obtains 88.
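The silent integer floor in the left-hand trace is easy to reproduce. A minimal sketch (the counts mirror the caption, not the model's actual code): a Burnside count must divide the group order exactly, so a non-integral quotient is itself a red flag that integer division silently hides.

```python
# Burnside count from the caption: 2420 total fixed colorings over 32 symmetries.
total_fixed, group_order = 2420, 32

print(total_fixed / group_order)   # 75.625 -- non-integral, so the derivation upstream is flawed
print(total_fixed // group_order)  # 75     -- integer floor silently masks the inconsistency

# A late-stage verifier that only checks the floored value never sees the error;
# asserting divisibility first would have surfaced it.
assert total_fixed % group_order != 0  # the inconsistency an interleaved check can catch
```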

In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effectiveness of TIR supervised fine-tuning (SFT) hinges on the learnability of teacher trajectories, which should prioritize problems inherently suited to tool-augmented solutions; (ii) controlling the proportion of tool-use trajectories can mitigate catastrophic forgetting of text-only reasoning capacity; (iii) optimizing for pass@k and response length instead of training loss can maximize TIR SFT gains while preserving headroom for reinforcement learning (RL) exploration; and (iv) a stable RL with verifiable rewards (RLVR) stage, built on a suitable SFT initialization and explicit safeguards against mode collapse, provides a simple yet remarkably effective solution. When applied to Qwen3 thinking models at the 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance among open-source models on a wide range of benchmarks, e.g., 96.7% and 99.2% on AIME 2025 for 4B and 30B, respectively.

Method

A Full-Pipeline Recipe for TIR

Teaching a strong thinking model to reason with tools without sacrificing text-only reasoning requires a systematic recipe spanning data preparation, SFT, the transition from SFT to RL, and RL itself.

01

Data Engineering for TIR SFT

Curate learnable TIR supervision while preserving text-only reasoning ability.

Teacher Model Selection

Teacher selection should account for the learnability of tool-use patterns, not teacher accuracy alone.

Problem Selection

Tool-advantaged problems better elicit useful tool-use trajectories from the teacher.

Trajectory Composition

Mix text-only trajectories into the TIR set to preserve the student's native reasoning capability.

Overlong Filtering

Filtering overlong teacher trajectories improves downstream RL efficiency and reduces length imitation.
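The four curation steps above could be composed roughly as follows. This is an illustrative sketch only: the `Trajectory` schema, function names, and the `max_tokens`/`tir_ratio` thresholds are assumptions for exposition, not the paper's implementation.

```python
# Illustrative sketch of TIR SFT data curation (schema and thresholds are assumptions).
from dataclasses import dataclass
import random

@dataclass
class Trajectory:
    problem_id: str
    text: str
    uses_tool: bool        # trajectory interleaves code execution
    tool_advantaged: bool  # problem inherently benefits from tool use
    num_tokens: int

def curate(trajs, max_tokens=16384, tir_ratio=0.7, seed=0):
    # Overlong filtering: drop overlong teacher trajectories to curb length imitation.
    kept = [t for t in trajs if t.num_tokens <= max_tokens]
    # Problem selection: keep tool-use trajectories elicited on tool-advantaged problems.
    tir = [t for t in kept if t.uses_tool and t.tool_advantaged]
    text_only = [t for t in kept if not t.uses_tool]
    # Trajectory composition: cap the TIR share so text-only reasoning is preserved.
    n_text = max(1, int(len(tir) * (1 - tir_ratio) / tir_ratio))
    random.Random(seed).shuffle(text_only)
    return tir + text_only[:n_text]
```

The key design point is that filtering and mixing are applied jointly: a length cap alone does not protect text-only ability, and mixing alone does not protect RL efficiency.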

02

Stage Coordination: From SFT to RL

Principled coordination of the SFT and RL stages is needed to fully unlock TIR within the overall training pipeline.

TIR SFT Dynamics

During TIR SFT, what the student learns evolves across form, substance, and noise.

Checkpoint Selection

We identify RL-ready SFT checkpoints by tracking pass@k performance and rollout length.
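Tracking pass@k over a fixed problem set is cheap to implement with the standard unbiased estimator of Chen et al. (2021): sample n completions per problem, count the c correct ones, and average 1 - C(n-c, k)/C(n, k). (The paper does not specify its estimator; this is the common choice.)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled completions with c correct
    (Chen et al., 2021). Requires k <= n."""
    if n - c < k:
        return 1.0  # every size-k subset of samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(per_problem_counts, k):
    """Average pass@k across problems; counts is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem_counts) / len(per_problem_counts)
```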

Stable RL Training

Using on-policy rollouts together with rollout routing replay is simple but necessary.

Main Results

State-of-the-art Tool-Integrated Reasoning

| Model | Tool | AIME25 | HMMT25 | BeyondAIME | IMOAnswerBench | APEX25 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| <10B scale | | | | | | | |
| Qwen3-4B-Thinking-2507 | | 82.5 | 68.8 | 54.3 | 57.0 | 2.8 | 58.2 |
| Qwen3.5-4B | | 75.8 | 72.9 | 58.8 | 59.5 | 0.0 | 60.6 |
| Qwen3.5-9B | | 85.8 | 82.1 | 67.3 | 65.0 | 0.0 | 67.2 |
| TRICE-4B | | 79.2 | 71.3 | 58.5 | 61.0 | 5.6 | 61.7 |
| ASTER-4B† | | 90.0 | 77.1 | 61.7 | -- | -- | -- |
| AgentMath-8B† | | 84.7 | 71.3 | -- | -- | -- | -- |
| TRICE-4B | | 96.7 | 86.7 | 71.3 | 68.9 | 13.9 | 72.2 |
| ~30B scale | | | | | | | |
| Qwen3-30B-A3B-Instruct-2507 | | 67.5 | 55.8 | 51.3 | 52.3 | 2.8 | 52.5 |
| Qwen3-30B-A3B-Thinking-2507 | | 88.8 | 75.6 | 65.9 | 66.1 | 0.0 | 67.1 |
| Qwen3.5-35B-A3B | | 94.2 | 85.8 | 72.5 | 73.8 | 0.0 | 74.7 |
| TRICE-30B | | 89.2 | 81.7 | 71.0 | 72.3 | 0.0 | 72.8 |
| GPT-OSS-20B | | 86.7 | 83.3 | 63.0 | 60.0 | 8.3 | 63.4 |
| GLM-4.7-Flash | | 95.0 | 84.2 | 76.0 | 68.3 | 11.1 | 71.6 |
| Nemotron-3-Nano-30B-A3B | | 96.7 | 90.4 | 80.0 | 77.0 | 11.1 | 78.8 |
| AgentMath-30B-A3B† | | 86.4 | 73.8 | -- | -- | -- | -- |
| GLM-4.7-Flash w/ recipe | | 98.3 | 89.6 | 81.0 | 78.8 | 13.9 | 80.3 |
| TRICE-30B | | 99.2 | 92.5 | 82.5 | 80.3 | 16.7 | 81.9 |

Performance comparison on competition-level mathematical benchmarks. Results are accuracy (%) under each model's indicated inference setting. Models marked with † are reported by concurrent TIR systems under their original protocols; all other results follow our unified protocol. GLM-4.7-Flash w/ recipe shows that the pipeline is not specific to a single model family and can further improve models with native TIR ability.

Generalization

Math-trained TIR Generalizes Beyond Math

| Model | Tool | FrontierScience | GPQA-Diamond | LiveCodeBench |
| --- | --- | --- | --- | --- |
| Qwen3-4B-Thinking-2507 | | 27.5 | 64.4 | 51.4 |
| TRICE-4B | | 42.0 (+14.5) | 68.8 (+4.4) | 55.6 (+4.2) |
| Qwen3-30B-A3B-Thinking-2507 | | 44.9 | 71.2 | 61.5 |
| TRICE-30B | | 53.0 (+8.1) | 75.4 (+4.2) | 73.2 (+11.7) |

Analysis

What TIR Unlocks

| Model | Tool | AIME25 | HMMT25 | BeyondAIME | IMOAnswerBench | APEX25 |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-235B-A22B-Thinking | | 90.8 | 88.8 | 71.8 | 73.8 | 5.6 |
| DeepSeek-V3.2-Thinking | | 96.7 | 90.8 | 76.8 | 75.0 | 0.0 |
| TRICE-30B | | 99.2 | 92.5 | 82.5 | 80.3 | 16.7 |

TRICE with tools surpasses substantially larger text-only reasoning models, suggesting that parameter scaling alone cannot replicate what TIR unlocks. Beyond final arithmetic, code serves as a cognitive tool for discovery and search.

Code roles: empirical discovery, algorithmic search, computation offloading, and conjecture verification
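As a toy illustration of the conjecture-verification role (not taken from the paper's traces), a model can discharge a small combinatorial claim by exhaustive enumeration before trusting it in a derivation. Here a Burnside-style count is checked by brute force: 2-colorings of a 2x2 grid up to rotation should number (2^4 + 2 + 2^2 + 2)/4 = 6.

```python
from itertools import product

def rotations(cells):
    # Cells listed in cyclic order around the grid center, so a 90-degree
    # rotation is a cyclic shift of the tuple.
    a, b, c, d = cells
    return [(a, b, c, d), (d, a, b, c), (c, d, a, b), (b, c, d, a)]

# One canonical representative (the lexicographic minimum) per rotation orbit.
orbits = {min(rotations(cells)) for cells in product((0, 1), repeat=4)}
print(len(orbits))  # 6, matching the Burnside computation
```

The same pattern scales to the case study above: instead of a single late call that hard-codes a suspect formula, each symbolic count is cross-checked against a small enumeration before it is reused.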

BibTeX

@misc{cheng2026teachingthinkingmodelsreason,
      title={Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning}, 
      author={Qianjia Cheng and Yuchen Zhang and Zhilin Wang and Yuxin Zuo and Shunkai Zhang and Yuchen Fan and Yu Qiao and Bowen Zhou and Ning Ding and Yu Cheng and Yun Luo and Ganqu Cui},
      year={2026},
      eprint={2605.06326},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.06326}, 
}