Teaching Thinking Models to Reason with Tools

A Full-Pipeline Recipe for Tool-Integrated Reasoning

Qianjia Cheng1,2* Yuchen Zhang2,3* Zhilin Wang2,6 Yuxin Zuo5 Shunkai Zhang2,3 Yuchen Fan7 Yu Qiao2 Bowen Zhou2,5 Ning Ding5† Yu Cheng2,4† Yun Luo2† Ganqu Cui2†
1Zhejiang University 2Shanghai AI Laboratory 3Peking University 4The Chinese University of Hong Kong 5Tsinghua University 6University of Science and Technology of China 7Shanghai Jiao Tong University

Our models and dataset are coming soon.

AIME 2025 Leaderboard

(Leaderboard figure: AIME 2025 accuracy, tool-integrated vs. text-only reasoning, at the <10B and ~30B scales.)

Abstract

Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when strong thinking models make almost no actual tool calls.

Case study comparing late verification with interleaved tool use
The same problem, two policies for invoking the tool. Grey boxes denote text-only reasoning; In[k] cells indicate tool calls, and Out[k] cells represent tool responses. Left: Qwen3-30B-Thinking-2507 treats the Python sandbox as a final-pass verifier. After a text-only Burnside derivation yields the inconsistent value 2420/32 = 75.625, it makes a single late-stage call that hard-codes the flawed assumptions (e.g., reflections_fixed = 16*C[8]). A silent integer floor masks the error, leading to the incorrect result 75. Right: TRICE-30B interleaves textual reasoning with code execution, feeding each intermediate result back into the same Burnside framework, and correctly obtains 88.
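The silent integer floor in the left-hand trace is easy to reproduce. A minimal sketch (the counts mirror the caption, not the model's actual code): a Burnside count must divide the group order exactly, so a non-integral quotient is itself a red flag that integer division silently hides.

```python
# Burnside count from the caption: 2420 total fixed colorings over 32 symmetries.
total_fixed, group_order = 2420, 32

print(total_fixed / group_order)   # 75.625 -- non-integral, so the derivation upstream is flawed
print(total_fixed // group_order)  # 75     -- integer floor silently masks the inconsistency

# A late-stage verifier that only checks the floored value never sees the error;
# asserting divisibility first would have surfaced it.
assert total_fixed % group_order != 0  # the inconsistency an interleaved check can catch
```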

In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effectiveness of TIR supervised fine-tuning (SFT) hinges on the learnability of teacher trajectories, which should prioritize problems inherently suited to tool-augmented solutions; (ii) controlling the proportion of tool-use trajectories can mitigate catastrophic forgetting of text-only reasoning capacity; (iii) optimizing for pass@k and response length instead of training loss can maximize TIR SFT gains while preserving headroom for reinforcement learning (RL) exploration; and (iv) a stable RL with verifiable rewards (RLVR) stage, built on a suitable SFT initialization and explicit safeguards against mode collapse, provides a simple yet remarkably effective solution. When applied to Qwen3 thinking models at the 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance among open-source models on a wide range of benchmarks, e.g., 96.7% and 99.2% on AIME 2025 for 4B and 30B, respectively.

Method

A Full-Pipeline Recipe for TIR

Teaching a strong thinking model to reason with tools without sacrificing text-only reasoning requires a systematic recipe spanning data preparation, SFT, the transition from SFT to RL, and RL itself.

01

Data Engineering for TIR SFT

Curate learnable TIR supervision while preserving text-only reasoning ability.

Teacher Model Selection

Teacher selection should account for the learnability of tool-use patterns, not teacher accuracy alone.

Problem Selection

Tool-advantaged problems better elicit useful tool-use trajectories from the teacher.

Trajectory Composition

Mix text-only trajectories into the TIR set to preserve the student's native reasoning capability.

Overlong Filtering

Filtering overlong teacher trajectories improves downstream RL efficiency and reduces length imitation.
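The four curation steps above could be composed roughly as follows. This is an illustrative sketch only: the `Trajectory` schema, function names, and the `max_tokens`/`tir_ratio` thresholds are assumptions for exposition, not the paper's implementation.

```python
# Illustrative sketch of TIR SFT data curation (schema and thresholds are assumptions).
from dataclasses import dataclass
import random

@dataclass
class Trajectory:
    problem_id: str
    text: str
    uses_tool: bool        # trajectory interleaves code execution
    tool_advantaged: bool  # problem inherently benefits from tool use
    num_tokens: int

def curate(trajs, max_tokens=16384, tir_ratio=0.7, seed=0):
    # Overlong filtering: drop overlong teacher trajectories to curb length imitation.
    kept = [t for t in trajs if t.num_tokens <= max_tokens]
    # Problem selection: keep tool-use trajectories elicited on tool-advantaged problems.
    tir = [t for t in kept if t.uses_tool and t.tool_advantaged]
    text_only = [t for t in kept if not t.uses_tool]
    # Trajectory composition: cap the TIR share so text-only reasoning is preserved.
    n_text = max(1, int(len(tir) * (1 - tir_ratio) / tir_ratio))
    random.Random(seed).shuffle(text_only)
    return tir + text_only[:n_text]
```

The key design point is that filtering and mixing are applied jointly: a length cap alone does not protect text-only ability, and mixing alone does not protect RL efficiency.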

02

Stage Coordination: From SFT to RL

Principled coordination of the SFT and RL stages is needed to fully unlock TIR within the overall training pipeline.

TIR SFT Dynamics

During TIR SFT, what the student learns evolves across form, substance, and noise.

Checkpoint Selection

We identify RL-ready SFT checkpoints by tracking pass@k performance and rollout length.
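Tracking pass@k over a fixed problem set is cheap to implement with the standard unbiased estimator of Chen et al. (2021): sample n completions per problem, count the c correct ones, and average 1 - C(n-c, k)/C(n, k). (The paper does not specify its estimator; this is the common choice.)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled completions with c correct
    (Chen et al., 2021). Requires k <= n."""
    if n - c < k:
        return 1.0  # every size-k subset of samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(per_problem_counts, k):
    """Average pass@k across problems; counts is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem_counts) / len(per_problem_counts)
```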

Stable RL Training

Using on-policy rollouts together with rollout routing replay is simple but necessary.

Main Results

State-of-the-art Tool-Integrated Reasoning

| Model | Tool | AIME25 | HMMT25 | BeyondAIME | IMOAnswerBench | APEX25 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| <10B scale | | | | | | | |
| Qwen3-4B-Thinking-2507 | | 82.5 | 68.8 | 54.3 | 57.0 | 2.8 | 58.2 |
| Qwen3.5-4B | | 75.8 | 72.9 | 58.8 | 59.5 | 0.0 | 60.6 |
| Qwen3.5-9B | | 85.8 | 82.1 | 67.3 | 65.0 | 0.0 | 67.2 |
| TRICE-4B | | 79.2 | 71.3 | 58.5 | 61.0 | 5.6 | 61.7 |
| ASTER-4B† | | 90.0 | 77.1 | 61.7 | -- | -- | -- |
| AgentMath-8B† | | 84.7 | 71.3 | -- | -- | -- | -- |
| TRICE-4B | | 96.7 | 86.7 | 71.3 | 68.9 | 13.9 | 72.2 |
| ~30B scale | | | | | | | |
| Qwen3-30B-A3B-Instruct-2507 | | 67.5 | 55.8 | 51.3 | 52.3 | 2.8 | 52.5 |
| Qwen3-30B-A3B-Thinking-2507 | | 88.8 | 75.6 | 65.9 | 66.1 | 0.0 | 67.1 |
| Qwen3.5-35B-A3B | | 94.2 | 85.8 | 72.5 | 73.8 | 0.0 | 74.7 |
| TRICE-30B | | 89.2 | 81.7 | 71.0 | 72.3 | 0.0 | 72.8 |
| GPT-OSS-20B | | 86.7 | 83.3 | 63.0 | 60.0 | 8.3 | 63.4 |
| GLM-4.7-Flash | | 95.0 | 84.2 | 76.0 | 68.3 | 11.1 | 71.6 |
| Nemotron-3-Nano-30B-A3B | | 96.7 | 90.4 | 80.0 | 77.0 | 11.1 | 78.8 |
| AgentMath-30B-A3B† | | 86.4 | 73.8 | -- | -- | -- | -- |
| GLM-4.7-Flash w/ recipe | | 98.3 | 89.6 | 81.0 | 78.8 | 13.9 | 80.3 |
| TRICE-30B | | 99.2 | 92.5 | 82.5 | 80.3 | 16.7 | 81.9 |

Performance comparison on competition-level mathematical benchmarks. Results are accuracy (%) under each model's indicated inference setting. Models marked with † are reported by concurrent TIR systems under their original protocols; all other results follow our unified protocol. GLM-4.7-Flash w/ recipe shows that the pipeline is not specific to a single model family and can further improve models with native TIR ability.

Generalization

Math-trained TIR Generalizes Beyond Math

| Model | Tool | FrontierScience | GPQA-Diamond | LiveCodeBench |
| --- | --- | --- | --- | --- |
| Qwen3-4B-Thinking-2507 | | 27.5 | 64.4 | 51.4 |
| TRICE-4B | | 42.0 (+14.5) | 68.8 (+4.4) | 55.6 (+4.2) |
| Qwen3-30B-A3B-Thinking-2507 | | 44.9 | 71.2 | 61.5 |
| TRICE-30B | | 53.0 (+8.1) | 75.4 (+4.2) | 73.2 (+11.7) |

Analysis

What TIR Unlocks

| Model | Tool | AIME25 | HMMT25 | BeyondAIME | IMOAnswerBench | APEX25 |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-235B-A22B-Thinking | | 90.8 | 88.8 | 71.8 | 73.8 | 5.6 |
| DeepSeek-V3.2-Thinking | | 96.7 | 90.8 | 76.8 | 75.0 | 0.0 |
| TRICE-30B | | 99.2 | 92.5 | 82.5 | 80.3 | 16.7 |

TRICE with tools surpasses substantially larger text-only reasoning models, suggesting that parameter scaling alone cannot replicate what TIR unlocks. Beyond final arithmetic, code serves as a cognitive tool for discovery and search.

Code roles: empirical discovery, algorithmic search, computation offloading, and conjecture verification
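As a toy illustration of the conjecture-verification role (not taken from the paper's traces), a model can discharge a small combinatorial claim by exhaustive enumeration before trusting it in a derivation. Here a Burnside-style count is checked by brute force: 2-colorings of a 2x2 grid up to rotation should number (2^4 + 2 + 2^2 + 2)/4 = 6.

```python
from itertools import product

def rotations(cells):
    # Cells listed in cyclic order around the grid center, so a 90-degree
    # rotation is a cyclic shift of the tuple.
    a, b, c, d = cells
    return [(a, b, c, d), (d, a, b, c), (c, d, a, b), (b, c, d, a)]

# One canonical representative (the lexicographic minimum) per rotation orbit.
orbits = {min(rotations(cells)) for cells in product((0, 1), repeat=4)}
print(len(orbits))  # 6, matching the Burnside computation
```

The same pattern scales to the case study above: instead of a single late call that hard-codes a suspect formula, each symbolic count is cross-checked against a small enumeration before it is reused.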

BibTeX

@misc{cheng2026teachingthinkingmodelsreason,
      title={Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning}, 
      author={Qianjia Cheng and Yuchen Zhang and Zhilin Wang and Yuxin Zuo and Shunkai Zhang and Yuchen Fan and Yu Qiao and Bowen Zhou and Ning Ding and Yu Cheng and Yun Luo and Ganqu Cui},
      year={2026},
      eprint={2605.06326},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.06326}, 
}