How Much Can Code Models Benefit From Tool Use?
The aim of this post is to answer one of the open questions from yesterday’s project: why deal with the added complexity of RL over SFT? Taking a skeptical viewpoint, can we motivate RL for code models?
Another way to pose this question: given access to open-source models, what is the most inference-cost-efficient way to hit a given level on HumanEval? What if we theoretically had to train these models from scratch? Would tool-integrated reasoning (TIR) be the best use of compute in any of these cases? How valuable is the signal the model gets from the traces?
Our RL training costs are negligible
Start with a ~1.5B-parameter pretrained model: Qwen2.5-Coder-1.5B-Instruct
Training data: open-r1/verifiable-coding-problems-python
Test data: HumanEval
Concretely: what do training cost, inference cost, and HumanEval performance look like for each of these model versions?

- baseline pretrained → HumanEval: 43.9, training cost: 5e22 FLOPs, inference cost per question: ~3e12 FLOPs
- baseline + SFT → HumanEval: 70.7, training cost: <5.1e22 FLOPs, inference cost: ~3e12 FLOPs, matching the baseline (note: an import helper, prepending the function signature, and temperature settings take this from 62% to 70%)
- baseline + SFT + regenerate until the output compiles → doesn’t work: raising the temperature makes generations worse, even though the resamples all compile and most generations already compiled
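The regenerate-until-it-compiles variant can be sketched with a syntax check and a resampling loop. A minimal sketch, assuming `generate` is a hypothetical sampling call (e.g. a vLLM request, possibly at a higher temperature per retry) and the retry count is arbitrary:

```python
import ast

def compiles(code: str) -> bool:
    """Check whether a generated solution at least parses as Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def sample_until_compiles(generate, prompt: str, max_tries: int = 4) -> str:
    """Resample until the completion parses; fall back to the last
    attempt if none of them do."""
    completion = ""
    for _ in range(max_tries):
        completion = generate(prompt)
        if compiles(completion):
            break
    return completion
```

Note this only filters syntax errors, which is consistent with the observation above: when most generations already compile, the extra resampling buys little, and the temperature needed to diversify resamples can hurt quality.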
Aside: infrastructure back and forth
What is a reasonable number of tokens/sec to expect from inference? ~3k tok/sec for a 7B model on an H100, and maybe ~8k tok/sec for a 1.5B model. If we get >40k tok/sec across 7 GPUs, that’s reasonable. We don’t need perfect inference, just something reasonable (not multiple orders of magnitude worse than optimal).

What is the go-to way to train multi-turn GRPO? There doesn’t seem to be a clear standard; just get something working.
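To sanity-check the per-question inference cost quoted earlier: using the standard back-of-envelope of ~2N FLOPs per generated token for an N-parameter dense model, the ~3e12 figure corresponds to roughly 1k generated tokens per question. The token count here is my assumption, backed out from that figure:

```python
# Back-of-envelope: decoding one token with an N-parameter dense model
# costs roughly 2 * N FLOPs (attention and KV-cache costs ignored).
N_PARAMS = 1.5e9             # Qwen2.5-Coder-1.5B
TOKENS_PER_QUESTION = 1_000  # assumed average completion length

flops_per_token = 2 * N_PARAMS
flops_per_question = flops_per_token * TOKENS_PER_QUESTION
print(f"{flops_per_question:.0e} FLOPs/question")  # 3e+12 FLOPs/question
```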
RL on contest questions directly
Tool-integrated reasoning with RL (training on top of the instruct model on my training dataset):
- get GRPO + vLLM working on the scramble task to debug the GRPO training loop with Modal
- switch to the training dataset, with filtering
- get GRPO + vLLM working on the HumanEval task
- results → does the model learn to reason for longer?
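The loop above hinges on a verifiable reward plus GRPO’s group-relative advantage. A minimal sketch, where the scramble-task details (exact-match on the final line, a `target` field) are my assumptions, since the post doesn’t spell them out:

```python
from statistics import mean, stdev

def scramble_reward(completion: str, target: str) -> float:
    """Binary verifiable reward: 1.0 iff the final line of the
    completion exactly matches the unscrambled target string."""
    lines = completion.strip().splitlines()
    return 1.0 if lines and lines[-1].strip() == target else 0.0

def grpo_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward by the mean/std of the group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Note the degenerate case: a group where every rollout succeeds (or every rollout fails) yields all-zero advantages and hence no gradient signal, which is one reason the dataset filtering step above matters.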
RL with tool integrated reasoning
With code execution, I definitely expect an improvement.
ToRL style
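A ToRL-style rollout alternates generation with code execution: each emitted code block is run and its output fed back into the context before generation continues. A minimal sketch, where `generate` is a hypothetical model call and the block-delimiting format (```` ```python ````/```` ```output ````) is an assumption:

```python
import re
import subprocess
import sys

CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_tool(code: str, timeout: int = 5) -> str:
    """Execute a model-emitted code block in a subprocess, capture stdout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout if proc.returncode == 0 else proc.stderr
    except subprocess.TimeoutExpired:
        return "TIMEOUT"

def tir_rollout(generate, prompt: str, max_turns: int = 4) -> str:
    """Multi-turn TIR rollout: run the latest emitted code block, append
    its output to the context, and continue generating until the model
    stops emitting code (or the turn budget runs out)."""
    context = prompt
    for _ in range(max_turns):
        completion = generate(context)
        context += completion
        blocks = CODE_BLOCK.findall(completion)
        if not blocks:
            break  # final answer: no tool call in this turn
        context += f"\n```output\n{run_tool(blocks[-1])}```\n"
    return context
```

During RL training only the model-generated spans would be credited by the policy gradient; the interleaved tool outputs are environment tokens and should be masked out of the loss.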
…