CodeQ: Teaching an LLM to Debug Code with MCTS and DPO
How a single architectural decision — switching from line-level edits to full rewrites — took the apparent fix rate from 0.38% to 81.3% on DebugBench. Plus: the 81% data duplication discovery, a bf16 NaN fix, and what DPO did and didn't transfer.
1. Introduction
Automated code debugging is hard. Not because current LLMs can't write code — they can. The problem is that debugging requires a search process: generate a fix, test it, observe the result, revise, repeat. A single forward pass through a language model doesn't do that. You need exploration.
CodeQ combines two ideas: Monte Carlo Tree Search (MCTS) to systematically explore fix strategies, and Direct Preference Optimization (DPO) to let the model learn from its own exploration data. The result is a self-improving debugging agent that gets better over time without any human labeling.
The core inspiration is Agent Q (Putta et al., 2024), which applied MCTS + DPO to web navigation. We adapt the same loop to code: explore bugs with MCTS, extract preference pairs from the trajectories, train with DPO, repeat.
One-line result: 81.3% fix rate on DebugBench (100/123 unique bugs), up from an apparent 0.38% baseline before the critical architecture refactor. After DPO Round 2, MCTS mode reaches 84%.
2. Architecture Overview
The system runs across two machines connected via SSH and scp:
- Machine A (inference): Qwen2.5-Coder-7B-Instruct in 4-bit quantization via bitsandbytes (~4–6 GB VRAM). Handles all MCTS rollouts.
- Machine B (training): Full bf16 model with LoRA adapters (~30–35 GB VRAM on an H100 94GB). Runs DPO training.
LoRA adapters are transferred from Machine B to Machine A via scp after each training round, enabling a pipelined workflow where exploration and training can overlap across rounds.
MCTS Engine
At each node in the tree, the model generates K=4 candidate fixes at temperature 0.8. A critic (same model, temperature 0.2) scores each candidate on correctness, clarity, and plausibility. UCB1 selects which node to expand next, balancing exploitation of promising fixes with exploration of alternatives.
Each candidate fix is executed in a Docker sandbox (no network, 512 MB RAM, 30-second timeout). Pass/fail from the sandbox is the ground truth signal.
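The selection rule is standard UCB1. As a minimal sketch (illustrative, not the project's actual implementation; the `visits`/`total_value` field names are assumptions), each child's score is its mean value plus an exploration bonus, and unvisited children are expanded first:

```python
import math

def ucb1_select(children, c=1.41):
    """Pick the child node with the highest UCB1 score.

    Each child is a dict with 'visits' and 'total_value'; unvisited
    children get an infinite bonus, so they are always tried first.
    """
    parent_visits = sum(ch["visits"] for ch in children)

    def score(ch):
        if ch["visits"] == 0:
            return float("inf")
        mean = ch["total_value"] / ch["visits"]
        return mean + c * math.sqrt(math.log(parent_visits) / ch["visits"])

    return max(children, key=score)

children = [
    {"visits": 10, "total_value": 7.0},  # well-explored fix, mean 0.70
    {"visits": 2,  "total_value": 1.8},  # barely explored, mean 0.90
]
best = ucb1_select(children)  # the under-explored child wins here
```

The exploration constant `c` trades off re-testing a promising fix against trying an alternative strategy; `c=1.41` (√2) is the textbook default, not a tuned value from this project.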
DPO Training Loop
Preference pairs are extracted from MCTS trajectories: winning fixes (those that passed tests) are the "chosen" completions; failed attempts are "rejected." Blended Q-values (α=0.5 × MCTS value + AI critic score) rank them.
We use off-policy DPO with pre-computed reference log-probabilities to decouple the preference data collection from training. LoRA config: rank=32, alpha=64, all attention + MLP layers, lr=5e-6, 2 epochs, β=0.1.
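With reference log-probabilities precomputed, the DPO loss per preference pair reduces to a closed form: the negative log-sigmoid of β times the policy's log-ratio advantage on chosen vs. rejected. A minimal sketch of that arithmetic (illustrative, not the TRL implementation):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * margin), where the
    margin is how much more the policy prefers the chosen completion
    over the rejected one, relative to the frozen reference model."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already leans toward the chosen fix more than the reference does:
loss = dpo_loss(-10.0, -20.0, -12.0, -18.0, beta=0.1)  # margin = +4
```

At zero margin the loss is exactly ln 2; training pushes the margin positive, which is why precomputing the reference side once (rather than per step) is both correct and much cheaper.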
3. The Critical Refactor: Line Edits → Full Rewrites
The original CodeQ design used a structured action space inspired by SWE-bench-style agents: PLAN, THOUGHT, and CODE_ACTION tokens, with line-level edit operations — EDIT (replace lines), INSERT (add lines), DELETE (remove lines), and RUN_TESTS.
This seemed principled. In practice, it was a disaster.
The model frequently generated malformed structured outputs: mismatched line numbers, incorrect action syntax, truncated edits. The parser failed silently on many of these, counting them as unsuccessful fixes. Our apparent fix rate was 0.38%. Something was clearly wrong.
Discovery: The 0.38% apparent rate was a parsing artifact. When we added explicit fallback detection, the true base rate on correctly-parsed outputs was ~36%. But the parsing overhead was still killing throughput — MCTS with line edits ran ~14× slower than the rewrite approach.
The fix was to switch the entire action space to full_rewrite: the model outputs a complete replacement for the buggy function. No structured parsing, no line numbers, no edit operations. Just the fixed code, wrapped in a simple code fence.
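Extracting the fix then needs nothing more than pulling the first fenced block out of the completion. A hypothetical extractor along those lines (the exact regex is an assumption, not the project's code):

```python
import re

# First ```python (or bare ```) fence in the completion, non-greedy body.
FENCE_RE = re.compile(r"```(?:python)?\s*\n(.*?)```", re.DOTALL)

def extract_rewrite(completion: str):
    """Return the code inside the first fenced block, or None if the
    model produced no fence at all (counted as a failed attempt)."""
    m = FENCE_RE.search(completion)
    return m.group(1).rstrip() if m else None

out = extract_rewrite(
    "Here you go:\n```python\ndef add(a, b):\n    return a + b\n```"
)
```

Compare this single regex to a parser for line-numbered EDIT/INSERT/DELETE operations: there is essentially nothing left to get wrong, which is the whole point of the refactor.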
Results after the refactor:
- MCTS baseline (full_rewrite, no DPO): 81.3% fix rate (100/123)
- Full_rewrite baseline (single pass, no MCTS): 43.9% (54/123)
- Speed: ~14× faster per rollout than the line-edit approach
The lesson is not that full rewrites are always better than line edits. The lesson is that action space design matters more than model quality. A simpler, more robust action space that the model can reliably execute will outperform a more expressive action space that introduces parse failures.
The Rewrite Prompt
The generation prompt is deliberately minimal:
````text
You are a debugging assistant. Below is a Python function that contains a bug.

Your task: output a complete, corrected version of the function.
Output ONLY the corrected function code, wrapped in ```python ... ```.
Do not include any explanation.

Buggy function:
{buggy_code}

Test cases that must pass:
{test_cases}
````
The critic prompt is similarly direct, asking for a score from 0–1 and a brief rationale.
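The critic's free-text reply still has to be turned into a number. A hedged sketch of that parsing step (the reply format and fallback behavior here are assumptions, not the project's exact code):

```python
import re

def parse_critic_score(reply: str, default: float = 0.0) -> float:
    """Pull the first number out of the critic's reply and clamp it
    to [0, 1]; fall back to `default` if no number is found."""
    m = re.search(r"\d+(?:\.\d+)?", reply)
    if not m:
        return default
    return max(0.0, min(1.0, float(m.group(0))))

score = parse_critic_score("Score: 0.85. The fix handles the empty-list case.")
```

Clamping and a conservative default matter here for the same reason the full-rewrite refactor did: a critic reply that fails to parse should degrade gracefully, not silently poison the Q-values.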
4. Data Quality: The 81% Duplication Discovery
During preprocessing of DebugBench, we found that approximately 81% of the dataset entries were duplicates. The raw download contained ~650 problem instances; after deduplication, only ~123 unique problems remained.
This matters for two reasons:
- Inflated metrics: any model that memorized the training distribution would appear to perform much better on the duplicated set than on the unique problems.
- Evaluation integrity: Fix rates reported on the full (duplicated) set are not comparable to fix rates on the deduplicated set.
All CodeQ results are reported on the deduplicated set of 123 unique problems. When comparing to other systems, verify whether they used the deduplicated or raw dataset.
Rule: Always audit your benchmark before reporting results. The 81% duplication rate was not obvious from the dataset documentation — it required computing pairwise similarity across problem statements and test cases.
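Such an audit can start much cheaper than full pairwise similarity: normalize each problem and hash it, catching exact and trivially-reformatted duplicates first, and reserve fuzzy matching for the near-misses. A simplified exact-duplicate pass (the `statement`/`tests` field names are illustrative):

```python
import hashlib

def normalize(text: str) -> str:
    """Collapse whitespace and case so reformatted copies collide."""
    return " ".join(text.lower().split())

def dedup(problems):
    """Keep the first instance of each (statement, tests) pair."""
    seen, unique = set(), []
    for p in problems:
        key = hashlib.sha256(
            normalize(p["statement"] + p["tests"]).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

problems = [
    {"statement": "Fix the off-by-one bug.", "tests": "assert f(1) == 2"},
    {"statement": "Fix the  off-by-one bug.", "tests": "assert f(1) == 2"},  # dup
    {"statement": "Fix the null check.", "tests": "assert g(None) is None"},
]
unique = dedup(problems)
```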
5. DPO Training: What Worked and What Broke
Round 1 DPO
Round 1 DPO ran without incident. We extracted ~400 preference pairs from the MCTS trajectories, trained for 2 epochs, and saw modest improvement on held-out problems.
Round 2: The bf16 NaN Collapse
Round 2 DPO training collapsed immediately: all losses went to NaN within the first 100 steps. Gradient norms were normal; learning rate was the same; data distribution was similar.
The culprit: bf16 numerical instability in DPO's log-probability calculation. bf16 keeps fp32's exponent range but stores only 7 mantissa bits, so when reference and policy log-probabilities diverge significantly after Round 1 fine-tuning, the log-softmax over large logits loses enough precision in bf16 to produce NaN. (The often-cited ~65504 overflow ceiling applies to fp16, not bf16; the bf16 failure mode is precision, not range.)
The fix was a custom trainer subclass that upcasts logits to fp32 before computing log-probabilities:
```python
class Fp32LogitsDPOTrainer(DPOTrainer):
    def concatenated_forward(self, model, batch):
        outputs = super().concatenated_forward(model, batch)
        # Upcast logits to fp32 before the log-prob computation to avoid bf16 NaNs
        if outputs.get("logits") is not None:
            outputs["logits"] = outputs["logits"].float()
        return outputs
```
A second issue: TRL 1.0.0 broke the precompute_ref_log_probs flag, causing reference log-probs to be recomputed on every step instead of once. This made training ~3× slower and produced incorrect DPO gradients. Fix: pin TRL to 0.29.1.
```text
# requirements.txt
trl==0.29.1  # 1.0.0 broke precompute_ref_log_probs
```
Preference Extraction
Preference pairs are extracted by comparing trajectories within the same MCTS tree. For each bug, we identify the best fix (highest Q-value among fixes that passed tests) and the best non-fix (highest Q-value among attempts that failed tests). The Q-value blending:
```python
def compute_q_value(mcts_value: float, critic_score: float, alpha: float = 0.5) -> float:
    return alpha * mcts_value + (1 - alpha) * critic_score
```
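Given a tree's attempts with their outcomes, pair extraction reduces to two argmaxes over the blended Q-values. An illustrative sketch, restating the blending for self-containment (the `attempts` record fields are assumptions):

```python
def compute_q_value(mcts_value, critic_score, alpha=0.5):
    return alpha * mcts_value + (1 - alpha) * critic_score

def extract_pair(attempts):
    """Return (chosen, rejected): the highest-Q passing fix vs. the
    highest-Q failing attempt, or None if either side is empty."""
    scored = [(compute_q_value(a["mcts_value"], a["critic_score"]), a)
              for a in attempts]
    passed = [s for s in scored if s[1]["passed"]]
    failed = [s for s in scored if not s[1]["passed"]]
    if not passed or not failed:
        return None
    best = lambda side: max(side, key=lambda s: s[0])[1]
    return best(passed), best(failed)

attempts = [
    {"code": "fix_a", "mcts_value": 0.9, "critic_score": 0.8, "passed": True},
    {"code": "fix_b", "mcts_value": 0.7, "critic_score": 0.9, "passed": True},
    {"code": "fix_c", "mcts_value": 0.4, "critic_score": 0.6, "passed": False},
]
chosen, rejected = extract_pair(attempts)
```

Note that picking the *highest*-Q failure as the rejected completion makes the pair a hard negative: the contrast is against a plausible-looking attempt, not an obviously broken one.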
6. Results
| Configuration | Fix Rate |
|---|---|
| Single-pass full_rewrite (no MCTS) | 43.9% (54/123) |
| MCTS + full_rewrite (base model) | 81.3% (100/123) |
| MCTS + full_rewrite (+ DPO Round 2) | 84.0% (42/50) |
| Single-pass full_rewrite (+ DPO Round 2) | 43.9% — no transfer |
| Pre-refactor apparent baseline (line edits) | ~0.38% (parse failures) |
DPO Transfer: The Honest Finding
DPO Round 2 improves MCTS mode from 81.3% to 84%. This is the positive result.
DPO Round 2 does not transfer to single-pass full_rewrite mode. The fix rate stays at 43.9% — identical to the base model. Zero transfer.
The explanation is straightforward: the DPO training data consists entirely of MCTS trajectories. The model learns to prefer good fixes when given MCTS-style multi-step reasoning context. It does not learn to generate better single-pass rewrites, because that behavior was never in the training signal.
This is not a failure. It's an informative result about how DPO specialization works in self-improvement loops. The model improves on the task distribution it trained on, not on adjacent tasks.
7. What I'd Do Differently
Four things worth trying in a Round 3+:
- Train DPO on full_rewrite trajectories (not just MCTS trajectories) to test whether single-pass transfer is possible.
- Run Round 3 to measure diminishing returns. The 81.3% → 84% improvement after two DPO rounds is modest. It's unclear whether Round 3 would hit a ceiling or keep climbing.
- Cross-benchmark generalization. All results are on DebugBench. SWE-bench Lite would be a stronger test of whether MCTS-trained fix strategies generalize.
- Better preference extraction. The current blended Q-value (α=0.5 MCTS + critic) is a heuristic. A learned value function trained on execution outcomes would be more principled.
The honest finding about no DPO transfer is as interesting as the 81.3% fix rate. It tells you something real about the limits of self-improvement via trajectory-based preference learning — and it points directly at what to fix next.
Links
- GitHub: github.com/tathadn/codeq
- Inspired by: Agent Q — Putta et al., 2024