CodeQ: Teaching an LLM to Debug Code with MCTS and DPO

1. Introduction

Automated code debugging is hard. Not because current LLMs can't write code — they can. The problem is that debugging requires a search process: generate a fix, test it, observe the result, revise, repeat. A single forward pass through a language model doesn't do that. You need exploration.

CodeQ combines two ideas: Monte Carlo Tree Search (MCTS) to systematically explore fix strategies, and Direct Preference Optimization (DPO) to let the model learn from its own exploration data. The result is a self-improving debugging agent that gets better over time without any human labeling.

The core inspiration is Agent Q (Putta et al., 2024), which applied MCTS + DPO to web navigation. We adapt the same loop to code: explore bugs with MCTS, extract preference pairs from the trajectories, train with DPO, repeat. We evaluate on DebugBench, a benchmark of Python debugging problems spanning syntax errors, logic bugs, reference errors, and multi-fault scenarios. The benchmark provides buggy functions paired with test suites — a clean signal for automated evaluation without human judgment.

One-line result: 81.3% fix rate on 123 unique DebugBench problems (after discovering and removing 81% data duplication), with DPO pushing MCTS mode to 84%, a gain concentrated on hard problems. Full code at github.com/tathadn/codeq.

2. Architecture Overview

The system runs across two machines connected via SSH and scp:

Machine A (inference): Qwen2.5-Coder-7B-Instruct in 4-bit quantization via bitsandbytes (~4–6 GB VRAM). Handles all MCTS rollouts.
Machine B (training): Full bf16 model with LoRA adapters (~30–35 GB VRAM on an H100 94GB). Runs DPO training.

LoRA adapters are transferred from Machine B to Machine A via scp after each training round, enabling a pipelined workflow where exploration and training can overlap across rounds.

Self-Improvement Loop

The full training cycle runs in five stages:

MCTS Search: Machine A runs MCTS rollouts against DebugBench problems, generating candidate fixes and testing them in Docker sandboxes.
Trajectory Collection: Winning fixes (passed tests) and losing attempts (failed tests) are logged with their blended Q-values.
DPO Training: Machine B extracts preference pairs from trajectories and trains the model with LoRA-based DPO.
Policy Update: Trained LoRA adapters are transferred back to Machine A via scp and merged into the inference model.
Evaluate and Repeat: The updated model is evaluated on DebugBench. If performance improves, the cycle repeats with the updated policy generating new, potentially better trajectories for the next round of DPO training.
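The five stages above can be sketched as a small driver loop. This is a minimal illustration, not the repository's code: `search`, `extract_pairs`, `train`, and `evaluate` are hypothetical callables standing in for the MCTS run, preference extraction, DPO training plus adapter transfer, and the DebugBench evaluation. The stopping rule (stop as soon as a round fails to improve) is one reading of "if performance improves, the cycle repeats".

```python
def self_improvement_loop(policy, search, extract_pairs, train, evaluate, max_rounds=3):
    """Run search -> pairs -> train -> evaluate until a round stops improving.

    All four callables are hypothetical stand-ins supplied by the caller.
    Returns the best policy found and the list of evaluation scores.
    """
    best_score = evaluate(policy)
    history = [best_score]
    for _ in range(max_rounds):
        trajectories = search(policy)         # stages 1-2: MCTS rollouts and logging
        pairs = extract_pairs(trajectories)   # stage 3a: build preference pairs
        candidate = train(policy, pairs)      # stages 3b-4: DPO training and policy update
        score = evaluate(candidate)           # stage 5: evaluate the updated policy
        history.append(score)
        if score <= best_score:
            break                             # no improvement: keep the previous policy
        policy, best_score = candidate, score
    return policy, history
```

Keeping the previous policy when a round regresses is a design choice of this sketch; the post does not say what happens to a round that fails to improve.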

MCTS Engine

At each node in the tree, the model generates K=4 candidate fixes at temperature 0.8. A critic (the same model at temperature 0.2) scores each candidate on correctness, clarity, and plausibility. The lower temperature makes the critic more deterministic and consistent in its evaluations, while the higher generation temperature encourages diverse fix strategies.

UCB1 selects which node to expand next, balancing exploitation of high Q-value fixes with exploration of low-visit-count alternatives. The formula trades off known-good nodes against under-explored ones — without this balance, the search would collapse into greedy single-path behavior and lose the diversity that makes MCTS valuable for debugging.
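For reference, here is a minimal sketch of UCB1 selection over the children of a node. The exploration constant `c = sqrt(2)` is the textbook default and an assumption here, since the post does not state the value CodeQ uses; the dictionary keys (`q`, `visits`) are illustrative names.

```python
import math


def ucb1(q_value, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: mean value plus an exploration bonus that shrinks with visits."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    return q_value + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits, c=math.sqrt(2)):
    """Pick the child with the highest UCB1 score."""
    return max(children, key=lambda ch: ucb1(ch["q"], ch["visits"], parent_visits, c))


children = [
    {"name": "well_explored", "q": 0.9, "visits": 10},
    {"name": "barely_tried", "q": 0.5, "visits": 1},
]
picked = select_child(children, parent_visits=11)
```

In this toy example the low-visit child wins despite its lower Q-value, which is exactly the exploration pressure described above.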

Each candidate fix is executed in a Docker sandbox with strict isolation: no network access, 512 MB RAM limit, and a 30-second timeout. This prevents generated code from making network calls, consuming unbounded memory, or running infinite loops. Pass/fail from the sandbox — determined by running the provided test suite — is the ground truth signal that feeds back into the MCTS tree as the reward.
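A sandbox call along these lines would enforce the three limits. This is a sketch under assumptions: the image name, the mount layout, and the `test_fix.py` entry point are hypothetical, and only standard `docker run` flags (`--rm`, `--network none`, `--memory`, `-v`, `-w`) are used.

```python
import subprocess


def build_sandbox_command(workdir, image="python:3.11-slim", memory="512m"):
    """Build a `docker run` command with no network and a hard memory cap.

    `workdir` is assumed to hold the candidate fix plus a test_fix.py script
    that exits non-zero when any test fails (a hypothetical layout).
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",          # no network access
        "--memory", memory,           # RAM limit
        "-v", f"{workdir}:/work:ro",  # mount the candidate read-only
        "-w", "/work",
        image,
        "python", "test_fix.py",
    ]


def run_in_sandbox(workdir, timeout_s=30.0):
    """Return True if the tests pass within the timeout, False otherwise."""
    try:
        result = subprocess.run(
            build_sandbox_command(workdir), capture_output=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The timeout stops the docker client; a production setup would also
        # name the container and remove it explicitly.
        return False
    return result.returncode == 0
```

The boolean returned by `run_in_sandbox` plays the role of the pass/fail reward described above.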

DPO Training Loop

Preference pairs are extracted from MCTS trajectories: winning fixes (those that passed tests) are the "chosen" completions; failed attempts are "rejected." Blended Q-values (α × MCTS value + (1 − α) × AI critic score, with α = 0.5) rank them, combining the tree-search signal with the critic's assessment to produce a more robust preference ordering than either signal alone.

We use off-policy DPO with pre-computed reference log-probabilities. The key trick: instead of keeping a second copy of the model loaded as the reference policy during training (which would require another ~14 GB VRAM), we run the frozen reference model over the preference data once before training starts and cache the resulting log-probs. This decouples data collection from training and keeps the GPU memory footprint manageable on a single H100. LoRA config: rank=32, alpha=64, targeting all attention and MLP layers, lr=5e-6, 2 epochs, β=0.1.
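To make the role of the cached reference log-probs concrete, here is the standard DPO loss for a single preference pair written in plain Python. It is an illustration of the objective, not the TRL implementation; the function and argument names are hypothetical, and the inputs are summed sequence log-probabilities.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    The two ref_* values are constants, which is why they can be computed
    once before training and looked up from a cache afterwards.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Before any training the policy equals the reference, so the loss is log(2).
initial = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# A policy that raises the chosen fix and lowers the rejected one gets a lower loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Only the policy terms depend on the trainable LoRA weights, so the reference model never needs to sit in GPU memory during the update steps.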

3. The Critical Refactor: Line Edits → Full Rewrites

The original CodeQ design used a structured action space inspired by SWE-bench-style agents: PLAN, THOUGHT, and CODE_ACTION tokens, with line-level edit operations — EDIT (replace lines), INSERT (add lines), DELETE (remove lines), and RUN_TESTS.

This seemed principled. In practice, it was a disaster.

The model frequently generated malformed structured outputs: mismatched line numbers, incorrect action syntax, truncated edits. The parser failed silently on many of these, counting them as unsuccessful fixes. Our apparent fix rate was 0.38% — but this was a parsing artifact. When we added explicit fallback detection, the true base rate on correctly-parsed outputs was ~36%. But the parsing overhead was still killing throughput.

Discovery: Switching to full_rewrite mode eliminated all parsing failures and revealed the model's true debugging capability. MCTS with full rewrites ran ~14× faster per rollout because it eliminated the multi-step edit-parse-apply cycle entirely.

The fix was to switch the entire action space to full_rewrite: the model outputs a complete replacement for the buggy function. No structured parsing, no line numbers, no edit operations. Just the fixed code, wrapped in a simple code fence.
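Parsing a full rewrite then reduces to pulling the first fenced block out of the reply. The helper below is a minimal sketch with hypothetical names, not the parser from the repository; the fence marker is built programmatically only to keep this example easy to embed.

```python
import re

FENCE = "`" * 3  # the three-backtick code fence marker

# Match an optional "python" tag, then capture everything up to the closing fence.
_FENCE_RE = re.compile(
    re.escape(FENCE) + r"(?:python)?[ \t]*\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_rewrite(reply):
    """Return the code inside the first fenced block, or None if there is none."""
    match = _FENCE_RE.search(reply)
    if match is None:
        return None
    code = match.group(1).strip("\n")
    return code or None


reply = "Here is the fix:\n" + FENCE + "python\ndef f():\n    return 1\n" + FENCE + "\n"
fixed = extract_rewrite(reply)
```

Returning `None` explicitly makes a malformed reply a visible failure instead of a silent one, which was the problem with the line-edit parser.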

Results after the refactor:

MCTS baseline (full_rewrite, no DPO): 81.3% fix rate (100/123)
Full_rewrite baseline (single pass, no MCTS): 43.9% (54/123)
Speed: ~14× faster per rollout than the line-edit approach

The lesson is not that full rewrites are always better than line edits. The lesson is that action space design matters more than model quality. A simpler, more robust action space that the model can reliably execute will outperform a more expressive action space that introduces parse failures.

The Rewrite Prompt

The generation prompt is deliberately minimal:

You are a debugging assistant. Below is a Python function that contains a bug.
Your task: output a complete, corrected version of the function.
Output ONLY the corrected function code, wrapped in ```python ... ```.
Do not include any explanation.

Buggy function:
{buggy_code}

Test cases that must pass:
{test_cases}

The critic prompt is similarly direct, asking for a score from 0–1 and a brief rationale.

4. Data Quality: The 81% Duplication Discovery

During preprocessing of DebugBench, we found that approximately 81% of the dataset entries were duplicates. The raw download contained ~650 problem instances; after deduplication, only 123 unique problems remained.

This matters for two reasons:

Inflated metrics: Any model that memorized the training distribution would appear to perform much better on the duplicated set than on the unique problems.
Evaluation integrity: Fix rates reported on the full (duplicated) set are not comparable to fix rates on the deduplicated set.

All CodeQ results are reported on the deduplicated set of 123 unique problems. When comparing to other systems, verify whether they used the deduplicated or raw dataset.

Rule: Always audit your benchmark before reporting results. The 81% duplication rate was not obvious from the dataset documentation — it required computing pairwise similarity across problem statements and test cases.
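A pairwise-similarity audit of this kind can be sketched with the standard library. The field names (`statement`, `tests`) and the 0.95 threshold are assumptions for illustration; the post does not specify the similarity measure or cutoff that was actually used.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def problem_text(problem):
    """Concatenate the fields that identify a problem (hypothetical schema)."""
    return problem["statement"] + "\n" + problem["tests"]


def deduplicate(problems, threshold=0.95):
    """Keep a problem only if it is not near-identical to one already kept."""
    unique = []
    for problem in problems:
        text = problem_text(problem)
        if all(similarity(text, problem_text(kept)) < threshold for kept in unique):
            unique.append(problem)
    return unique


problems = [
    {"statement": "Return the sum of a list.", "tests": "assert f([1, 2]) == 3"},
    {"statement": "Return the sum of a list.", "tests": "assert f([1, 2]) == 3"},
    {"statement": "Reverse a string in place.", "tests": "assert g('ab') == 'ba'"},
]
unique = deduplicate(problems)
duplication_rate = 1 - len(unique) / len(problems)
```

The same arithmetic on the real numbers gives 1 − 123/650 ≈ 0.811, which is where the 81% figure comes from. The comparison is quadratic in the number of problems, which is fine at ~650 entries.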

5. DPO Training: What Worked and What Broke

Round 1 DPO

Round 1 DPO ran without incident. We extracted ~400 preference pairs from the MCTS trajectories, trained for 2 epochs with LoRA on Machine B, and saw modest improvement on held-out problems. The training was stable, loss curves were smooth, and the adapter transferred cleanly back to Machine A for evaluation.

Round 2: The bf16 NaN Collapse

Round 2 DPO training collapsed immediately: all losses went to NaN within the first 100 steps. Gradient norms were normal; learning rate was the same; data distribution was similar.

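The culprit: bf16 arithmetic in the logit and log-probability computation of the DPO loss. This is a precision problem rather than a range problem: bf16 shares fp32's exponent range (the ~65504 ceiling belongs to fp16), but it carries only 8 bits of significand. When reference and policy log-probabilities diverge significantly after Round 1 fine-tuning, computing them from bf16 logits becomes numerically unstable, and the loss goes to NaN.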

The fix was a custom trainer subclass that upcasts logits to fp32 before computing log-probabilities:

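from trl import DPOTrainer


class Fp32LogitsDPOTrainer(DPOTrainer):
    def concatenated_forward(self, model, batch, *args, **kwargs):
        # Extra arguments are passed through untouched so the override
        # tolerates signature differences between TRL versions.
        outputs = super().concatenated_forward(model, batch, *args, **kwargs)
        # Upcast logits to fp32 to avoid bf16 numerical issues.
        # Assumes the parent returns a dict that carries a "logits" entry.
        if outputs.get("logits") is not None:
            outputs["logits"] = outputs["logits"].float()
        return outputs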
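As a side note on the number formats involved: 65504 is the largest finite fp16 value, whereas bf16 keeps fp32's 8-bit exponent and only 8 bits of significand. The pure-Python emulation below (a hypothetical helper that rounds an fp32 bit pattern to its top 16 bits) shows both properties without needing a GPU.

```python
import math
import struct


def to_bf16(x):
    """Round a float to the nearest bf16 value and return it as a Python float."""
    if math.isnan(x):
        return x
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that bf16 drops.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


large = to_bf16(70000.0)   # above the fp16 maximum, still finite in bf16
coarse = to_bf16(257.0)    # needs 9 significant bits, so it cannot be exact
```

Values far beyond 65504 stay finite, while neighbouring integers such as 256 and 257 collapse to the same bf16 number. That loss of resolution is what the fp32 upcast removes.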

A second issue: TRL 1.0.0 broke the precompute_ref_log_probs flag, causing reference log-probs to be recomputed on every step instead of once. This made training ~3× slower and produced incorrect DPO gradients. Fix: pin TRL to 0.29.1.

# requirements.txt
trl==0.29.1  # 1.0.0 broke precompute_ref_log_probs

Preference Extraction

Preference pairs are extracted by comparing trajectories within the same MCTS tree. At each internal tree node, we compare sibling trajectories. The fix with the highest blended Q-value (α × MCTS value + (1 − α) × AI critic score, with α = 0.5) that passed tests becomes "chosen"; the highest-scoring failure becomes "rejected." We discard pairs where |Q_chosen - Q_rejected| < 0.2 to avoid learning from noise: pairs with near-identical scores provide a weak preference signal that can destabilize training.

The Q-value blending:

def compute_q_value(mcts_value: float, critic_score: float, alpha: float = 0.5) -> float:
    return alpha * mcts_value + (1 - alpha) * critic_score
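Putting the selection rule and the margin filter together, pair extraction at a single tree node might look like the sketch below. The sibling record layout (`code`, `passed`, `mcts_value`, `critic_score`) is a hypothetical schema, and `compute_q_value` is repeated only so the example runs on its own.

```python
def compute_q_value(mcts_value, critic_score, alpha=0.5):
    """Blend the MCTS value with the critic score."""
    return alpha * mcts_value + (1 - alpha) * critic_score


def extract_pair(siblings, alpha=0.5, min_margin=0.2):
    """Return one (chosen, rejected) pair for a node, or None if none qualifies."""
    scored = [
        (compute_q_value(s["mcts_value"], s["critic_score"], alpha), s)
        for s in siblings
    ]
    passed = [item for item in scored if item[1]["passed"]]
    failed = [item for item in scored if not item[1]["passed"]]
    if not passed or not failed:
        return None  # a pair needs at least one winner and one loser
    q_chosen, chosen = max(passed, key=lambda item: item[0])
    q_rejected, rejected = max(failed, key=lambda item: item[0])
    if abs(q_chosen - q_rejected) < min_margin:
        return None  # near-identical scores: too noisy to learn from
    return {
        "chosen": chosen["code"],
        "rejected": rejected["code"],
        "margin": q_chosen - q_rejected,
    }


siblings = [
    {"code": "fix_a", "passed": True, "mcts_value": 1.0, "critic_score": 0.8},
    {"code": "fix_b", "passed": False, "mcts_value": 0.0, "critic_score": 0.6},
    {"code": "fix_c", "passed": False, "mcts_value": 0.0, "critic_score": 0.2},
]
pair = extract_pair(siblings)
```

Choosing the highest-scoring failure as "rejected" gives the hardest negative available at that node, which is what the rule above describes.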

6. Results

Configuration	Fix Rate
Single-pass full_rewrite (no MCTS)	43.9% (54/123)
MCTS + full_rewrite (base model)	81.3% (100/123)
MCTS + full_rewrite (+ DPO Round 2)	84.0% (42/50)
Single-pass full_rewrite (+ DPO Round 2)	43.9% — no transfer
Pre-refactor apparent baseline (line edits)	~0.38% (parse failures)

The main results tell a clear story: MCTS search is the dominant factor, nearly doubling the fix rate from 43.9% to 81.3%. DPO adds a smaller improvement on top, pushing MCTS mode to 84%. Note that the DPO figure is measured on 50 problems (42/50) rather than the full 123, so the comparison with the base model is indicative rather than exact. The gain is concentrated on harder problems (see Ablation Studies below).

DPO Transfer: The Honest Finding

DPO Round 2 improves MCTS mode from 81.3% to 84%. This is the positive result.

DPO Round 2 does not transfer to single-pass full_rewrite mode. The fix rate stays at 43.9% — identical to the base model. Zero transfer.

The explanation is straightforward: the DPO training data consists entirely of MCTS trajectories — multi-step fix attempts where the model iteratively refines solutions across tree branches. The model learns to prefer good fixes when given this MCTS-style multi-step reasoning context. It does not learn to generate better single-pass rewrites, because single-pass generation behavior was never represented in the training signal. The policy shift from DPO is specific to the distribution it trained on.

This is not a failure — it's an informative result about how DPO specialization works in self-improvement loops. The model improves on the task distribution it trained on, not on adjacent tasks. It also suggests a clear next experiment: train DPO on single-pass trajectories to test whether transfer in the other direction is possible.

7. Ablation Studies

The top-line 81.3% fix rate obscures important variation across bug types, difficulty levels, and compute budgets. Three ablations break down where the gains actually come from.

7.1 By Bug Category

Category	Rewrite (base)	MCTS (base)	MCTS (+ DPO)
Syntax	61.9%	95.0%	95.0%
Logic	45.8%	90.0%	85.0%
Reference	55.9%	80.0%	80.0%
Multiple	31.8%	90.0%	85.0%

MCTS saturates on syntax errors (95%) where the search space is narrow: there are relatively few valid rewrites for a missing colon or mismatched bracket, so even modest search finds the fix. The largest gains appear on multiple-fault bugs (31.8% → 90%), where iterative search over full rewrites avoids the combinatorial explosion that makes line-level patching intractable. With multiple bugs present, line-edit approaches must independently locate and fix each bug, and a failure on any one prevents passing the test suite. Full rewrites sidestep this by regenerating the entire function coherently.

Interestingly, DPO slightly reduces performance on logic and multiple-fault categories (90% → 85%), suggesting the DPO policy may be over-specializing on certain fix patterns at the expense of exploration diversity. When the preference data is dominated by simpler fixes (which succeed more often and thus generate more training pairs), the model may learn to prefer conservative rewrites that work well on average but miss the creative solutions needed for complex multi-fault bugs.

7.2 By Difficulty

Difficulty	Rewrite (base → DPO)	MCTS (base → DPO)
Easy	56.8% → 56.8%	90% → 90%
Medium	40.7% → 44.4%	90% → 90%
Hard	34.4% → 34.4%	80% → 85%

DPO improves performance where it matters most: hard problems under MCTS search (80% → 85%). Easy and medium problems are already saturated by MCTS alone — there's no headroom for DPO to add value. The fact that DPO's gains concentrate on hard problems suggests the preference signal from MCTS trajectories is most informative when the search encounters genuine difficulty, producing diverse winning and losing trajectories that the policy can learn from.

7.3 By Rollout Budget

Rollouts	MCTS (base)	MCTS (+ DPO)
1	80%	78%
2	80%	80%
5	80%	82%
10	84%	84%
20	84%	86%

Performance plateaus at ~10 rollouts for the base model: additional search doesn't help once the easy solutions have been found. The base model at 1 rollout already achieves 80%, which means the majority of fixable bugs are solvable on the first or second attempt; the remaining gain of 4 percentage points (80% → 84%) requires 10× the compute.

The DPO policy shows a slight advantage at higher budgets (86% at 20 rollouts vs. 84% for the base model), suggesting that DPO produces candidates that are more distinguishable under extended search — the trained policy generates fixes that are more differentiated from each other, giving the MCTS tree more useful branches to explore. The practical takeaway: 10 rollouts is the efficiency sweet spot, but if you can afford 20, the DPO-trained policy extracts marginal value from the extra compute.

8. Engineering Findings Summary

For quick reference, here are the key engineering decisions and their impact on the project.

Finding	Impact
Full-rewrite action space	~0.38% apparent (line edits) → 81.3% fix rate (MCTS + full_rewrite)
81% data duplication in DebugBench	Discovered and fixed before Round 2
fp32 logit upcast for DPO	Fixed NaN loss under bf16
TRL pinned to 0.29.1	Avoided breaking changes in 1.0.0
DPO does not transfer to single-pass	Training distribution mismatch

9. What I'd Do Differently

Five things worth trying in a Round 3+:

Train DPO on full_rewrite trajectories (not just MCTS trajectories) to test whether single-pass transfer is possible. The current training data only covers MCTS-style multi-step reasoning, which explains the zero transfer.
Run Round 3 to measure diminishing returns. The 81.3% → 84% improvement from the base model to DPO Round 2 is modest. It's unclear whether Round 3 would hit a ceiling or keep climbing.
Cross-benchmark generalization. All results are on DebugBench. SWE-bench Lite would be a stronger test of whether MCTS-trained fix strategies generalize to real-world repository-level bugs.
Better preference extraction. The current blended Q-value (α × MCTS value + (1 − α) × critic score, with α = 0.5) is a heuristic. A learned value function trained on execution outcomes would be more principled.
Investigate the DPO regression on logic/multiple-fault categories. The ablation shows DPO slightly hurts performance on these categories (90% → 85%). This could indicate the preference signal is biased toward certain fix patterns, or that the blended Q-value threshold isn't filtering noise effectively. A per-category DPO training run would isolate whether this is a data distribution issue or an optimization issue.

The honest finding about no DPO transfer is as interesting as the 81.3% fix rate. It tells you something real about the limits of self-improvement via trajectory-based preference learning — and it points directly at what to fix next.

10. Links

GitHub: github.com/tathadn/codeq
Inspired by: Agent Q — Putta et al., 2024
Benchmark: DebugBench — Tian et al., 2024

Part of a portfolio exploring AI for software quality. Next up: VisionTriage (multimodal bug triage) and Speculative Decoding (inference optimization).
