Published
CodeQ: Teaching an LLM to Debug Code with MCTS and DPO
How to build a self-improving code debugging agent: MCTS exploration, dual-temperature critique, and DPO training from the model's own rollouts, plus the critical refactor that raised the fix rate on DebugBench from 10% to 81.3%. Includes the discovery of 81% data duplication, the bf16 NaN fix, and an honest accounting of what DPO did and didn't transfer.