AI/ML Engineering
Foundation model fine-tuning and agentic systems.
Flagship
CodeQ — Autonomous Code Debugging Agent
Qwen2.5-Coder-7B · MCTS · DPO · 2× H100
CodeQ is a self-improving code debugging agent inspired by Agent Q (Putta et al., 2024). It uses Monte Carlo Tree Search to systematically explore fix strategies for buggy code, an AI self-critique mechanism with dual-temperature scoring to rank proposed fixes, and Direct Preference Optimization to teach the model to prefer successful fixes over failed ones — all without human intervention.
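The search-then-prefer loop described above can be sketched in a few lines. This is a minimal illustration, not CodeQ's actual code: `FixNode`, `uct_score`, `select_child`, and `preference_pairs` are hypothetical names, the exploration constant `c = 1.4` is an assumed default, and the self-critique score is left out because the source does not specify how the dual-temperature scoring is computed. The `prompt` / `chosen` / `rejected` keys follow the preference format that DPO training in HuggingFace TRL accepts.

```python
import math
from dataclasses import dataclass, field
from itertools import product


@dataclass
class FixNode:
    """One candidate fix in the search tree (hypothetical structure)."""
    fix: str
    visits: int = 0
    value_sum: float = 0.0  # accumulated reward, e.g. 1.0 when the tests pass
    children: list["FixNode"] = field(default_factory=list)


def uct_score(child: FixNode, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean reward plus a bonus for rarely visited nodes."""
    if child.visits == 0:
        return math.inf  # unvisited fixes are always tried first
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(parent: FixNode) -> FixNode:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits))


def preference_pairs(prompt: str, outcomes: list[tuple[str, bool]]) -> list[dict]:
    """Pair every passing fix with every failing fix for the same bug."""
    passed = [fix for fix, ok in outcomes if ok]
    failed = [fix for fix, ok in outcomes if not ok]
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good, bad in product(passed, failed)
    ]
```

A bug with no passing fix, or no failing fix, yields no pairs and so contributes nothing to that training round.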
The system runs across two NVIDIA H100 nodes: Machine A handles MCTS inference in 4-bit quantization (~4–6 GB VRAM), while Machine B runs DPO training with LoRA in bf16 (~30–35 GB VRAM). LoRA adapters are transferred between machines via scp, enabling a pipelined workflow where exploration and training overlap.
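As a rough sanity check on those VRAM figures, the weight-only footprint can be estimated from the parameter count. The sketch below assumes about 7.6 billion parameters for the 7B model (an approximation, not a figure from this page) and ignores quantization metadata, KV cache, activations, and optimizer state, so it gives a lower bound rather than a measurement.

```python
GIB = 1024 ** 3

# Assumption: approximate parameter count for a "7B" model.
N_PARAMS = 7.6e9


def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


four_bit = weight_memory_gib(N_PARAMS, 4)   # inference machine
bf16 = weight_memory_gib(N_PARAMS, 16)      # training machine

print(f"4-bit weights: {four_bit:.1f} GiB")
print(f"bf16 weights:  {bf16:.1f} GiB")
```

Under these assumptions the 4-bit weights come to roughly 3.5 GiB and the bf16 weights to roughly 14 GiB. The gap to the reported ~4–6 GB and ~30–35 GB is presumably runtime overhead such as KV cache, activations, and LoRA gradients and optimizer state.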
Key Results
| Metric | Value |
| --- | --- |
| Pre-refactor line-edit baseline | ~10% (parse failures) |
| Full rewrite baseline | 43.9% (54/123) |
| MCTS rewrite (base model) | 81.3% (100/123) |
| MCTS rewrite (+ DPO Round 2) | 84.0% (42/50) |
| Improvement from rewrite refactor | 10% → 81.3% |
| DPO transfer to full_rewrite mode | No transfer |
Note: 81% data duplication was discovered in the DebugBench dataset and fixed during preprocessing.
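One common way to detect and remove this kind of duplication is to hash a normalized form of each sample. The sketch below is an illustration under assumptions, not the preprocessing actually used: the field name `buggy_code` is hypothetical, and treating samples as duplicates when they match after whitespace normalization is an assumed criterion.

```python
import hashlib


def normalize(code: str) -> str:
    """Drop trailing whitespace and blank lines so trivial variants collide."""
    lines = [line.rstrip() for line in code.strip().splitlines()]
    return "\n".join(line for line in lines if line)


def deduplicate(samples: list[dict], key: str = "buggy_code") -> list[dict]:
    """Keep the first occurrence of each distinct normalized snippet."""
    seen: set[str] = set()
    unique = []
    for sample in samples:
        digest = hashlib.sha256(normalize(sample[key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def duplication_rate(samples: list[dict], key: str = "buggy_code") -> float:
    """Fraction of samples that are repeats of an earlier sample."""
    if not samples:
        return 0.0
    return 1 - len(deduplicate(samples, key)) / len(samples)
```

Running `duplication_rate` before and after `deduplicate` gives a quick check that the cleaned split no longer contains repeats.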
Qwen2.5-Coder-7B-Instruct
MCTS
DPO
LoRA r=32
bf16
4-bit bitsandbytes
HuggingFace TRL
Flash Attention 2
Docker
2× NVIDIA H100 94GB
W&B
In Progress
VisionTriage — Multimodal Bug Report Triage
Qwen2.5-VL-7B-Instruct · QLoRA · Eclipse/Mozilla · Rico
VisionTriage fine-tunes Qwen2.5-VL-7B-Instruct with QLoRA to automatically triage software bug reports that include screenshots. The model takes a screenshot of a UI bug plus a text description and outputs structured triage metadata: severity level, affected component, bug type classification, root cause hypothesis, and suggested fix.
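The structured output described above could be validated with a small schema like the one below. This is a sketch under assumptions: the field names, the JSON output format, and the severity vocabulary (taken here from the default Bugzilla severity levels) are illustrative and may differ from what VisionTriage actually emits.

```python
import json
from dataclasses import dataclass, fields

# Assumption: the default Bugzilla severity levels are the label set.
SEVERITIES = {"blocker", "critical", "major", "normal", "minor", "trivial", "enhancement"}


@dataclass
class TriageResult:
    severity: str
    component: str
    bug_type: str
    root_cause_hypothesis: str
    suggested_fix: str


def parse_triage(raw: str) -> TriageResult:
    """Parse model output as JSON and reject incomplete or invalid results."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    names = [f.name for f in fields(TriageResult)]
    missing = [name for name in names if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    result = TriageResult(**{name: str(data[name]) for name in names})
    if result.severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {result.severity!r}")
    return result
```

Rejecting malformed outputs at this boundary keeps parse failures separate from genuine misclassifications when computing accuracy.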
The text-only severity prediction baseline is benchmarked against published methods (SevPredict, MASP, BERT-SBR) on the standard Eclipse/Mozilla Defect Tracking Dataset (~215K bug reports). The multimodal extension uses Rico screenshots with programmatic bug injection to demonstrate that adding visual context improves triage accuracy over text-only approaches.
This project complements CodeQ: CodeQ fixes bugs in code, while VisionTriage triages bugs from visual reports before they reach a developer.
Qwen2.5-VL-7B-Instruct
QLoRA
Eclipse/Mozilla Dataset
Rico
Synthetic Bug Injection
HuggingFace TRL
Gradio
Featured
Parallel Multi-Agent Code Generation
DAG-Based Agent Orchestration for Code Synthesis
A DAG-based multi-agent code generation system built with LangGraph and the native Anthropic SDK. An orchestrator agent analyzes coding tasks, builds a dependency graph, and dispatches parallel async coder workers that generate, review, and test code through structured handoffs.
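The dispatch pattern can be illustrated with plain `asyncio`. The sketch below is a stand-in, not the project's LangGraph graph: `run_dag` and `demo_worker` are hypothetical names, and the worker is a stub where a real system would call a model. Each task starts as soon as all of its dependencies have finished, so independent tasks run concurrently.

```python
import asyncio
from graphlib import TopologicalSorter
from typing import Awaitable, Callable

# A worker receives the task name and the results of its dependencies.
Worker = Callable[[str, dict[str, str]], Awaitable[str]]


async def run_dag(deps: dict[str, set[str]], worker: Worker) -> dict[str, str]:
    """Run every task in the graph, in parallel wherever dependencies allow."""
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    order = list(TopologicalSorter(deps).static_order())
    tasks: dict[str, asyncio.Task] = {}

    async def run(name: str) -> str:
        # Wait only for this task's own dependencies.
        upstream = {d: await tasks[d] for d in deps.get(name, ())}
        return await worker(name, upstream)

    # All tasks are created before any of them runs, so lookups in `tasks` are safe.
    for name in order:
        tasks[name] = asyncio.create_task(run(name))
    return {name: await task for name, task in tasks.items()}


async def demo_worker(name: str, upstream: dict[str, str]) -> str:
    """Stub coder worker: records which upstream results it received."""
    await asyncio.sleep(0.01)
    return f"{name}({', '.join(sorted(upstream))})"


if __name__ == "__main__":
    graph = {"models": set(), "api": {"models"}, "cli": {"models"}, "tests": {"api", "cli"}}
    print(asyncio.run(run_dag(graph, demo_worker)))
```

In this example `api` and `cli` both depend only on `models`, so they run side by side, and `tests` waits for both.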
LangGraph
Anthropic SDK
asyncio
Python
Featured
Self-Evolving Code Generation
LLM-as-Judge · Autonomous Prompt Evolution
Extension of the multi-agent pipeline that adds an LLM-as-Judge evaluator, failure analyzer, and autonomous prompt evolver. The system forms a generation loop where the tester agent rewrites its own system prompt based on evaluation feedback, creating a self-improving code generation pipeline.
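The evaluate, analyze, and rewrite cycle can be sketched as a loop over pluggable callables. Everything here is illustrative: `evolve_prompt`, the callable signatures, the 0.0 to 1.0 score scale, and the `target` and `max_rounds` defaults are assumptions, and in the real pipeline each callable would wrap an LLM call.

```python
from typing import Callable


def evolve_prompt(
    prompt: str,
    generate: Callable[[str], str],       # prompt -> generated output
    judge: Callable[[str], float],        # output -> score in [0, 1]
    analyze: Callable[[str, float], str], # output, score -> failure summary
    rewrite: Callable[[str, str], str],   # prompt, failure summary -> new prompt
    max_rounds: int = 3,
    target: float = 0.9,
) -> tuple[str, list[dict]]:
    """Iteratively rewrite a system prompt until the judge score reaches the target."""
    history: list[dict] = []
    for round_no in range(max_rounds):
        output = generate(prompt)
        score = judge(output)
        history.append({"round": round_no, "prompt": prompt, "score": score})
        if score >= target:
            break
        prompt = rewrite(prompt, analyze(output, score))
    # Return the best-scoring prompt seen, not necessarily the last one tried.
    best = max(history, key=lambda entry: entry["score"])
    return best["prompt"], history
```

Because `history` contains only plain dictionaries, it can be serialized with `json.dumps` to keep a per-round record of prompts and scores.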
LangGraph
LLM-as-Judge
Prompt Evolution
JSON Tracker
Docker
Foundation
Multi-Agent Code Generation V1
Sequential Pipeline · LangSmith Tracing
Sequential multi-agent pipeline using an Orchestrator → Planner → Coder → Reviewer → Tester architecture. Includes LangSmith tracing for observability and Docker sandboxing for safe code execution. This was the foundation that led to the parallel and self-evolving versions.
LangChain
LangSmith
Docker
Python