Post-Training GPT-OSS-20B for ML Research Coding

This project asks whether a strong open-weight model can be meaningfully improved for a narrow but practical target: ML research coding. The code is on GitHub at sabdulmajid/gpt-oss-research. The benchmark focuses on the kind of code that actually shows up in research workflows: tensor manipulation, PyTorch modules, training loops, distributed training utilities, debugging, and performance-sensitive model code. The core question is whether post-training can make GPT-OSS-20B more reliable at producing executable, standalone ML research code.

The word "executable" is doing a lot of work in that sentence. In this domain, a response that is directionally correct but misses an import, changes the required function name, or wraps code in prose is still a failure. Evaluation has to measure runnable behavior, not just whether the output looks plausible.

What We Built

We put together an internal 51-task development benchmark for ML research coding and used it to compare several checkpoints.

Model / Checkpoint	Result
Base GPT-OSS-20B	26 / 51
Original SFT checkpoint	19 / 51
Clean SFT checkpoint, early stopped	28 / 51

The base model was already fairly capable. The first supervised fine-tuning attempt made it worse. Most of the subsequent work went into understanding whether that regression was real, whether the evaluator was broken, and what kind of training data would actually help.

Evaluator Repair

The first failure mode was measurement itself. The model often emitted final-answer markers, code fences, or prose before and after code, and some formatting was just slightly off. Some failures were real, but others were extraction artifacts. We fixed the evaluator to recover the intended final code more faithfully while preserving the requirement that submitted code must actually execute. After that repair, oracle and reference solutions passed all 51 tasks, which gave us reasonable confidence the tasks were valid. The original SFT checkpoint improved after rescoring but still sat below the base model. The regression wasn't just an evaluator bug.
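To make the extraction problem concrete, here is a minimal sketch of the two halves of such an evaluator: recovering the intended code from a noisy response, then requiring that it actually runs. This is not the project's evaluator. The marker string `FINAL ANSWER:`, the function names, and the timeout are all assumptions chosen for illustration.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical final-answer marker; the post does not specify the real format.
FINAL_MARKER = "FINAL ANSWER:"
# Matches a fenced block with an optional language tag after the opening fence.
FENCE_RE = re.compile(r"```[\w+-]*[ \t]*\n(.*?)```", re.DOTALL)


def extract_code(response: str) -> str:
    """Return the most likely intended code from a raw model response."""
    # Keep only what follows the last final-answer marker, if one exists.
    if FINAL_MARKER in response:
        response = response.rsplit(FINAL_MARKER, 1)[1]
    # Prefer the last fenced block; otherwise fall back to the raw text.
    blocks = FENCE_RE.findall(response)
    return (blocks[-1] if blocks else response).strip() + "\n"


def passes(code: str, test_code: str, timeout_s: float = 30.0) -> bool:
    """Run the candidate plus its tests in a fresh interpreter process."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n" + test_code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
    # Any uncaught exception, including a failed assert, gives a nonzero exit code.
    return proc.returncode == 0
```

The extraction step is deliberately forgiving about wrapping, while the execution step stays strict: nothing is patched into the candidate, so a missing import still fails.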

Why the First SFT Run Regressed

Supervised fine-tuning only helps if the training distribution matches the behavior you want at inference time. For this benchmark, that means producing exactly the requested function or object, including all required imports, avoiding trailing explanation that breaks execution, staying within context length without truncating the target, and actually solving the algorithmic problem rather than just imitating research-code style.

The original SFT data violated several of these requirements. A substantial fraction of examples were too long for the configured sequence length, so assistant targets were truncated during training. Many examples were prose-heavy. Some code examples used PyTorch, NumPy, or math without standalone imports.

The failure pattern matched this explanation closely. The original checkpoint frequently failed due to missing imports, syntax issues, incomplete answers, or missing required symbols. A diagnostic-only import injection recovered many of those failures, which confirmed that import completeness accounted for a large part of the gap. We used that experiment only to understand what was going wrong and did not count it as an official score. What the model had learned was the style of ML code, not the discipline of producing complete, executable answers.
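A diagnostic import injection of the kind described above could look roughly like the following. This is a sketch under assumptions, not the project's actual tool: the alias table `KNOWN_IMPORTS` and both function names are hypothetical, and the name analysis ignores scoping, which is acceptable for a diagnostic but not for scoring.

```python
import ast

# Assumed alias-to-import table for this sketch; the post names PyTorch, NumPy, and math.
KNOWN_IMPORTS = {
    "torch": "import torch",
    "nn": "import torch.nn as nn",
    "F": "import torch.nn.functional as F",
    "np": "import numpy as np",
    "math": "import math",
}


def missing_imports(code: str) -> list[str]:
    """List import lines for known aliases that are read but never bound."""
    tree = ast.parse(code)  # raises SyntaxError on unparseable output
    bound, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import torch.nn` binds `torch`; `import x as y` binds `y`.
                bound.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Name):
            (used if isinstance(node.ctx, ast.Load) else bound).add(node.id)
    unbound = used - bound
    return [stmt for name, stmt in KNOWN_IMPORTS.items() if name in unbound]


def inject_imports(code: str) -> str:
    """Diagnostic only: prepend missing imports to see how many failures they explain."""
    lines = missing_imports(code)
    return "\n".join(lines + [code]) if lines else code
```

Comparing pass counts with and without `inject_imports` gives an estimate of how much of the gap is import completeness, while the official score continues to use the unmodified output.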

Clean SFT

We then built smaller, cleaner supervised datasets constrained around executable standalone code. The main changes were keeping examples short enough to avoid assistant truncation, removing evaluation prompt overlap, preferring parseable code-like targets, stripping out prose prefixes and trailing explanations, requiring standalone imports, and training from the base model rather than continuing from the regressed checkpoint. The clean SFT run used LoRA rather than full fine-tuning, so only a small fraction of parameters was trainable: adapters on attention and MLP-style linear modules, plus selected expert projections. The run was intentionally conservative, with short training, deterministic evaluation, and careful checkpoint comparison. The results were instructive.
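Before turning to those results, the dataset constraints listed above can be sketched as a single filter. Everything here is illustrative rather than the project's pipeline: `Example`, `is_clean`, the 1024-token budget, and the watched module names are assumptions, and `count_tokens` is a whitespace stand-in for the model's real tokenizer.

```python
import ast
from dataclasses import dataclass

# Names that must be imported if used; an assumed list for this sketch.
WATCHED_MODULES = {"torch", "np", "math"}


@dataclass
class Example:
    prompt: str
    target: str


def normalize(text: str) -> str:
    """Collapse whitespace and case so near-identical prompts compare equal."""
    return " ".join(text.split()).lower()


def count_tokens(text: str) -> int:
    # Placeholder: whitespace split. A real pipeline would use the model tokenizer.
    return len(text.split())


def is_clean(ex: Example, eval_prompts: set[str], max_tokens: int = 1024) -> bool:
    """Return True if the example satisfies the cleaning rules sketched here."""
    # The whole example must fit, so the assistant target is never truncated.
    if count_tokens(ex.prompt) + count_tokens(ex.target) > max_tokens:
        return False
    # No overlap with evaluation prompts (exact match after normalization).
    if normalize(ex.prompt) in eval_prompts:
        return False
    # The target must parse as Python, which rejects most prose wrappers.
    try:
        tree = ast.parse(ex.target)
    except SyntaxError:
        return False
    # Any watched module that is used must also be imported in the target.
    imported = {
        alias.asname or alias.name.split(".")[0]
        for node in ast.walk(tree)
        if isinstance(node, (ast.Import, ast.ImportFrom))
        for alias in node.names
    }
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    missing = (used & WATCHED_MODULES) - imported
    return not missing
```

Exact-match overlap removal is the weakest part of this sketch; a real leakage check would likely also need fuzzy or n-gram matching against the benchmark prompts.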

Checkpoint	Result
Clean SFT step 50	28 / 51
Clean SFT step 100	19 / 51

The step-50 checkpoint beat both the base model and the original SFT checkpoint. The step-100 checkpoint regressed sharply. On a small, clean dataset, more training wasn't better. The model quickly picked up the intended format and behavior, then started drifting.
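The checkpoint comparison itself is simple enough to state as code. The sketch below picks the step with the most passed tasks and breaks ties toward the earlier step. The function name is hypothetical, and representing the base model as step 0 is an assumption made for the example.

```python
def select_checkpoint(pass_counts: dict[int, int]) -> int:
    """Return the training step with the most passed tasks; ties go to the earlier step."""
    # max() keeps the first maximum it sees, and sorted() puts earlier steps first.
    return max(sorted(pass_counts), key=lambda step: pass_counts[step])


# Pass counts out of 51 from the tables above, with the base model as step 0.
results = {0: 26, 50: 28, 100: 19}
best_step = select_checkpoint(results)
```

With these numbers the selected step is 50, which is why evaluating intermediate checkpoints matters more than reaching the end of the schedule.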

What Improved

The clean early-stopped checkpoint moved the model from 26/51 to 28/51 on the internal development benchmark. Relative to the original SFT checkpoint, it more than recovered the lost performance, finishing two tasks above the base model. The win wasn't that the model became broadly smarter. It became more likely to produce usable code in the expected shape, and for this kind of evaluation, that's the win that matters. Post-training can improve reliability by aligning output structure, imports, and task format even when it doesn't substantially improve deep algorithmic reasoning.

What Still Fails

The remaining failures are mostly concentrated around missing imports, runtime errors, incomplete executable structure, and genuine algorithmic or semantic mistakes. A diagnostic import-injection experiment suggested there's still a fair amount of recoverable performance if standalone completeness improves, but patching model outputs during official evaluation isn't the right answer. The model needs to learn to emit complete answers on its own.
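One way to track these categories over time is to bucket each failed run by the exception on the last line of its stderr. The mapping below is a heuristic assumed for illustration, not the project's taxonomy: for example, a `NameError` can mean either a missing import or a missing required symbol, and an `AssertionError` is taken to mean the code ran but produced the wrong result.

```python
from collections import Counter

# Heuristic exception-to-bucket mapping; an assumption for this sketch.
BUCKETS = {
    "NameError": "missing_import_or_symbol",
    "SyntaxError": "incomplete_structure",
    "IndentationError": "incomplete_structure",
    "AssertionError": "wrong_result",
}


def classify_failure(stderr: str) -> str:
    """Map a failed run's stderr to a coarse failure bucket."""
    lines = stderr.strip().splitlines()
    if not lines:
        return "unknown"
    # A Python traceback ends with "ExceptionName: message" or just "ExceptionName".
    exc_name = lines[-1].split(":", 1)[0].strip()
    return BUCKETS.get(exc_name, "runtime_error")


def summarize(stderrs: list[str]) -> Counter:
    """Count failures per bucket across a set of failed runs."""
    return Counter(classify_failure(s) for s in stderrs)
```

Separating these buckets makes it easier to tell whether a new checkpoint fixed completeness problems or actually solved more tasks.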

GRPO Status

We also explored GRPO-style reinforcement learning, and the early attempt didn't go well enough to claim any progress. Sampled completions received zero reward across logged steps, gradients were effectively uninformative, and memory pressure was high. That run was stopped before producing a meaningful checkpoint. The key takeaway is that RL should not be layered on top of a weak or poorly shaped SFT checkpoint. If the model cannot reliably produce executable candidates, sparse pass/fail rewards do not give it much to learn from. The next GRPO attempt will be built upon a clean SFT checkpoint, use tighter generation settings, trace rewards carefully, and verify nonzero learning signal before scaling.
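The zero-reward problem follows directly from how group-relative advantages are computed. The sketch below uses one common formulation, normalizing each reward by the mean and standard deviation of its sampling group. It is not the project's training code, and `group_advantages` and `eps` are names chosen for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sampled completion fails: all advantages are zero, so there is no signal.
all_fail = group_advantages([0.0] * 8)

# One completion passes: it gets a positive advantage and the rest go negative.
one_pass = group_advantages([0.0, 0.0, 1.0, 0.0])
```

When every completion in a group receives the same reward, each advantage is exactly zero and the policy gradient for that prompt vanishes. A cheap check before scaling is to log the fraction of groups with nonzero reward variance.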

Where Things Stand

A few practical conclusions from the project so far: evaluation extraction matters, but it cannot explain away all regressions. Target truncation is particularly damaging for code fine-tuning. Small, clean, executable examples can improve reliability quickly. The biggest surprise was that the best checkpoint appeared well before the end of training, which is a good reminder that more training is not always better, especially on small datasets. For ML research code, the difference between "looks right" and "runs correctly" is the entire ballgame. Code post-training should optimize for executable contracts, not just code-shaped text.

Next Steps

The next phase is turning the current development win into something more robust. That means expanding the clean standalone-code dataset without benchmark leakage, adding more import-complete PyTorch and tensor-programming examples, carefully testing longer-context clean SFT variants, and evaluating on external coding holdouts rather than just the internal benchmark. GRPO is still on the table, but only after reward signal and memory behavior are properly validated, and only after separating formatting and completeness failures from true reasoning failures. Clean SFT moved GPT-OSS-20B above the base model on the internal benchmark, and more importantly, there is now a much clearer diagnosis of what went wrong and a sharper path for improving executable reliability.