Results
The cross-method comparison this project is built around. One row per training run; every row uses the same base model, the same eval harness, and the same JSON schema. Apples-to-apples by construction.
Live source: results/metrics.json.
The table below is updated manually when each phase lands — the workflow
that auto-renders it from metrics.json is a Phase 6 nicety, not yet built.
Current state
| Method | Phase | config_hash | MMLU | GSM8K (strict) | TruthfulQA | IFEval prompt-strict | IFEval inst-strict |
|---|---|---|---|---|---|---|---|
| base (pretrained Qwen/Qwen2.5-0.5B) | 0 | fde0720e | 0.4813 | 0.3389 | 0.3988 | 0.1238 | 0.2278 |
| sft_v2 (UltraChat-200k, 5k, QLoRA r=16, 1 epoch) | 1 | b133712d | 0.4713 | 0.3450 | 0.3893 | 0.1201 | 0.2398 |
| dpo_v1 (UltraFeedback, 5k pairs, β=0.1, 1 epoch) | 2 | d53fd258 | 0.4802 | 0.3495 | 0.3958 | 0.1275 | 0.2422 |
| rm_v1 | 3 | planned | — | — | — | — | — |
| ppo_v1 | 3 | planned | — | — | — | — | — |
| grpo_v1 | 4 | planned | — | — | — | — | — |
| kto_v1 or orpo_v1 | 5 | planned | — | — | — | — | — |
Deltas
| Metric | sft_v2 − base | dpo_v1 − sft_v2 | dpo_v1 − base |
|---|---|---|---|
| MMLU | −1.00pp | +0.89pp | −0.11pp |
| GSM8K strict | +0.61pp | +0.45pp | +1.06pp |
| TruthfulQA | −0.95pp | +0.65pp | −0.30pp |
| IFEval prompt-strict | −0.37pp (flat) | +0.74pp | +0.37pp |
| IFEval inst-strict | +1.20pp | +0.24pp | +1.44pp |
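The deltas are plain differences between rows of the Current state table, expressed in percentage points. A minimal sketch of that arithmetic follows; the `SCORES` dict and the `delta_pp` helper are illustrative names, not part of the project's code, and the numbers are copied by hand from the table rather than read from metrics.json.

```python
# Scores copied from the "Current state" table (fractions, not percentages).
SCORES = {
    "base": {
        "MMLU": 0.4813, "GSM8K strict": 0.3389, "TruthfulQA": 0.3988,
        "IFEval prompt-strict": 0.1238, "IFEval inst-strict": 0.2278,
    },
    "sft_v2": {
        "MMLU": 0.4713, "GSM8K strict": 0.3450, "TruthfulQA": 0.3893,
        "IFEval prompt-strict": 0.1201, "IFEval inst-strict": 0.2398,
    },
    "dpo_v1": {
        "MMLU": 0.4802, "GSM8K strict": 0.3495, "TruthfulQA": 0.3958,
        "IFEval prompt-strict": 0.1275, "IFEval inst-strict": 0.2422,
    },
}


def delta_pp(run: str, baseline: str, metric: str) -> float:
    """Difference between two runs on one metric, in percentage points."""
    return round((SCORES[run][metric] - SCORES[baseline][metric]) * 100, 2)


# Print the three comparison columns for every metric.
for metric in SCORES["base"]:
    print(
        f"{metric}: "
        f"sft_v2 - base = {delta_pp('sft_v2', 'base', metric):+.2f}pp, "
        f"dpo_v1 - sft_v2 = {delta_pp('dpo_v1', 'sft_v2', metric):+.2f}pp, "
        f"dpo_v1 - base = {delta_pp('dpo_v1', 'base', metric):+.2f}pp"
    )
```

Rounding to two decimals after scaling by 100 keeps floating-point noise out of the reported values.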
The headline finding: SFT on a 0.5B model with 5k UltraChat rows was flat on
the headline IFEval prompt-strict metric (−0.37pp). DPO on top of that SFT
checkpoint raised every lm-eval metric relative to sft_v2 and put dpo_v1
slightly above the pretrained base on IFEval prompt-strict (+0.37pp), making
it the first policy in this lab to clear that bar. MMLU and TruthfulQA remain
marginally below base (−0.11pp and −0.30pp). The full Phase 1 story is in
experiments/002_sft_qwen05b.md; Phase 2 is in experiments/003_dpo_qwen05b.md.
What's tracked beyond this table
| Signal | Where it lives | Purpose |
|---|---|---|
| Pairwise judge win rate vs SFT | results/judge/*.json (Phase 6) | DPO's formal success criterion per PROJECT.md §6. Not yet measured. |
| KL divergence from reference | W&B run logs | PPO sanity — verifies the policy isn't drifting wildly from SFT. |
| Reward model accuracy on held-out pairs | results/rm/*.json (Phase 3+) | Reward model isn't load-bearing if it can't distinguish preferred from rejected. |
| Per-task stderr from lm-eval | metrics.json (dropped from flattened keys, present in raw) | When two runs differ by less than 2 * stderr, the difference isn't meaningful. |
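The 2 * stderr rule in the last row can be sketched as a small check. This is an illustration under assumptions: the table does not say which run's stderr to use, so the sketch takes the larger of the two, and the `is_meaningful` name and the example numbers are hypothetical rather than taken from metrics.json.

```python
def is_meaningful(score_a: float, score_b: float,
                  stderr_a: float, stderr_b: float) -> bool:
    """Return True when two runs differ by at least 2 * stderr.

    Assumption: the comparison uses the larger of the two per-task
    standard errors, which is the more conservative choice.
    """
    threshold = 2 * max(stderr_a, stderr_b)
    return abs(score_a - score_b) >= threshold


# Illustrative numbers only, not real results from this project.
print(is_meaningful(0.500, 0.505, 0.004, 0.004))  # gap 0.005 vs threshold 0.008
print(is_meaningful(0.500, 0.520, 0.004, 0.004))  # gap 0.020 vs threshold 0.008
```

The first call reports a gap below the threshold and the second a gap above it. A stricter variant could combine the two standard errors in quadrature; which form the project intends is not stated here.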
How to interpret a comparison
A method "winning" is not "highest IFEval." It's:
- Clear IFEval improvement over base (the floor everyone has to clear).
- No catastrophic regression on MMLU or TruthfulQA. If a method bumps IFEval by 5 points but drops MMLU by 8, that's not a win — it's a forgetting pathology.
- Reproducible. Same config_hash, same data revision, same seed → same metrics within noise. PROJECT.md treats irreproducible deltas as nonexistent.
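The first two criteria can be expressed as a simple predicate. The sketch below is hypothetical: the `Scores` class, the `is_win` function, and both thresholds are placeholders chosen for illustration, since this page does not define what counts as "clear" or "catastrophic". Reproducibility is not covered, because it needs repeated runs rather than a single row.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    ifeval: float      # IFEval prompt-strict
    mmlu: float
    truthfulqa: float


def is_win(method: Scores, base: Scores,
           min_ifeval_gain: float = 0.01,
           max_regression: float = 0.02) -> bool:
    """Check criteria 1 and 2: IFEval gain without a large regression.

    Both thresholds are assumed values, not project policy.
    """
    clears_floor = (method.ifeval - base.ifeval) >= min_ifeval_gain
    no_forgetting = (
        (base.mmlu - method.mmlu) <= max_regression
        and (base.truthfulqa - method.truthfulqa) <= max_regression
    )
    return clears_floor and no_forgetting


# The example from the list: +5 points on IFEval, -8 points on MMLU.
base = Scores(ifeval=0.20, mmlu=0.50, truthfulqa=0.40)
forgetful = Scores(ifeval=0.25, mmlu=0.42, truthfulqa=0.40)
print(is_win(forgetful, base))  # False: the MMLU drop exceeds the limit
```

A method with the same IFEval gain and only a half-point MMLU drop would pass under these placeholder thresholds.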
How the comparison stays fair
Five things are held constant across every row:
- Base model: pretrained Qwen/Qwen2.5-0.5B, same revision.
- LoRA setup: r=16, α=32, same target modules (LoRA / QLoRA page).
- Quantization: 4-bit NF4 + double quant for SFT (Phase 1); bf16 for DPO (Phase 2 disables quant to allow merge_and_unload of the SFT adapter).
- Eval tasks: the four in Eval harness, same num_fewshot and limit.
- Random seed: 42 across data shuffling, weight init, eval sampling.
Only the method-specific knobs (loss, dataset format, RL hparams) vary per row. That's the load-bearing fairness invariant of the project.
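One way to guard that invariant is a check that the shared settings really are identical across run configs. The sketch below is an assumption about how such a check might look: the config keys and the `shared_key_mismatches` helper are hypothetical, and only the values stated on this page (model name, r=16, α=32, seed 42) are used. Quantization is left out because the list above says it differs between Phase 1 and Phase 2, and the eval settings are omitted because their values are not given here.

```python
# Hypothetical key names; the project's real config schema may differ.
SHARED_KEYS = ("base_model", "lora_r", "lora_alpha", "seed")


def shared_key_mismatches(configs: dict) -> dict:
    """Map each shared key to its distinct values when runs disagree."""
    mismatches = {}
    for key in SHARED_KEYS:
        values = {cfg[key] for cfg in configs.values()}
        if len(values) > 1:
            mismatches[key] = values
    return mismatches


configs = {
    "sft_v2": {"base_model": "Qwen/Qwen2.5-0.5B", "lora_r": 16,
               "lora_alpha": 32, "seed": 42, "loss": "sft"},
    "dpo_v1": {"base_model": "Qwen/Qwen2.5-0.5B", "lora_r": 16,
               "lora_alpha": 32, "seed": 42, "loss": "dpo", "beta": 0.1},
}

# Method-specific keys such as "loss" and "beta" are free to vary.
print(shared_key_mismatches(configs))  # {} when the shared settings agree
```

An empty result means the rows are comparable on the checked keys; any entry names a setting that drifted between runs.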