Results

The cross-method comparison this project is built around. One row per training run; every row uses the same base model, the same eval harness, and the same JSON schema. Apples-to-apples by construction.

Live source: `results/metrics.json`. The table below is updated manually when each phase lands; the workflow that auto-renders it from `metrics.json` is a Phase 6 nicety and is not yet built.

Current state

| Method | Phase | config_hash | MMLU | GSM8K (strict) | TruthfulQA | IFEval prompt-strict | IFEval inst-strict |
|---|---|---|---|---|---|---|---|
| base (pretrained `Qwen/Qwen2.5-0.5B`) | 0 | `fde0720e` | 0.4813 | 0.3389 | 0.3988 | 0.1238 | 0.2278 |
| sft_v2 (UltraChat-200k, 5k, QLoRA r=16, 1 epoch) | 1 | `b133712d` | 0.4713 | 0.3450 | 0.3893 | 0.1201 | 0.2398 |
| dpo_v1 (UltraFeedback, 5k pairs, β=0.1, 1 epoch) | 2 | `d53fd258` | 0.4802 | 0.3495 | 0.3958 | 0.1275 | 0.2422 |
| rm_v1 | 3 | planned | — | — | — | — | — |
| ppo_v1 | 3 | planned | — | — | — | — | — |
| grpo_v1 | 4 | planned | — | — | — | — | — |
| kto_v1 or orpo_v1 | 5 | planned | — | — | — | — | — |

Deltas

| Metric | sft_v2 − base | dpo_v1 − sft_v2 | dpo_v1 − base |
|---|---|---|---|
| MMLU | −1.00pp | +0.89pp | −0.11pp |
| GSM8K strict | +0.61pp | +0.45pp | +1.06pp |
| TruthfulQA | −0.95pp | +0.65pp | −0.30pp |
| IFEval prompt-strict | −0.37pp (flat) | +0.74pp | +0.37pp |
| IFEval inst-strict | +1.20pp | +0.24pp | +1.44pp |
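
The deltas are plain differences of the scores in the Current state table, expressed in percentage points. A minimal sketch for re-deriving them by hand follows; the `SCORES` dictionary and the `delta_pp` helper are illustrative names, not part of the project's code, and the numbers are copied from the table rather than read from `metrics.json`.

```python
# Scores copied from the "Current state" table (fractions in [0, 1]).
# SCORES and delta_pp are illustrative names, not project code.
SCORES = {
    "base": {
        "MMLU": 0.4813, "GSM8K strict": 0.3389, "TruthfulQA": 0.3988,
        "IFEval prompt-strict": 0.1238, "IFEval inst-strict": 0.2278,
    },
    "sft_v2": {
        "MMLU": 0.4713, "GSM8K strict": 0.3450, "TruthfulQA": 0.3893,
        "IFEval prompt-strict": 0.1201, "IFEval inst-strict": 0.2398,
    },
    "dpo_v1": {
        "MMLU": 0.4802, "GSM8K strict": 0.3495, "TruthfulQA": 0.3958,
        "IFEval prompt-strict": 0.1275, "IFEval inst-strict": 0.2422,
    },
}


def delta_pp(later: str, earlier: str, metric: str) -> float:
    """Return SCORES[later] - SCORES[earlier] for one metric, in percentage points."""
    diff = SCORES[later][metric] - SCORES[earlier][metric]
    return round(diff * 100, 2)


if __name__ == "__main__":
    for metric in SCORES["base"]:
        print(
            f"{metric}: "
            f"sft_v2-base {delta_pp('sft_v2', 'base', metric):+.2f}pp, "
            f"dpo_v1-sft_v2 {delta_pp('dpo_v1', 'sft_v2', metric):+.2f}pp, "
            f"dpo_v1-base {delta_pp('dpo_v1', 'base', metric):+.2f}pp"
        )
```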

The headline finding: SFT on a 0.5B model with 5k UltraChat rows was flat on IFEval prompt-strict, the headline metric (−0.37pp vs. base). DPO on top of that SFT checkpoint improved all five reported metrics relative to sft_v2 and put dpo_v1 slightly above the pretrained base on IFEval prompt-strict (+0.37pp), making it the first policy in this lab to clear that bar. dpo_v1 still trails base slightly on MMLU (−0.11pp) and TruthfulQA (−0.30pp). Full Phase 1 story in experiments/002_sft_qwen05b.md; Phase 2 in experiments/003_dpo_qwen05b.md.

What's tracked beyond this table

| Signal | Where it lives | Purpose |
|---|---|---|
| Pairwise judge win rate vs SFT | `results/judge/*.json` (Phase 6) | DPO's formal success criterion per PROJECT.md §6. Not yet measured. |
| KL divergence from reference | W&B run logs | PPO sanity check: verifies that the policy isn't drifting wildly from SFT. |
| Reward model accuracy on held-out pairs | `results/rm/*.json` (Phase 3+) | A reward model that can't distinguish preferred from rejected responses can't be load-bearing. |
| Per-task `stderr` from lm-eval | `metrics.json` (dropped from flattened keys, present in raw) | When two runs differ by less than `2 * stderr`, the difference isn't meaningful. |
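
The `2 * stderr` rule in the last row can be sketched as a small check. This is an illustration, not project code: the function name is hypothetical, and because each run reports its own `stderr`, the sketch assumes the larger of the two is used as the noise scale, which is the more conservative reading of the rule.

```python
def is_meaningful(score_a: float, score_b: float,
                  stderr_a: float, stderr_b: float) -> bool:
    """Apply the '2 * stderr' rule to two runs' scores on one task.

    Assumption: the larger of the two per-run stderr values sets the noise
    floor. The source only says '2 * stderr' without naming which run's.
    """
    noise_floor = 2 * max(stderr_a, stderr_b)
    return abs(score_a - score_b) >= noise_floor


if __name__ == "__main__":
    # Synthetic numbers for illustration only; NOT values from metrics.json.
    print(is_meaningful(0.50, 0.52, 0.005, 0.005))
    print(is_meaningful(0.50, 0.505, 0.005, 0.004))
```

In practice the `stderr` arguments would come from the raw lm-eval output in `metrics.json`, since the flattened keys drop them.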

How to interpret a comparison

A method "winning" is not "highest IFEval." It's:

1. Clear IFEval improvement over base (the floor everyone has to clear).
2. No catastrophic regression on MMLU or TruthfulQA. If a method bumps IFEval by 5 points but drops MMLU by 8, that's not a win; it's a forgetting pathology.
3. Reproducible. Same config_hash, same data revision, same seed → same metrics within noise. PROJECT.md treats irreproducible deltas as nonexistent.
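
The three criteria above can be read as one predicate. The sketch below is one possible encoding, not the project's definition: the `Scores` type, the `is_win` name, and both thresholds (`min_ifeval_gain_pp`, `max_regression_pp`) are hypothetical, because the source does not say how large a "clear" improvement or a "catastrophic" regression is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Scores for one run, as fractions in [0, 1]. Hypothetical container."""
    ifeval: float
    mmlu: float
    truthfulqa: float


def is_win(base: Scores, candidate: Scores, reproducible: bool,
           min_ifeval_gain_pp: float = 1.0,
           max_regression_pp: float = 2.0) -> bool:
    """Check the three 'winning' criteria with ASSUMED thresholds."""
    ifeval_gain = (candidate.ifeval - base.ifeval) * 100
    mmlu_drop = (base.mmlu - candidate.mmlu) * 100
    truthfulqa_drop = (base.truthfulqa - candidate.truthfulqa) * 100

    clears_floor = ifeval_gain >= min_ifeval_gain_pp
    no_forgetting = max(mmlu_drop, truthfulqa_drop) <= max_regression_pp
    return clears_floor and no_forgetting and reproducible


if __name__ == "__main__":
    # The example from the text: +5 points IFEval, -8 points MMLU.
    base = Scores(ifeval=0.10, mmlu=0.50, truthfulqa=0.40)
    forgetful = Scores(ifeval=0.15, mmlu=0.42, truthfulqa=0.40)
    print(is_win(base, forgetful, reproducible=True))  # False
```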

How the comparison stays fair

Five things are pinned for every row:

1. Base model: pretrained `Qwen/Qwen2.5-0.5B`, same revision.
2. LoRA setup: r=16, α=32, same target modules (see the LoRA / QLoRA page).
3. Quantization: 4-bit NF4 + double quant for SFT (Phase 1); bf16 for DPO (Phase 2 disables quantization so the SFT adapter can be merged with merge_and_unload). This is the one setting that differs by phase.
4. Eval tasks: the four listed on the Eval harness page, same num_fewshot and limit.
5. Random seed: 42 for data shuffling, weight initialization, and eval sampling.

Apart from the Phase 2 quantization change noted above, only the method-specific knobs (loss, dataset format, RL hyperparameters) vary per row. That's the load-bearing fairness invariant of the project.
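
One way to guard this invariant is to compare the shared settings of every run against the first run and report any mismatch. The sketch below assumes each run's config is available as a plain dictionary; the key names (`base_model`, `lora_r`, `lora_alpha`, `eval_tasks`, `seed`), the task labels, and the function name are hypothetical and do not reflect the project's actual config schema. Quantization is left out of the check because it differs between Phase 1 and Phase 2.

```python
# Hypothetical key names; the real config schema may differ.
SHARED_KEYS = ("base_model", "lora_r", "lora_alpha", "eval_tasks", "seed")


def shared_setting_violations(runs: dict[str, dict]) -> list[str]:
    """Return one message per shared setting that differs from the first run."""
    names = list(runs)
    reference = names[0]
    violations = []
    for key in SHARED_KEYS:
        expected = runs[reference].get(key)
        for name in names[1:]:
            actual = runs[name].get(key)
            if actual != expected:
                violations.append(
                    f"{name}.{key}={actual!r} differs from "
                    f"{reference}.{key}={expected!r}"
                )
    return violations


if __name__ == "__main__":
    shared = {
        "base_model": "Qwen/Qwen2.5-0.5B",
        "lora_r": 16,
        "lora_alpha": 32,
        # Placeholder task labels, not necessarily lm-eval task names.
        "eval_tasks": ("mmlu", "gsm8k", "truthfulqa", "ifeval"),
        "seed": 42,
    }
    runs = {
        "sft_v2": {**shared, "loss": "sft"},
        "dpo_v1": {**shared, "loss": "dpo", "beta": 0.1},
    }
    # Method-specific keys such as "loss" and "beta" are ignored by design.
    print(shared_setting_violations(runs))
```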