post-training-lab¶
A controlled, reproducible comparison of LLM post-training methods —
SFT, DPO, KTO, RLHF (reward model + PPO), and GRPO — applied to the same
small base model (Qwen/Qwen2.5-0.5B, pretrained — not the -Instruct
variant) with the same evaluation harness.
The comparison is the contribution, not any single method. Same LoRA rank, same quantization, same evaluation tasks, same JSON schema — apples-to-apples by construction.
Where to start¶
- Results — current numbers for base / sft_v2 / dpo_v1.
- Methods — explainers for each post-training method, including paper links and the exact hyperparameters this repo uses.
- LoRA / QLoRA — the parameter-efficient setup shared across every phase.
Status¶
| Phase | Scope | Status |
|---|---|---|
| 0 | Scaffolding + baseline eval | done |
| 1 | SFT on UltraChat-200k | done (sft_v2) |
| 2 | DPO on UltraFeedback-binarized | done (dpo_v1) |
| 3 | Reward model + PPO | next |
| 4 | GRPO / RLVR | planned |
| 5 | KTO or ORPO | planned |
| 6 | LLM-judge comparison + writeup | planned |
See PROJECT.md for the full charter and per-phase plan.