post-training-lab

A controlled, reproducible comparison of LLM post-training methods (SFT, DPO, KTO, RLHF with a reward model and PPO, and GRPO), all applied to the same small base model and scored with the same evaluation harness. The base model is Qwen/Qwen2.5-0.5B, the pretrained checkpoint rather than the -Instruct variant.

The comparison is the contribution, not any single method. Every run uses the same LoRA rank, the same quantization, the same evaluation tasks, and the same JSON schema, so the results are apples-to-apples by construction.
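One way to hold these settings constant is to keep them in a single immutable object that every phase reads from. The sketch below is an illustration only, not the repo's actual code: `SharedSetup`, `run_config`, the rank, the quantization label, and the task names are all hypothetical placeholders. Only the base model name comes from this page.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SharedSetup:
    """Settings held constant across methods (placeholder values, not the repo's)."""
    base_model: str = "Qwen/Qwen2.5-0.5B"      # from this README
    lora_rank: int = 16                         # hypothetical value
    quantization: str = "4bit"                  # hypothetical label
    eval_tasks: tuple = ("task_a", "task_b")    # hypothetical task names


SHARED = SharedSetup()


def run_config(method: str, **method_specific) -> dict:
    """Merge the shared setup with method-specific hyperparameters.

    Raises ValueError if a method tries to override a shared setting,
    which would break the like-for-like comparison.
    """
    shared = asdict(SHARED)
    clash = shared.keys() & method_specific.keys()
    if clash:
        raise ValueError(
            f"{method} may not override shared settings: {sorted(clash)}"
        )
    return {"method": method, **shared, **method_specific}


# Method-specific values here are illustrative only.
sft = run_config("sft", learning_rate=2e-4)
dpo = run_config("dpo", beta=0.1)
```

With this shape, a phase can add its own hyperparameters freely, while an attempt such as `run_config("ppo", lora_rank=64)` fails early instead of silently producing a run that is no longer comparable.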

Where to start

Results — current numbers for base / sft_v2 / dpo_v1.
Methods — explainers for each post-training method, including paper links and the exact hyperparameters this repo uses.
LoRA / QLoRA — the parameter-efficient setup shared across every phase.
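As a rough illustration of why the LoRA setup counts as parameter-efficient: for a frozen weight matrix of shape `d_out x d_in`, LoRA trains only two low-rank factors, `B` (`d_out x r`) and `A` (`r x d_in`). QLoRA additionally stores the frozen base weights in a quantized format. The sketch below counts trainable parameters for one layer; the dimensions and rank are made-up example values, not the ones this repo uses.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix of shape d_out x d_in."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
d_in = d_out = 1024
rank = 8

full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
fraction = lora / full

print(f"full: {full}, LoRA: {lora}, trainable fraction: {fraction:.4%}")
```

The trainable count grows linearly with the rank, which is why fixing the rank across phases keeps the adapter capacity the same for every method.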

Status

Phase	Scope	Status
0	Scaffolding + baseline eval	done
1	SFT on UltraChat-200k	done (sft_v2)
2	DPO on UltraFeedback-binarized	done (dpo_v1)
3	Reward model + PPO	next
4	GRPO / RLVR	planned
5	KTO or ORPO	planned
6	LLM-judge comparison + writeup	planned

See PROJECT.md for the full charter and per-phase plan.