Foundation¶
The Phase 0 work that everything else builds on: the base model we chose, the evaluation harness every method runs through, and the baseline numbers that define what "better" means for the methods to come.
Holding all three constant across phases is what makes the cross-method comparison fair — methods can't choose their own eval set, their own model size, or their own definition of "good."
Pages in this section¶
| Page | What it covers |
|---|---|
| Base model | Why pretrained Qwen/Qwen2.5-0.5B (model size, license, pretrained-not-aligned) |
| Eval harness | The four tasks (MMLU / GSM8K / TruthfulQA / IFEval), the lm-eval-harness wrapper, and how metrics are persisted |
| Baseline | Phase 0 numbers on the un-tuned base model — the bar every method has to clear |
Base model¶
Qwen/Qwen2.5-0.5B (HF Hub): a 494 M-parameter pretrained decoder-only LM from Alibaba (Sep 2024).
Not the -Instruct variant; see the rationale below.
Why this model (not a 7B, not Llama, not Mistral)¶
| Constraint | Implication | Why Qwen/Qwen2.5-0.5B satisfies it |
|---|---|---|
| Free-tier compute (Kaggle T4 16 GB) | Model must fit in ~16 GB with 4-bit base + LoRA + gradient checkpointing + batch=4 + optimizer states | 0.5 B base is ~1 GB in 4-bit; total run footprint ~6–8 GB |
| Apache 2.0 license | Adapters can be pushed to HF Hub publicly, results can be shared | Qwen2.5 family is Apache 2.0 |
| Pretrained, not aligned | SFT, DPO, PPO, GRPO each need room to actually move — re-SFT-ing already-aligned weights produces flat or regressing evals | The pretrained variant has not been instruction-tuned by Alibaba; each method has clear headroom |
| Comparable to known SFT recipe | Qwen's own -Instruct is essentially this base + SFT, which gives a natural ceiling for what supervised tuning recovers | The -Instruct numbers sit in metrics.json as a reference; method results are measured from the pretrained base as the floor, with the Instruct numbers as the ceiling |
| Small enough to iterate | Phase 1 + 2 + 3 all need to fit in weekend-scale time budgets | Full SFT run finishes in ~30 min on a Modal L4 |
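The weight-memory figures in the table can be sanity-checked with a back-of-envelope calculation. The sketch below is an illustration only: it counts weights alone and assumes every parameter is stored at the stated bit width, so it ignores quantization constants, any layers left unquantized, LoRA adapters, optimizer states, and activations. The helper name `weight_memory_gb` is hypothetical and not part of the repo.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 494e6  # parameter count quoted for Qwen/Qwen2.5-0.5B above

for label, bits in [("bf16", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(N_PARAMS, bits):.2f} GB")
# bf16: 0.99 GB
# 4-bit: 0.25 GB
```

On this estimate the 4-bit weights land well under the ~1 GB quoted in the table, so that figure reads as a conservative bound; the bulk of the ~6–8 GB run footprint would then come from activations, optimizer states, and framework overhead rather than from the base weights.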
Why we switched away from -Instruct¶
The first SFT attempt used Qwen2.5-0.5B-Instruct as the base. That regressed
on every eval (−1 to −2.6 pp). Running SFT again on already-aligned weights
left the method nothing to add; a method-comparison repo needs an unaligned
starting point so each method can demonstrate its own contribution. See the
README Learnings section and experiments/002 for the full story.
Why not bigger / not different¶
- Qwen2.5-1.5B is the stretch base per PROJECT.md §7 if the Phase 0–6 comparison lands cleanly. Same family means the comparison scales without a new license / tokenizer / template story.
- Llama-3.2-1B would also have worked. Qwen was picked because its tokenizer and chat template are well-documented and TRL's defaults handle them with a single patch (see Tokenizer routing).
- TinyLlama / Pythia 1B were considered. They satisfy the "pretrained, not aligned" criterion but lack the strong public Instruct comparator that Qwen ships, so we'd lose the "vs Qwen's own SFT recipe" ceiling.
Tokenizer routing¶
Pretrained Qwen2.5-0.5B has two tokenizer problems for SFT. First,
pad == eos == <|endoftext|>: TRL masks pad in the labels, so the model never
sees eos in supervision. Second, its chat template has no {% generation %}
markers, which TRL's assistant-only loss mask requires. Both are fixed by
loading the -Instruct tokenizer instead: the vocab is byte-identical, but it
sets pad=<|endoftext|> and eos=<|im_end|> (correctly distinct), and its chat
template is patched at load time to inject {% generation %} markers. See
src/atlas/models/base.py:patch_chat_template_for_assistant_mask.
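To make the pad == eos footgun concrete, here is a toy illustration in plain Python. It is a sketch, not the repo's or TRL's actual collator: the token ids are invented for the example, and `mask_pad_in_labels` is a hypothetical stand-in for any collator that hides pad positions from the loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_pad_in_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, hiding every pad position from the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Toy ids, made up for this example (not the real Qwen vocabulary ids).
ENDOFTEXT, IM_END, HELLO, WORLD = 0, 1, 10, 11

# Base tokenizer: pad == eos == <|endoftext|>, so the eos is masked too.
base_ids = [HELLO, WORLD, ENDOFTEXT, ENDOFTEXT, ENDOFTEXT]  # text, eos, pad, pad
base_labels = mask_pad_in_labels(base_ids, pad_token_id=ENDOFTEXT)

# -Instruct tokenizer: pad = <|endoftext|>, eos = <|im_end|>, so eos survives.
instruct_ids = [HELLO, WORLD, IM_END, ENDOFTEXT, ENDOFTEXT]  # text, eos, pad, pad
instruct_labels = mask_pad_in_labels(instruct_ids, pad_token_id=ENDOFTEXT)

print(base_labels)      # [10, 11, -100, -100, -100]
print(instruct_labels)  # [10, 11, 1, -100, -100]
```

In the first case no label position carries the end-of-sequence token, so the loss never teaches the model when to stop; in the second the eos id stays in the labels while the padding is still ignored.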
Pinning¶
configs/base.yaml currently sets revision: null (latest HEAD on HF Hub).
Per PROJECT.md, once the Phase 0–2 results are locked in for the final
write-up, this should be pinned to the SHA seen at eval time so subsequent
reruns are bit-identical.
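One way to keep an unpinned revision from slipping into a final run is a small guard on the config value. The sketch below is an assumption about how such a check could look, not code from the repo: `is_pinned` and `require_pinned` are hypothetical helpers, and the check relies on HF Hub commit ids being 40-character hexadecimal git SHAs. The accepted value can then be passed as the `revision` argument that `from_pretrained` in transformers already takes.

```python
import re

# A full git commit SHA: exactly 40 lowercase hex characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned(revision):
    """True only for a full commit SHA; None, branches, and tags are unpinned."""
    return isinstance(revision, str) and _FULL_SHA.fullmatch(revision) is not None


def require_pinned(revision):
    """Return the revision unchanged, or raise if it could drift between runs."""
    if not is_pinned(revision):
        raise ValueError(
            f"revision={revision!r} is not a full commit SHA; "
            "pin it before producing final numbers"
        )
    return revision


print(is_pinned(None))      # False (the current `revision: null`)
print(is_pinned("main"))    # False (a branch name can move)
print(is_pinned("0" * 40))  # True  (placeholder SHA, not a real commit)
```

A branch or tag name is rejected on purpose: both can be moved to a different commit later, which is exactly the drift that pinning is meant to rule out.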