Foundation¶

The Phase 0 work that everything else builds on: the base model we chose, the evaluation harness every method runs through, and the baseline numbers that define what "better" means for the methods to come.

Holding all three constant across phases is what makes the cross-method comparison fair — methods can't choose their own eval set, their own model size, or their own definition of "good."

Pages in this section¶

Page	What it covers
Base model	Why pretrained `Qwen/Qwen2.5-0.5B` (model size, license, pretrained-not-aligned)
Eval harness	The four tasks (MMLU / GSM8K / TruthfulQA / IFEval), the lm-eval-harness wrapper, and how metrics are persisted
Baseline	Phase 0 numbers on the un-tuned base model — the bar every method has to clear

Base model¶

Qwen/Qwen2.5-0.5B (HF Hub) — a 494 M-parameter pretrained decoder-only LM from Alibaba (Sep 2024). Not the -Instruct variant — see the rationale below.

Why this model (not a 7B, not Llama, not Mistral)¶

Constraint	Implication	Why `Qwen/Qwen2.5-0.5B` satisfies it
Free-tier compute (Kaggle T4 16 GB)	Model must fit in ~16 GB with 4-bit base + LoRA + gradient checkpointing + batch=4 + optimizer states	0.5 B base is ~1 GB in 4-bit; total run footprint ~6–8 GB
Apache 2.0 license	Adapters can be pushed to HF Hub publicly, results can be shared	Qwen2.5 family is Apache 2.0
Pretrained, not aligned	SFT, DPO, PPO, GRPO each need room to actually move — re-SFT-ing already-aligned weights produces flat or regressing evals	The pretrained variant has not been instruction-tuned by Alibaba; each method has clear headroom
Comparable to known SFT recipe	Qwen's own `-Instruct` is essentially this base + SFT — gives a natural ceiling for what supervised tuning recovers	The `-Instruct` numbers sit in `metrics.json` as a reference; method results are compared to the pretrained base but against the Instruct ceiling
Small enough to iterate	Phase 1 + 2 + 3 all need to fit in weekend-scale time budgets	Full SFT run finishes in ~30 min on a Modal L4

Why we switched away from `-Instruct`¶

The first SFT attempt used Qwen2.5-0.5B-Instruct as the base. That regressed uniformly on every eval (−1 to −2.6pp). Re-SFT-ing already-aligned weights had nothing to do; the comparison-repo lever needs an unaligned starting point so each method can demonstrate its own contribution. See the README Learnings section and experiments/002 for the full story.

Why not bigger / not different¶

Qwen2.5-1.5B is the stretch base per PROJECT.md §7 if the Phase 0–6 comparison lands cleanly. Same family means the comparison scales without a new license / tokenizer / template story.
Llama-3.2-1B would also have worked. Qwen was picked because its tokenizer and chat template are well-documented and TRL's defaults handle them — with one patch — out of the box.
TinyLlama / Pythia 1B were considered. They satisfy the "pretrained, not aligned" criterion but lack the strong public Instruct comparator that Qwen ships, so we'd lose the "vs Qwen's own SFT recipe" ceiling.

Tokenizer routing¶

Pretrained Qwen2.5-0.5B has pad == eos == <|endoftext|> (an SFT supervision footgun — TRL masks pad in the labels, so the model never sees eos in supervision) and a chat template without {% generation %} markers (required for TRL's assistant-only loss mask). Both are fixed by loading the -Instruct tokenizer instead — the vocab is byte-identical, but pad=<|endoftext|>, eos=<|im_end|> (correctly distinct), and the chat template is patched at load time to inject {% generation %} markers. See src/atlas/models/base.py:patch_chat_template_for_assistant_mask.

Pinning¶

configs/base.yaml currently sets revision: null (latest HEAD on HF Hub). Per PROJECT.md, once the Phase 0–2 results are locked in for the final write-up, this should be pinned to the SHA seen at eval time so subsequent reruns are bit-identical.