LoRA and QLoRA¶
This page covers the parameter-efficient fine-tuning method this repo uses for
every post-training phase: QLoRA (LoRA + 4-bit base). The lora: block
in configs/base.yaml is held constant across SFT, DPO, KTO,
the reward model, PPO, and GRPO — the comparison across methods is the
contribution, so adapter capacity (rank, target modules, dropout) doesn't vary.
The quant: block has one method-specific exception: DPO (Phase 2)
disables 4-bit because it merges the SFT adapter into the base via
merge_and_unload before attaching the DPO LoRA — and merge_and_unload
needs full-precision weights. The merge-then-DPO recipe lives in
src/atlas/train/dpo.py;
the Phase 2 numbers are in
experiments/003.
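The merge-then-DPO ordering described above can be sketched with PEFT's public API. This is an illustration only, not the code in src/atlas/train/dpo.py: the function name, its arguments, and the choice of bf16 for the unquantized load are assumptions.

```python
def merge_sft_then_attach_dpo_lora(base_model_id, sft_adapter_dir, lora_kwargs):
    """Hypothetical helper: fold an SFT adapter into the base, then add a fresh LoRA."""
    # Heavy dependencies are imported lazily so this module imports without a GPU stack.
    import torch
    from peft import LoraConfig, PeftModel, get_peft_model
    from transformers import AutoModelForCausalLM

    # Load the base WITHOUT 4-bit quantization: merging adds the scaled B @ A
    # product into the dense weight matrices.
    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)

    # Attach the trained SFT adapter, then fuse it into the base weights.
    merged = PeftModel.from_pretrained(base, sft_adapter_dir).merge_and_unload()

    # Attach a new, untrained adapter that DPO will train on top of the merged model.
    return get_peft_model(merged, LoraConfig(task_type="CAUSAL_LM", **lora_kwargs))
```

Called with the shared lora: values, this would return a model whose only trainable parameters belong to the new DPO adapter.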
What LoRA is¶
LoRA (Low-Rank Adaptation) freezes the pretrained weights \(W \in \mathbb{R}^{d \times k}\) and learns a low-rank update:
\[ W' = W + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \]
with \(r \ll \min(d, k)\). Only \(A\) and \(B\) are trained; the forward pass adds the term \(\frac{\alpha}{r} \cdot B A x\) to the frozen layer's output. Hu et al. (2021) showed this can match full fine-tuning quality at a fraction of the trainable-parameter count, and attribute this to the hypothesis that the weight update learned during fine-tuning has low intrinsic rank.
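A minimal pure-Python sketch of that forward pass, using toy sizes and plain lists rather than any tensor library. The helper names are illustrative. The zero initialization of \(B\) follows the LoRA paper, so the adapter contributes nothing at step 0.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x) for one input vector x."""
    base = matvec(W, x)               # frozen path, length d
    update = matvec(B, matvec(A, x))  # low-rank path: k -> r -> d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d, k, r):
    """Trainable parameters in A (r x k) plus B (d x r)."""
    return r * (d + k)


d, k, r, alpha = 6, 4, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r, zero init
x = [1.0, -2.0, 0.5, 3.0]

# With B = 0 the adapter is a no-op, so training starts from the pretrained model.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
```

For a square 896 x 896 projection at r=16, `lora_param_count(896, 896, 16)` gives 28,672 trainable values against 802,816 in the frozen matrix.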
Paper
LoRA: Low-Rank Adaptation of Large Language Models — Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen (2021). arxiv:2106.09685
Why we use it here: the frozen base model can stay in 4-bit (see QLoRA below), so training-time memory for a 0.5B base + LoRA adapters is roughly 1 GB on a Kaggle T4 instead of ~10 GB for full fine-tuning in bf16. The same memory ratio holds at larger scales.
What QLoRA adds¶
QLoRA (Dettmers et al. 2023) is LoRA with three memory-saving changes that the paper reports do not hurt quality:
- 4-bit base weights: the frozen \(W\) is stored in 4 bits per parameter using a custom datatype called NF4 (NormalFloat-4). NF4's code-points are placed at the quantiles of a \(\mathcal{N}(0,1)\) distribution rather than evenly spaced, which is information-theoretically optimal when weights are normally distributed (LLM weights empirically are).
- Double quantization: the quantization constants themselves are quantized again, saving another ~0.4 bits per parameter at no quality cost.
- Paged optimizer states: uses NVIDIA's unified-memory paging for the optimizer to avoid OOM on long contexts. TRL's SFTTrainer enables this transparently when bitsandbytes is available.
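The "~0.4 bits per parameter" figure for double quantization can be reproduced with a few lines of arithmetic. The sketch assumes the settings stated in the QLoRA paper: weight blocks of 64 with a 32-bit constant each, then those constants quantized to 8 bits in blocks of 256 with one 32-bit constant per second-level block. The function names are illustrative.

```python
def single_quant_overhead_bits(block_size=64, const_bits=32):
    """Per-parameter cost of storing one full-precision constant per weight block."""
    return const_bits / block_size


def double_quant_overhead_bits(block_size=64, q_const_bits=8,
                               second_block=256, second_const_bits=32):
    """Per-parameter cost when the first-level constants are themselves quantized."""
    first_level = q_const_bits / block_size
    second_level = second_const_bits / (block_size * second_block)
    return first_level + second_level


single = single_quant_overhead_bits()  # 0.5 bits per parameter
double = double_quant_overhead_bits()  # about 0.127 bits per parameter
saving = single - double               # about 0.373 bits per parameter
```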
The 4-bit weights are dequantized on-the-fly to a higher-precision compute dtype (bf16 on Ampere+; fp16 on Turing/Pascal — see compute dtype) for the matmul, then discarded. Backprop flows through the LoRA matrices \(A\) and \(B\) in the compute dtype; the frozen base never has gradients.
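To make the quantile idea and the dequantize-on-the-fly step concrete, here is a simplified sketch using only the standard library. It is not the bitsandbytes implementation: the real NF4 table is built asymmetrically so that zero is exactly representable, and the kernels run on the GPU. The function names and the example block are assumptions for illustration.

```python
from statistics import NormalDist


def quantile_codebook(bits=4):
    """2**bits code points at evenly spaced N(0,1) quantiles, rescaled to [-1, 1]."""
    n = 2 ** bits
    nd = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at 0 and 1.
    raw = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def quantize_block(block, codebook):
    """Return (absmax, indices): one scale per block plus one code index per weight."""
    absmax = max(abs(w) for w in block) or 1.0
    indices = [
        min(range(len(codebook)), key=lambda j: abs(codebook[j] - w / absmax))
        for w in block
    ]
    return absmax, indices


def dequantize_block(absmax, indices, codebook):
    """Recover approximate weights in the compute dtype for the matmul."""
    return [absmax * codebook[j] for j in indices]


codebook = quantile_codebook()
block = [0.02, -0.5, 0.13, 0.5, -0.07, 0.31]
absmax, indices = quantize_block(block, codebook)
restored = dequantize_block(absmax, indices, codebook)
```

Code points cluster near zero, where most normally distributed weights fall, so small weights are reconstructed more finely than they would be on an evenly spaced 16-level grid.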
Paper
QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers, Pagnoni, Holtzman, Zettlemoyer (2023). arxiv:2305.14314
Our config¶
These values are in configs/base.yaml and inherited by every
experiment. Where the paper recommends a value, we use it; where the choice
is repo-specific (compute budget, model size, comparison constraint), the
reasoning is called out.
LoRA¶
| Knob | Value | Source | Why |
|---|---|---|---|
| r (rank) | 16 | repo-specific | QLoRA paper swept r ∈ {8, 16, 64}; 16 is the smallest that didn't underperform on instruction-following for sub-1B models. r=8 underfits IFEval; r=64 is wasted capacity for a 0.5B base + 5k UltraChat. |
| alpha | 32 | paper default | Sets the effective update scale \(\alpha/r = 2.0\). The "α = 2r" rule comes from QLoRA's empirics; it keeps update magnitude consistent across rank choices, so changing r doesn't require re-tuning the learning rate. |
| dropout | 0.05 | paper default | Low end of QLoRA's 0.05–0.1 range because 5k UltraChat is a small-data regime and we don't want to over-regularize. |
| target_modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj | paper default | "Attention + MLP": every linear projection in a Qwen2 decoder block. QLoRA Table 4 showed this beats attention-only ("q,v" classic LoRA) by ~3 pts on MMLU, and the cost is small because all layers are still 4-bit underneath. |
| bias | "none" | paper default (hard-coded in adapters.py) | Don't train bias terms; saves params for no measurable quality gain on SFT. |
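Trainable parameter count for Qwen2.5-0.5B with this config: roughly 8.8 M of 494 M total (≈1.8%), assuming the published model shapes (hidden size 896, MLP width 4864, 24 layers, 128-wide key/value projections). The four attention projections alone would account for about 2.2 M (≈0.4%).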
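The count can be checked by hand, since each adapted projection with input width `in` and output width `out` adds `r * (in + out)` trainable values. The sketch below hard-codes layer shapes that are assumed to match Qwen2.5-0.5B (hidden size 896, MLP width 4864, 128-wide key/value projections, 24 decoder layers). Under those shapes the seven-module config comes to about 8.8 M and the four attention projections alone to about 2.2 M, so it is worth confirming against PEFT's `print_trainable_parameters()` on the real model.

```python
# Assumed Qwen2.5-0.5B shapes: (input width, output width) per projection.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 896, 4864, 128, 24

PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(r, targets, layers=LAYERS):
    """Sum r * (in + out) over the targeted projections, across all decoder layers."""
    per_layer = sum(r * (PROJ_SHAPES[t][0] + PROJ_SHAPES[t][1]) for t in targets)
    return per_layer * layers


all_linear = lora_trainable_params(16, list(PROJ_SHAPES))
attention_only = lora_trainable_params(16, ["q_proj", "k_proj", "v_proj", "o_proj"])
```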
Quantization¶
| Knob | Value | Source | Why |
|---|---|---|---|
| load_in_4bit | true (SFT/PPO/GRPO/KTO); false for DPO | repo-wide with one override | Even though the 0.5B base fits without quantization, keeping 4-bit on means the code path is identical when we scale to larger bases later. The DPO YAML overrides to false because the merge-then-DPO recipe (fuse sft_v2 into the base before attaching the DPO LoRA) requires full-precision weights. |
| bnb_4bit_quant_type | nf4 | paper default | NormalFloat-4 outperforms fp4 by ~1 pt on the QLoRA benchmark: "free quality" with no extra cost. |
| bnb_4bit_compute_dtype | bfloat16 | paper default (Ampere+) | On free Kaggle T4 / Colab P100, the Kaggle notebook auto-patches this to float16 because Turing/Pascal emulate bf16 in software at ~half speed. |
| double_quant | true | paper default | Free 0.4 bits/param savings. Always on. |
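For readers who want to see how the two tables map onto library objects, here is a hedged sketch using PEFT's `LoraConfig` and Transformers' `BitsAndBytesConfig`. The dictionaries mirror the values above. The mapping from this repo's YAML keys to library keyword arguments (for example double_quant to `bnb_4bit_use_double_quant`) and the `build_configs` helper are assumptions, not the contents of adapters.py.

```python
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",  # resolved to a torch dtype below
    "bnb_4bit_use_double_quant": True,
}


def build_configs(lora_kwargs=LORA_KWARGS, quant_kwargs=QUANT_KWARGS):
    """Hypothetical helper: turn the plain dicts into library config objects."""
    # Imported lazily so the dictionaries can be inspected without the GPU stack.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant = dict(quant_kwargs)
    quant["bnb_4bit_compute_dtype"] = getattr(torch, quant["bnb_4bit_compute_dtype"])
    return LoraConfig(**lora_kwargs), BitsAndBytesConfig(**quant)
```

A DPO run would pass a copy of `QUANT_KWARGS` with `load_in_4bit` set to `False`, or skip the quantization config entirely.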
Training hyperparameters by phase¶
The LoRA block (rank, target modules, dropout) is constant across phases. Method-specific training knobs vary by phase and live in their own configs.
Phase 1 / SFT (configs/sft_qwen05b.yaml):
| Knob | Value | Why |
|---|---|---|
| learning_rate | 2.0e-4 | QLoRA paper recommendation for LoRA adapters with r=16–64. Higher than full fine-tuning would tolerate because LoRA's low-rank update bounds the effective per-step weight change. |
| per_device_train_batch_size | 4 | Memory ceiling with 0.5B + 4-bit + LoRA + gradient checkpointing. |
| gradient_accumulation_steps | 4 | Effective batch = 16. Standard for instruction-tuning at this scale. |
| gradient_checkpointing | true | Required to fit batch=4 inside 16 GB. Trades ~30% compute for ~50% memory. |
| warmup_ratio | 0.03 | 3% of steps. Standard for short fine-tunes; longer warmup costs progress when total steps is ~313. |
| num_train_epochs | 1 | One pass over the 5k slice. |
| n_samples | 5000 | Per PROJECT.md §4.1. Small enough to iterate, large enough for SFT to move IFEval. |
| assistant_only_loss | true | Mask user/system turns out of the loss. TRL 1.4 defaults this to False; the lack of this flag in Phase 1 v1 was a real bug (see SFT method page). |
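The "effective batch = 16" and "~313 steps" figures follow from the table by simple arithmetic, sketched below. The `schedule` helper is illustrative and assumes a single GPU and no packing or filtering of examples. The exact rounding of update steps and warmup steps depends on the trainer version, so treat the results as estimates.

```python
import math


def schedule(n_samples, per_device_batch, grad_accum, epochs, warmup_ratio, n_devices=1):
    """Return (effective_batch, total_optimizer_steps, warmup_steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


# Phase 1 / SFT values from the table above.
sft_effective_batch, sft_total_steps, sft_warmup_steps = schedule(
    n_samples=5000, per_device_batch=4, grad_accum=4, epochs=1, warmup_ratio=0.03
)
```

With these inputs the helper yields an effective batch of 16, 313 optimizer steps, and 10 warmup steps.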
Phase 2 / DPO (configs/dpo_qwen05b.yaml):
| Knob | Value | Why |
|---|---|---|
| learning_rate | 5.0e-6 | 1–2 orders of magnitude lower than SFT's 2e-4 because the DPO loss surface is steeper than cross-entropy (grad norm ≈ 25× at init). HF alignment-handbook range. |
| beta | 0.1 | Mitchell's recommendation for HH-style preference data; TRL's default. Conservative starting point. |
| loss_type | sigmoid | Canonical DPO. TRL also exposes ipo, hinge, kto_pair. |
| per_device_train_batch_size | 2 | DPO duplicates the forward (policy + ref) so smaller per-device batch than SFT; gradient_accumulation_steps: 8 keeps effective batch = 16. |
| max_length | 1024 | UltraFeedback prompts + responses can be long; cap to bound memory. TRL 1.4 dropped max_prompt_length; max_length alone now governs the combined sequence. |
| n_samples | 5000 | Matches SFT's row count for an apples-to-apples compute envelope. |
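To show what beta and the sigmoid loss_type control, here is a minimal sketch of the per-example DPO objective in pure Python. Inputs are assumed to be summed sequence log-probabilities of the chosen and rejected responses under the policy and the frozen reference. The function name and signature are illustrative, not TRL's API.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2) regardless of beta.
loss_at_init = dpo_sigmoid_loss(-11.0, -11.0, -11.0, -11.0)

# Raising the chosen log-prob and lowering the rejected one reduces the loss.
loss_after_update = dpo_sigmoid_loss(-10.0, -12.0, -11.0, -11.0)
```

A larger beta makes the same log-ratio gap count for more, which is one reason a small value such as 0.1 is the conservative starting point.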
Decisions worth understanding¶
Why r=16 specifically (not 8 or 32)¶
The QLoRA paper (Table 9) swept r ∈ {8, 16, 32, 64, 256} on a 7B base. Below r=16 the paper saw underfitting on harder reasoning tasks; above r=64 was wasted compute. For a 0.5B base the "underfit boundary" shifts lower — but r=16 is the conservative choice that still has headroom.
Phase 1 (sft_v2) came in flat on IFEval prompt-strict (−0.37pp vs
pretrained base) — exactly the "obvious capacity ceiling" signal that would
normally argue for bumping to r=32. But Phase 2 DPO on top of sft_v2
cleared the bar (+0.74pp), suggesting the bottleneck was the method on 5k
rows, not adapter capacity. r=32 stays parked as a Phase 7 stretch
experiment rather than the next move.
Why the same lora: block across all methods¶
The PROJECT.md anti-goal: "the comparison [across methods] is the contribution." If LoRA rank were tuned per method, SFT might end up at r=32 and DPO at r=8, and we would be comparing "best-tuned SFT" to "median DPO", which proves nothing about the methods themselves. So the LoRA capacity is held constant; only the method-specific knobs (loss, dataset format, RL-specific hparams) vary per experiment.
When not to use this config¶
- Different base model size: a 7B base would want r=8 (relatively smaller) for the same per-method budget; a 100 M base might want r=32 to have enough capacity.
- Continued pretraining (as opposed to task fine-tuning): train all the weights rather than LoRA adapters.
- Catastrophic-forgetting-sensitive domains (e.g. preserving code while teaching math) are not a reason to switch: LoRA helps here precisely because the base is frozen, so keep the config.
Compute dtype on T4 / P100¶
bf16 doesn't have hardware support on Turing (T4) or Pascal (P100). The math
still works — bnb dequantizes to bf16 and the GPU emulates it — but
throughput drops by roughly 50%. The Kaggle notebook
detects T4/P100 and rewrites dtype: bfloat16 → dtype: float16 in both
configs/base.yaml entries before training. fp16 has native
hardware support on both cards (the P100 predates Volta's Tensor Cores but
still runs fp16 natively), so the switch is a clear throughput win
for free-tier training. On Ampere (A100, RTX 30-series), L4, or Hopper
(H100), leave the default.
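The selection rule can be expressed against the CUDA compute capability, where major version 8 and above (Ampere, Ada, Hopper) has native bf16. This is a sketch of the idea, not the Kaggle notebook's actual detection code. The function name is an assumption.

```python
def pick_compute_dtype(cc_major: int) -> str:
    """Choose the 4-bit compute dtype from the GPU's compute capability major version."""
    # P100 is 6.0 and T4 is 7.5: no native bf16, so fall back to fp16.
    # A100 is 8.0, L4 is 8.9, H100 is 9.0: keep bf16.
    return "bfloat16" if cc_major >= 8 else "float16"


# With PyTorch available, the major version would come from
# torch.cuda.get_device_capability(), which returns a (major, minor) tuple.
examples = {name: pick_compute_dtype(major)
            for name, major in [("P100", 6), ("T4", 7), ("A100", 8), ("H100", 9)]}
```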
References¶
- Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arxiv:2106.09685
- Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arxiv:2305.14314
- Hugging Face PEFT: huggingface.co/docs/peft
- bitsandbytes 4-bit: github.com/bitsandbytes-foundation/bitsandbytes