A tiny GPT you can read end‑to‑end.
From‑scratch GPT‑style Transformer trained on TinyStories in ~1 hour on a single T4. Decoder‑only with RoPE, pre‑RMSNorm, and SwiGLU. Code favors clarity over cleverness; the Python path intentionally omits a KV‑cache for readability.
Source on GitHub.
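The three architecture choices above, in rough PyTorch form (module and function names here are illustrative, not the repo's actual ones):

```python
# Illustrative PyTorch sketches of RMSNorm, SwiGLU, and RoPE; names are not the repo's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-norm: rescale by reciprocal RMS; no mean subtraction or bias (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(nn.Module):
    """Feed-forward block: SiLU-gated linear unit instead of a plain GELU MLP."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def apply_rope(x, theta: float = 10000.0):
    """Rotary position embedding: rotate (even, odd) feature pairs of q/k by
    position-dependent angles. x: (batch, heads, seq, head_dim), head_dim even."""
    _, _, seq, dim = x.shape
    pos = torch.arange(seq, device=x.device, dtype=torch.float32)
    freqs = theta ** (-torch.arange(0, dim, 2, device=x.device, dtype=torch.float32) / dim)
    ang = pos[:, None] * freqs[None, :]          # (seq, dim/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

In a pre-norm block, each sublayer computes `x + sublayer(norm(x))`, with RoPE applied to the queries and keys inside attention.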
Run headline: val loss 2.713 (ppl 15.07) after 23.04M tokens at ~7.5k tok/s; 12 layers · d_model 384 · 6 heads.
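The two headline numbers are one quantity (perplexity is the exponentiated mean cross-entropy loss), and the token count follows from the training config below:

```python
import math

print(math.exp(2.713))    # ppl = exp(val loss) ≈ 15.07
print(5000 * 24 * 192)    # tokens seen = steps * batch_size * context_length = 23,040,000
```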
Samples were generated with temperature 0.8, top‑k 40.
```bash
python tools/prepare_hf.py \
  dataset_name=roneneldan/TinyStories text_field=text \
  out_dir=data/Prepared/TinyStoriesHF append_eot=true \
  dtype=uint16 tokenizer_name=gpt2
```
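The prep step produces flat binary files of GPT‑2 BPE token ids. Assuming `tools/prepare_hf.py` follows the usual pattern (it isn't reproduced here), the core is roughly:

```python
# Rough sketch of the prep step under assumed behavior; not the actual
# tools/prepare_hf.py implementation.
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")          # tokenizer_name=gpt2
ds = load_dataset("roneneldan/TinyStories")  # dataset_name

def to_bin(split, path):
    ids = []
    for row in ds[split]:
        toks = enc.encode_ordinary(row["text"])  # text_field=text
        toks.append(enc.eot_token)               # append_eot=true
        ids.extend(toks)
    arr = np.array(ids, dtype=np.uint16)         # dtype=uint16 (GPT-2 vocab fits in 16 bits)
    arr.tofile(path)

to_bin("train", "data/Prepared/TinyStoriesHF/train.bin")
to_bin("validation", "data/Prepared/TinyStoriesHF/val.bin")
```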
```bash
python -m sprintlm.train.train \
  dataset.train_bin=data/Prepared/TinyStoriesHF/train.bin \
  dataset.val_bin=data/Prepared/TinyStoriesHF/val.bin \
  train.batch_size=24 dataset.context_length=192 \
  train.max_steps=5000 train.log_every=50 \
  train.eval_every=250 train.save_every=250
```
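The dataloader side isn't shown here; given flat uint16 token files, a typical way to draw batches at `context_length=192` is a memmap plus random offsets (illustrative, the repo's loader may differ):

```python
# Illustrative batch sampling from a flat uint16 token file.
import numpy as np
import torch

def get_batch(data, batch_size=24, context_length=192, device="cuda"):
    """data: np.memmap of token ids; returns (inputs, next-token targets)."""
    ix = np.random.randint(0, len(data) - context_length - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i:i + context_length].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + context_length].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)

train_data = np.memmap("data/Prepared/TinyStoriesHF/train.bin", dtype=np.uint16, mode="r")
xb, yb = get_batch(train_data)
```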
```bash
python tools/plot_metrics.py outputs
python tools/decode_cli.py --ckpt outputs/<RUN>/ckpts/step5000.pt \
  --prompt "Once upon a time, " --context_length 192 \
  --temperature 0.8 --top_k 40
```
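The `--temperature` and `--top_k` flags map onto the standard sampling step below (a sketch of the usual technique; `decode_cli.py`'s exact code may differ):

```python
# One sampling step with temperature and top-k; illustrative, not the CLI's code.
import torch

def sample_next(logits, temperature=0.8, top_k=40):
    """logits: (vocab,) for the last position; returns one sampled token id."""
    logits = logits / temperature            # <1 sharpens, >1 flattens the distribution
    v, _ = torch.topk(logits, top_k)
    logits[logits < v[-1]] = float("-inf")   # mask everything outside the top-k
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```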
Platform notes: on an M4 MacBook (16 GB), use batch_size≈8‑12 with MPS; an A100/H100 can scale to batch_size≥64 and context≥512.
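A simple device fallback covers both setups; whether the trainer autodetects the device or expects a config override isn't stated here, so treat this as a sketch:

```python
# Illustrative device pick (CUDA, then Apple MPS, then CPU).
import torch

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"   # M4 MacBook path; pair with batch_size≈8-12
else:
    device = "cpu"
```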
Limits. TinyStories domain (child‑story style); not instruction‑tuned; Python decoder is intentionally minimal (no KV‑cache); GPT‑2 BPE only; no dataset filtering or dedup.
Next. Minimal C++17 CPU inference (weights export → single‑file greedy decoder); optional safetensors release; KV‑cache in Python for longer generations; basic quantization (int8/gguf) and latency measurements on M4 CPU; Colab quickstart notebook; model card on HF.
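For reference, the planned Python KV‑cache amounts to keeping each layer's past keys/values so each new token attends over cached history instead of re‑running the full prefix; a minimal sketch of the idea (not the planned API):

```python
# Minimal single-layer KV-cache idea for incremental decoding (illustrative).
import torch

class KVCache:
    def __init__(self):
        self.k = None  # (batch, heads, past_len, head_dim)
        self.v = None

    def append(self, k_new, v_new):
        """Concatenate this step's keys/values onto the cache; return full history."""
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

# Each generation step projects only the newest token to q, k_new, v_new,
# calls cache.append(k_new, v_new), and attends q against the cached k/v.
```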