A tiny GPT you can read end‑to‑end.
From‑scratch GPT‑style Transformer trained on TinyStories in ~1 hour on a single T4. Decoder‑only with RoPE, pre‑RMSNorm, and SwiGLU. Code favors clarity over cleverness; the Python path intentionally omits a KV‑cache for readability.
Source on GitHub.
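The three architecture choices above, in rough PyTorch form (module and function names here are illustrative, not the repo's actual ones):

```python
# Illustrative PyTorch sketches of RMSNorm, SwiGLU, and RoPE; names are not the repo's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-norm: rescale by reciprocal RMS; no mean subtraction or bias (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(nn.Module):
    """Feed-forward block: SiLU-gated linear unit instead of a plain GELU MLP."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def apply_rope(x, theta: float = 10000.0):
    """Rotary position embedding: rotate (even, odd) feature pairs of q/k by
    position-dependent angles. x: (batch, heads, seq, head_dim), head_dim even."""
    _, _, seq, dim = x.shape
    pos = torch.arange(seq, device=x.device, dtype=torch.float32)
    freqs = theta ** (-torch.arange(0, dim, 2, device=x.device, dtype=torch.float32) / dim)
    ang = pos[:, None] * freqs[None, :]          # (seq, dim/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

In a pre-norm block, each sublayer computes `x + sublayer(norm(x))`, with RoPE applied to the queries and keys inside attention.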
Run headline: val loss 2.713 (ppl 15.07) after 23.04M tokens at ~7.5k tok/s; 12 layers · d_model 384 · 6 heads.
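The two headline numbers are one quantity (perplexity is the exponentiated mean cross-entropy loss), and the token count follows from the training config below:

```python
import math

print(math.exp(2.713))    # ppl = exp(val loss) ≈ 15.07
print(5000 * 24 * 192)    # tokens seen = steps * batch_size * context_length = 23,040,000
```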
Samples were generated with temperature 0.8, top‑k 40.
```bash
python tools/prepare_hf.py \
  dataset_name=roneneldan/TinyStories text_field=text \
  out_dir=data/Prepared/TinyStoriesHF append_eot=true \
  dtype=uint16 tokenizer_name=gpt2
```
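The prep step produces flat binary files of GPT‑2 BPE token ids. Assuming `tools/prepare_hf.py` follows the usual pattern (it isn't reproduced here), the core is roughly:

```python
# Rough sketch of the prep step under assumed behavior; not the actual
# tools/prepare_hf.py implementation.
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")          # tokenizer_name=gpt2
ds = load_dataset("roneneldan/TinyStories")  # dataset_name

def to_bin(split, path):
    ids = []
    for row in ds[split]:
        toks = enc.encode_ordinary(row["text"])  # text_field=text
        toks.append(enc.eot_token)               # append_eot=true
        ids.extend(toks)
    arr = np.array(ids, dtype=np.uint16)         # dtype=uint16 (GPT-2 vocab fits in 16 bits)
    arr.tofile(path)

to_bin("train", "data/Prepared/TinyStoriesHF/train.bin")
to_bin("validation", "data/Prepared/TinyStoriesHF/val.bin")
```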
```bash
python -m sprintlm.train.train \
  dataset.train_bin=data/Prepared/TinyStoriesHF/train.bin \
  dataset.val_bin=data/Prepared/TinyStoriesHF/val.bin \
  train.batch_size=24 dataset.context_length=192 \
  train.max_steps=5000 train.log_every=50 \
  train.eval_every=250 train.save_every=250
```
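The dataloader side isn't shown here; given flat uint16 token files, a typical way to draw batches at `context_length=192` is a memmap plus random offsets (illustrative, the repo's loader may differ):

```python
# Illustrative batch sampling from a flat uint16 token file.
import numpy as np
import torch

def get_batch(data, batch_size=24, context_length=192, device="cuda"):
    """data: np.memmap of token ids; returns (inputs, next-token targets)."""
    ix = np.random.randint(0, len(data) - context_length - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i:i + context_length].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + context_length].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)

train_data = np.memmap("data/Prepared/TinyStoriesHF/train.bin", dtype=np.uint16, mode="r")
xb, yb = get_batch(train_data)
```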
```bash
python tools/plot_metrics.py outputs
python tools/decode_cli.py --ckpt outputs/<RUN>/ckpts/step5000.pt \
  --prompt "Once upon a time, " --context_length 192 \
  --temperature 0.8 --top_k 40
```
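The `--temperature` and `--top_k` flags map onto the standard sampling step below (a sketch of the usual technique; `decode_cli.py`'s exact code may differ):

```python
# One sampling step with temperature and top-k; illustrative, not the CLI's code.
import torch

def sample_next(logits, temperature=0.8, top_k=40):
    """logits: (vocab,) for the last position; returns one sampled token id."""
    logits = logits / temperature            # <1 sharpens, >1 flattens the distribution
    v, _ = torch.topk(logits, top_k)
    logits[logits < v[-1]] = float("-inf")   # mask everything outside the top-k
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```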
Platform notes: on an M4 MacBook (16 GB), use batch_size≈8‑12 with MPS; an A100/H100 can scale to batch_size≥64 and context≥512.
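A simple device fallback covers both setups; whether the trainer autodetects the device or expects a config override isn't stated here, so treat this as a sketch:

```python
# Illustrative device pick (CUDA, then Apple MPS, then CPU).
import torch

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"   # M4 MacBook path; pair with batch_size≈8-12
else:
    device = "cpu"
```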
Limits. TinyStories domain (child‑story style); not instruction‑tuned; Python decoder is intentionally minimal (no KV‑cache); GPT‑2 BPE only; no dataset filtering or dedup.
Next. Minimal C++17 CPU inference (weights export → single‑file greedy decoder); optional safetensors release; KV‑cache in Python for longer generations; basic quantization (int8/gguf) and latency measurements on M4 CPU; Colab quickstart notebook; model card on HF.
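For reference, the planned Python KV‑cache amounts to keeping each layer's past keys/values so each new token attends over cached history instead of re‑running the full prefix; a minimal sketch of the idea (not the planned API):

```python
# Minimal single-layer KV-cache idea for incremental decoding (illustrative).
import torch

class KVCache:
    def __init__(self):
        self.k = None  # (batch, heads, past_len, head_dim)
        self.v = None

    def append(self, k_new, v_new):
        """Concatenate this step's keys/values onto the cache; return full history."""
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

# Each generation step projects only the newest token to q, k_new, v_new,
# calls cache.append(k_new, v_new), and attends q against the cached k/v.
```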