SprintLM

A tiny GPT you can read end‑to‑end.

A from‑scratch GPT‑style Transformer trained on TinyStories in about an hour on a single T4. Decoder‑only, with RoPE positional encoding, pre‑RMSNorm, and SwiGLU feed‑forward layers. The code favors clarity over cleverness; the Python path intentionally omits a KV‑cache for readability.

Source on GitHub.

Run headline: val loss 2.713 → ppl 15.07; tokens 23.04M; throughput ~7.5k tok/s; 12 layers · d384 · 6 heads.
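The token count follows directly from the run config: 5,000 steps × 24 sequences per batch × 192 tokens per sequence = 23,040,000 ≈ 23.04M tokens.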

Architecture

vocab: ≈50k (GPT‑2 BPE via tiktoken)
d_model: 384
layers: 12
heads: 6
d_ff: 1536 (SwiGLU)
context: 192 tokens
pos enc: RoPE (θ=10000)
norm: RMSNorm (pre‑norm)
attn: masked MHA, causal
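
For orientation, here is a minimal sketch of one pre‑norm block with RMSNorm and SwiGLU at these shapes. It is illustrative rather than the repository's actual modules: it leans on torch.nn.MultiheadAttention and omits RoPE, which the real model applies to the queries and keys inside attention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMS normalization: scale by 1/rms(x) with a learned gain; no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.gain

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) )."""
    def __init__(self, d_model=384, d_ff=1536):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Block(nn.Module):
    """Pre-norm decoder block: x + attn(norm(x)), then x + ffn(norm(x))."""
    def __init__(self, d_model=384, n_heads=6):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, bias=False, batch_first=True)
        self.norm2 = RMSNorm(d_model)
        self.ffn = SwiGLU(d_model)

    def forward(self, x):
        L = x.size(1)
        # Boolean causal mask: True marks positions a query may NOT attend to (the future).
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + a
        return x + self.ffn(self.norm2(x))

x = torch.randn(2, 192, 384)   # (batch, context, d_model)
print(Block()(x).shape)        # torch.Size([2, 192, 384])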

Training

optimizer: AdamW
lr: 3e‑4
betas: (0.9, 0.95)
weight decay: 0.1
eps: 1e‑8
schedule: warmup 100 steps → cosine decay
batch: 24
steps: 5000
seed: 7
device: 1× T4 (free Colab)
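
A minimal sketch of this setup with stock PyTorch; the helper name and the decay floor are illustrative (the floor is not specified above), and `model` is a stand-in for the real network.

import math
import torch
import torch.nn as nn

def lr_at(step, base_lr=3e-4, warmup=100, max_steps=5000):
    """Linear warmup for `warmup` steps, then cosine decay to 0 (floor assumed, not from the run config)."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

model = nn.Linear(384, 384)  # stand-in for the Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)

for step in range(5000):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)   # set the scheduled rate before each step
    # forward pass, loss.backward(), and optimizer.step() go here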

Results

Figure: training loss over 5,000 steps (batch 24, context 192).
Figure: validation loss on TinyStories; final 2.713 → ppl 15.07.
Figure: throughput (tokens/s) on a T4; dips correspond to eval/checkpoint saves.

Sample

Generated from the prompt "Once upon a time, " with temperature 0.8, top‑k 40:

Once upon a time, there was a little girl named Lily. She loved to play with her toys all day long. One day, she found a special toy with a big rock. She was very excited to go on a trip to the park. Lily asked her mom, "Why is the same thing in the kitchen?" Lily replied, "No, it's a new boat. It's a nice car that looks like a toy. It's very cool for you." Lily looked at the book and saw the ball. She looked at it and said, "Lily,
prompt: "Once upon a time, "

Reproduce

Prepare dataset

python tools/prepare_hf.py \
  dataset_name=roneneldan/TinyStories text_field=text \
  out_dir=data/Prepared/TinyStoriesHF append_eot=true \
  dtype=uint16 tokenizer_name=gpt2
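
The prepared .bin files are, roughly, flat streams of GPT‑2 token ids. A hedged sketch of that encoding with tiktoken (the real tools/prepare_hf.py also downloads the HF dataset and handles the train/val split; the texts and output path below are illustrative):

import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")                   # ~50k-token GPT-2 BPE vocabulary
texts = ["Once upon a time, there was a cat.", "The cat took a nap."]  # stand-in documents

ids = []
for t in texts:
    ids.extend(enc.encode_ordinary(t))
    ids.append(enc.eot_token)                         # append_eot=true: <|endoftext|> between docs

np.array(ids, dtype=np.uint16).tofile("train.bin")    # dtype=uint16: GPT-2 ids all fit below 65536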

Train model

python -m sprintlm.train.train \
  dataset.train_bin=data/Prepared/TinyStoriesHF/train.bin \
  dataset.val_bin=data/Prepared/TinyStoriesHF/val.bin \
  train.batch_size=24 dataset.context_length=192 \
  train.max_steps=5000 train.log_every=50 \
  train.eval_every=250 train.save_every=250

Generate & visualize

python tools/plot_metrics.py outputs
python tools/decode_cli.py --ckpt outputs/<RUN>/ckpts/step5000.pt \
  --prompt "Once upon a time, " --context_length 192 \
  --temperature 0.8 --top_k 40

Platform notes: on an M4 MacBook (16 GB), use batch_size≈8‑12 with the MPS backend. An A100/H100 can scale to batch_size≥64 and context≥512.
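
If you are adapting the commands above to one of these machines, a small sketch of picking the device (assuming the trainer accepts a standard torch device string):

import torch

def pick_device():
    """Prefer CUDA, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())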

Limits & next

Limits. Trained only on TinyStories (child‑story style); not instruction‑tuned; the Python decoder is intentionally minimal (no KV‑cache); GPT‑2 BPE tokenizer only; no dataset filtering or dedup.

Next. Minimal C++17 CPU inference (weights export → single‑file greedy decoder); optional safetensors release; KV‑cache in Python for longer generations; basic quantization (int8/gguf) and latency measurements on M4 CPU; Colab quickstart notebook; model card on HF.