Scaling LLM Pretraining
A systems performance worklog on scaling LLM pretraining — starting on a single GPU and progressively exploring data, tensor, and pipeline parallelism.
An ongoing series on the systems-performance side of pretraining language models. Each part scales the setup further: first a single-GPU training run with tight MFU discipline, then data parallelism, tensor parallelism, pipeline parallelism, and the overlap and communication tradeoffs that come with each.
The focus is measurement-first: feasibility math before code, throughput instrumentation over vibes, and an honest accounting of where the FLOPs actually go.
- Part 1
Training a 360M Parameter Model with Performance Discipline
Pretraining SmolLM-360M on a single A100 within a 30-hour budget: feasibility analysis, throughput measurement, and hardware-efficiency optimization.
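The feasibility math mentioned above can be sketched with the standard ~6ND approximation for training FLOPs. This is a hedged back-of-the-envelope estimate, not the series' actual numbers: the 40% MFU target and the token budget it implies are assumptions for illustration.

```python
# Back-of-the-envelope feasibility check for a 30-hour single-A100 run.
# All numbers are illustrative assumptions, not results from the series.
N = 360e6            # parameter count (SmolLM-360M)
peak_flops = 312e12  # A100 dense BF16 peak, FLOP/s
mfu = 0.40           # assumed model FLOPs utilization target
hours = 30           # wall-clock budget

total_flops = peak_flops * mfu * hours * 3600
tokens = total_flops / (6 * N)  # training FLOPs ~ 6 * params * tokens
print(f"~{tokens / 1e9:.1f}B tokens fit in the budget")
# → ~6.2B tokens fit in the budget
```

Running the estimate in reverse (fixing the token count and solving for hours) tells you up front whether a planned run even fits the hardware window, before writing any training code.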