Scaling LLM Pretraining
A systems performance worklog on scaling LLM pretraining — starting on a single GPU and progressively exploring data, tensor, and pipeline parallelism.
An ongoing series on the systems-performance side of pretraining language models. Each part scales the setup further: first a single-GPU training run with tight MFU discipline, then data parallelism, tensor parallelism, pipeline parallelism, and the overlap and communication tradeoffs that come with each.
The focus is measurement-first: feasibility math before code, throughput instrumentation over vibes, and an honest accounting of where the FLOPs actually go.
- Part 1
Training a 360M Parameter Model with Performance Discipline
Pretraining SmolLM-360M on a single A100 within a 30-hour budget: feasibility analysis, throughput measurement, and hardware-efficiency optimization.
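The feasibility math mentioned above can be sketched with the standard ~6ND approximation for training FLOPs. This is a hedged back-of-the-envelope estimate, not the series' actual numbers: the 40% MFU target and the token budget it implies are assumptions for illustration.

```python
# Back-of-the-envelope feasibility check for a 30-hour single-A100 run.
# All numbers are illustrative assumptions, not results from the series.
N = 360e6            # parameter count (SmolLM-360M)
peak_flops = 312e12  # A100 dense BF16 peak, FLOP/s
mfu = 0.40           # assumed model FLOPs utilization target
hours = 30           # wall-clock budget

total_flops = peak_flops * mfu * hours * 3600
tokens = total_flops / (6 * N)  # training FLOPs ~ 6 * params * tokens
print(f"~{tokens / 1e9:.1f}B tokens fit in the budget")
# → ~6.2B tokens fit in the budget
```

Running the estimate in reverse (fixing the token count and solving for hours) tells you up front whether a planned run even fits the hardware window, before writing any training code.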