Series · 1 part · In progress
CPU Performance Engineering
Optimizing SGEMM and prefix sum kernels from naive implementations to within 50% of hardware peak on an Apple M4 Pro.
In this ongoing series, I take two computational kernels, SGEMM (single-precision matrix multiply) and prefix sum, from naive implementations to within 50% of hardware peak on a single P-core of my M4 Pro.
The focus is on developing performance discipline: measure, hypothesize about the bottleneck, apply one optimization, re-measure, and verify whether the hypothesis was right.
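As a rough illustration of the "measure" step in that loop, here is a minimal C sketch that times a naive SGEMM and converts the result to GFLOP/s and a fraction of peak. It is not the code used in the series: the function names (`sgemm_naive`, `time_sgemm_best`, `sgemm_gflops`, `fraction_of_peak`) are hypothetical, and `peak_gflops` is a placeholder you would fill in from your own machine baseline rather than a number taken from this page.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime on strict compilers */
#include <stddef.h>
#include <time.h>

/* Naive row-major SGEMM: c = a * b, all matrices n x n (illustrative). */
void sgemm_naive(size_t n, const float *a, const float *b, float *c) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++) {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* Wall-clock seconds from a monotonic clock. */
double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Best (minimum) time over reps runs, to reduce noise from the OS. */
double time_sgemm_best(size_t n, const float *a, const float *b,
                       float *c, int reps) {
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        double t0 = now_seconds();
        sgemm_naive(n, a, b, c);
        double dt = now_seconds() - t0;
        if (best < 0.0 || dt < best) {
            best = dt;
        }
    }
    return best;
}

/* An n x n SGEMM does 2*n^3 FLOPs: one multiply and one add per
   inner-loop iteration. */
double sgemm_gflops(size_t n, double seconds) {
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds * 1e-9;
}

/* peak_gflops is an assumed, caller-supplied ceiling (hypothetical
   parameter); measure it on your own machine. */
double fraction_of_peak(double achieved_gflops, double peak_gflops) {
    return achieved_gflops / peak_gflops;
}
```

A typical use would be to call `time_sgemm_best` before and after a single optimization, feed each time into `sgemm_gflops`, and compare the two `fraction_of_peak` values against what the hypothesis predicted. Taking the minimum over several runs is one common choice for reducing timing noise; a median would be another reasonable option.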
Parts
- Part 1
Machine Baseline for CPU Performance Engineering on an M4 Pro
Establishing single-core FP32 compute, DRAM bandwidth, and cache hierarchy ceilings on Apple M4 Pro as denominators for kernel optimization.