
An Interactive Guide to Rooflines

The Roofline Model

Algorithms spend time in two places:

  1. Computation, where time is spent performing arithmetic operations such as multiplying and adding floating point numbers. For example, an H100 GPU can perform 989 TFLOPs/s of TF32 operations.
  2. Communication, where time is spent moving data within the GPU (between the cores and the HBM) or across GPUs. The HBM bandwidth of an H100 is 3.35 TB/s, and GPU-to-GPU communication over Nvidia NVLink runs at 900 GB/s.

Let $T_{math}$ be the time spent in computation and $T_{comms}$ be the time spent in communication.

$$T_{math} = \frac{\text{Computation FLOPs}}{\text{Accelerator FLOPs/s}} \qquad T_{comms} = \frac{\text{Communication GB}}{\text{Network/Memory Bandwidth GB/s}}$$

We can lower-bound the time taken by an algorithm by assuming we can perfectly overlap communication with computation.

$$T_{lowerbound} = \max(T_{math}, T_{comms})$$

And we can upper-bound the time taken by assuming zero overlap.

$$T_{upperbound} = T_{math} + T_{comms}$$
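These bounds are easy to compute directly. A minimal sketch in Python, using the H100 SXM numbers quoted in this post (the constants and the function name are illustrative, not from any library):

```python
# Illustrative H100 SXM figures (assumptions taken from the text, not measurements).
PEAK_FLOPS = 1979e12  # bf16 dense peak, FLOPs/s
HBM_BW = 3.35e12      # HBM bandwidth, bytes/s

def time_bounds(flops, bytes_moved, peak=PEAK_FLOPS, bw=HBM_BW):
    """Return (T_lowerbound, T_upperbound) in seconds."""
    t_math = flops / peak       # time if compute were the only cost
    t_comms = bytes_moved / bw  # time if data movement were the only cost
    return max(t_math, t_comms), t_math + t_comms
```

For a kernel doing exactly one second of pure compute and one second of pure data movement, perfect overlap gives a 1 s lower bound while zero overlap gives a 2 s upper bound.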

We will optimize the algorithm with respect to $T_{lowerbound}$, i.e., maximizing the overlap between computation and communication.

Assuming we can perfectly overlap communication with computation, we will have one of the following two cases:

  1. $T_{comms} > T_{math}$: the algorithm is memory (or communication) bound. Some fraction of the accelerator FLOPs/s is wasted waiting for data movement.
  2. $T_{math} > T_{comms}$: the algorithm is compute bound. All the accelerator FLOPs/s are being used for computation, i.e., we see full utilization of the compute capability of the hardware.

A proxy measure for checking whether an algorithm is compute bound is its arithmetic intensity, which measures FLOPs per byte of data moved: the ratio of the FLOPs the algorithm performs to the number of bytes it needs to move.

$$\text{Arithmetic Intensity} = \frac{\text{Computation FLOPs}}{\text{Communication Bytes}}$$

We can derive the following simple theorem relating the arithmetic intensity of an algorithm to the intensity of the accelerator, which tells us whether an algorithm is compute bound:

$$\begin{aligned} T_{math} &> T_{comms} \\ \iff\quad \frac{\text{Computation FLOPs}}{\text{Accelerator FLOPs/s}} &> \frac{\text{Communication GB}}{\text{Network/Memory Bandwidth GB/s}} \\ \iff\quad \frac{\text{Computation FLOPs}}{\text{Communication GB}} &> \frac{\text{Accelerator FLOPs/s}}{\text{Network/Memory Bandwidth GB/s}} \\ \iff\quad \text{Arithmetic Intensity} &> \frac{\text{Accelerator FLOPs/s}}{\text{Network/Memory Bandwidth GB/s}} \\ \iff\quad \text{Arithmetic Intensity} &> \text{Accelerator Intensity} \end{aligned}$$
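The last line of the derivation is a one-line predicate in code. A sketch (the function names are mine, not from any library):

```python
def arithmetic_intensity(flops, bytes_moved):
    # FLOPs performed per byte of data moved.
    return flops / bytes_moved

def accelerator_intensity(peak_flops, bandwidth_bytes):
    # Peak FLOPs/s per byte/s of bandwidth: the ridge point of the roofline.
    return peak_flops / bandwidth_bytes

def is_compute_bound(flops, bytes_moved, peak_flops, bandwidth_bytes):
    # Compute bound iff arithmetic intensity exceeds accelerator intensity.
    return arithmetic_intensity(flops, bytes_moved) > accelerator_intensity(
        peak_flops, bandwidth_bytes
    )
```

For an H100 at bf16, `accelerator_intensity(1979e12, 3.35e12)` comes out to about 590.7 FLOPs/byte, the ridge value used later in this post.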
[Interactive figure: Anatomy of a roofline (step 1 of 5, "Two axes"). Arithmetic intensity on x (FLOPs per byte moved), attainable performance on y (FLOPs/s); both log-scaled. For the selected hardware and dtype: peak = 1979 TF/s, bw = 3.35 TB/s, ridge = 591 F/B.]

Arithmetic Intensity of Dot Product

The dot product of two vectors is a fundamental operation in many algorithms, including neural networks. Let's analyze the arithmetic intensity of the dot product and see whether it is compute bound on an H100 GPU.

The dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$ of size $N$ is defined as:

$$\mathbf{a}\cdot\mathbf{b} = \sum_{i = 1}^{N} a_{i}\cdot b_{i}$$

Let's find the arithmetic intensity of the dot product operation on vectors in bfloat16 precision. Each vector has $N$ elements requiring $2N$ bytes of memory. So, together the two vectors require $4N$ bytes of memory, and writing the result back requires 2 bytes. The dot product requires $N$ multiplications and $N - 1$ additions, which makes the total number of operations $2N - 1$.

$$\text{Arithmetic Intensity} = \frac{\text{Computation FLOPs}}{\text{Communication Bytes}} = \frac{2N - 1}{4N + 2}$$

As $N$ approaches infinity, the arithmetic intensity approaches $\frac{1}{2}$. In other words, the dot product operation does 0.5 FLOPs per byte of data moved. Let's compare this with the peak arithmetic intensity of the H100 SXM GPU.
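A quick numerical check that the intensity saturates at 1/2, using the bfloat16 accounting above:

```python
def dot_intensity(n):
    # bf16 dot product: 2N - 1 FLOPs over 4N + 2 bytes
    # (two input vectors plus a 2-byte result).
    return (2 * n - 1) / (4 * n + 2)

# Intensity climbs toward 0.5 but never reaches it.
for n in (10, 1_000, 1_000_000):
    print(n, round(dot_intensity(n), 6))
```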

$$\text{H100 Peak Arithmetic Intensity} = \frac{1979 \times 10^{12}\,\text{FLOPs/s}}{3.35 \times 10^{12}\,\text{Bytes/s}} = 590.7\,\text{FLOPs/Byte}$$

The arithmetic intensity of the dot product is far below the peak arithmetic intensity of the H100. This means the dot product is memory bound: its performance can be improved either by increasing the memory bandwidth or by increasing the arithmetic intensity of the algorithm.
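Plugging in a concrete size makes the imbalance stark. A sketch for a 1M-element bf16 dot product, reusing the illustrative H100 constants from above (the exact nanosecond figures depend on which peak and bandwidth values you assume):

```python
PEAK_FLOPS = 1979e12  # H100 bf16 dense peak, FLOPs/s (assumed from the text)
HBM_BW = 3.35e12      # HBM bandwidth, bytes/s

N = 1_000_000
t_math = (2 * N - 1) / PEAK_FLOPS  # ~1 ns of pure compute
t_comms = (4 * N + 2) / HBM_BW     # ~1.2 µs of pure data movement

# Even with perfect overlap the kernel runs at T_comms, over a thousand
# times longer than the compute alone: the dot product is firmly memory bound.
print(f"T_math = {t_math * 1e9:.2f} ns, T_comms = {t_comms * 1e6:.2f} µs")
```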

Let's plot both points on a single roofline: the dot product sits at 0.5 FLOPs/byte, while a matmul's intensity grows with N. Drag the slider to watch the matmul climb the memory slope and cross the ridge into the compute-bound regime.
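The matmul curve in that plot follows from the same accounting. For a square N×N bf16 matmul, 2N³ FLOPs move 6N² bytes (reading A and B and writing C at 2 bytes each), so its intensity grows as N/3. A sketch of where it crosses the ridge, with the same illustrative H100 constants as before:

```python
PEAK_FLOPS = 1979e12  # H100 bf16 dense peak, FLOPs/s
HBM_BW = 3.35e12      # HBM bandwidth, bytes/s

def matmul_intensity(n):
    # Square N x N bf16 matmul: 2N^3 FLOPs, 6N^2 bytes -> intensity N/3.
    return (2 * n**3) / (6 * n**2)

def attainable_flops(intensity, peak=PEAK_FLOPS, bw=HBM_BW):
    # The roofline itself: capped by the memory slope below the ridge,
    # by the compute roof above it.
    return min(peak, bw * intensity)

ridge = PEAK_FLOPS / HBM_BW  # ~591 FLOPs/byte
n_cross = 3 * ridge          # intensity N/3 reaches the ridge near N ~ 1772
print(f"N=64: {matmul_intensity(64):.1f} F/B, "
      f"{attainable_flops(matmul_intensity(64)) / 1e12:.0f} TF/s")
```

At N = 64 this gives about 21.3 FLOPs/byte and roughly 71 TF/s of attainable performance, still well below the ridge; only past N of roughly 1800 does the matmul become compute bound.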

[Interactive figure: Dot product vs. matmul on the same roofline. Same multiply-and-add, different data organization. Slide N and watch matmul cross the ridge (591 F/B) while dot stays stuck at 0.50 F/B. At N = 64 the matmul sits at 21.3 F/B → 71 TF/s; it crosses the ridge at N ≈ 1773.]

Time breakdown (assuming perfect overlap):

| Kernel | T_math | T_comms | Which dominates |
| --- | --- | --- | --- |
| dot (1M) | 1.06 ns | 1.25 µs | comms, by ~1200× |
| matmul (N=64) | 0.26 ns | 7.34 ns | comms, by 28× |

Matmul is still memory-bound at N = 64: T_comms (7.34 ns) dominates T_math (0.26 ns). Crank N higher to cross the ridge.
