
An Interactive Guide to Rooflines

The Roofline Model

Algorithms spend time in two places:

  1. Computation, where time is spent performing arithmetic operations such as multiplying and adding floating point numbers. For example, an H100 GPU can perform 989 TFLOPs/s of TF32 operations.
  2. Communication, where time is spent moving data within the GPU (between the cores and the HBM) or across GPUs. The HBM bandwidth of an H100 is 3.35 TB/s, and GPU-to-GPU communication over Nvidia NVLink runs at 900 GB/s.

Let $T_{math}$ be the time spent in computation and $T_{comms}$ be the time spent in communication.

$$T_{math} = \frac{\text{Computation FLOPs}}{\text{Accelerator FLOPs/s}} \qquad T_{comms} = \frac{\text{Communication GB}}{\text{Network/Memory Bandwidth GB/s}}$$

We can lower-bound the time taken by an algorithm by assuming we can perfectly overlap communication with computation.

$$T_{lowerbound} = \max(T_{math}, T_{comms})$$

And we can upper-bound the time taken by assuming zero overlap.

$$T_{upperbound} = T_{math} + T_{comms}$$
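These bounds are easy to compute directly. A minimal sketch in Python, using the H100 SXM numbers quoted in this post (the constants and the function name are illustrative, not from any library):

```python
# Illustrative H100 SXM figures (assumptions taken from the text, not measurements).
PEAK_FLOPS = 1979e12  # bf16 dense peak, FLOPs/s
HBM_BW = 3.35e12      # HBM bandwidth, bytes/s

def time_bounds(flops, bytes_moved, peak=PEAK_FLOPS, bw=HBM_BW):
    """Return (T_lowerbound, T_upperbound) in seconds."""
    t_math = flops / peak       # time if compute were the only cost
    t_comms = bytes_moved / bw  # time if data movement were the only cost
    return max(t_math, t_comms), t_math + t_comms
```

For a kernel doing exactly one second of pure compute and one second of pure data movement, perfect overlap gives a 1 s lower bound while zero overlap gives a 2 s upper bound.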

We will optimize the algorithm with respect to $T_{lowerbound}$, i.e., maximizing the overlap between computation and communication.

Assuming we can perfectly overlap communication with computation, we will have one of the following two cases:

  1. $T_{comms} > T_{math}$: the algorithm is memory (or communication) bound. Some fraction of the accelerator FLOPs/s is wasted waiting for data movement.
  2. $T_{math} > T_{comms}$: the algorithm is compute bound. All the accelerator FLOPs/s are being used for computation, i.e., we see full utilization of the compute capability of the hardware.

A proxy measure for checking whether an algorithm is compute bound is its arithmetic intensity, which measures FLOPs per byte of data moved: the ratio of the FLOPs the algorithm performs to the number of bytes it needs to move.

$$\text{Arithmetic Intensity} = \frac{\text{Computation FLOPs}}{\text{Communication Bytes}}$$

We can derive the following simple theorem relating the arithmetic intensity of an algorithm to the intensity of the accelerator, which tells us whether an algorithm is compute bound:

$$\begin{aligned} T_{math} &> T_{comms} \\ \iff\quad \frac{\text{Computation FLOPs}}{\text{Accelerator FLOPs/s}} &> \frac{\text{Communication GB}}{\text{Network/Memory Bandwidth GB/s}} \\ \iff\quad \frac{\text{Computation FLOPs}}{\text{Communication GB}} &> \frac{\text{Accelerator FLOPs/s}}{\text{Network/Memory Bandwidth GB/s}} \\ \iff\quad \text{Arithmetic Intensity} &> \frac{\text{Accelerator FLOPs/s}}{\text{Network/Memory Bandwidth GB/s}} \\ \iff\quad \text{Arithmetic Intensity} &> \text{Accelerator Intensity} \end{aligned}$$
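The last line of the derivation is a one-line predicate in code. A sketch (the function names are mine, not from any library):

```python
def arithmetic_intensity(flops, bytes_moved):
    # FLOPs performed per byte of data moved.
    return flops / bytes_moved

def accelerator_intensity(peak_flops, bandwidth_bytes):
    # Peak FLOPs/s per byte/s of bandwidth: the ridge point of the roofline.
    return peak_flops / bandwidth_bytes

def is_compute_bound(flops, bytes_moved, peak_flops, bandwidth_bytes):
    # Compute bound iff arithmetic intensity exceeds accelerator intensity.
    return arithmetic_intensity(flops, bytes_moved) > accelerator_intensity(
        peak_flops, bandwidth_bytes
    )
```

For an H100 at bf16, `accelerator_intensity(1979e12, 3.35e12)` comes out to about 590.7 FLOPs/byte, the ridge value used later in this post.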
[Interactive figure: Anatomy of a roofline (step 1 of 5, "Two axes"). Arithmetic intensity on x (FLOPs per byte moved), attainable performance on y (FLOPs/s); both log-scaled. For the selected hardware and dtype: peak = 1979 TF/s, bw = 3.35 TB/s, ridge = 591 F/B.]

Arithmetic Intensity of Dot Product

The dot product of two vectors is a fundamental operation in many algorithms, including neural networks. Let's analyze the arithmetic intensity of the dot product and see whether it is compute bound on an H100 GPU.

The dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$ of size $N$ is defined as:

$$\mathbf{a}\cdot\mathbf{b} = \sum_{i = 1}^{N} a_{i}\cdot b_{i}$$

Let's find the arithmetic intensity of the dot product operation on vectors in bfloat16 precision. Each vector has $N$ elements requiring $2N$ bytes of memory. So, together the two vectors require $4N$ bytes of memory, and writing the result back requires 2 bytes. The dot product requires $N$ multiplications and $N - 1$ additions, which makes the total number of operations $2N - 1$.

$$\text{Arithmetic Intensity} = \frac{\text{Computation FLOPs}}{\text{Communication Bytes}} = \frac{2N - 1}{4N + 2}$$

As $N$ approaches infinity, the arithmetic intensity approaches $\frac{1}{2}$. In other words, the dot product operation does 0.5 FLOPs per byte of data moved. Let's compare this with the peak arithmetic intensity of the H100 SXM GPU.
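A quick numerical check that the intensity saturates at 1/2, using the bfloat16 accounting above:

```python
def dot_intensity(n):
    # bf16 dot product: 2N - 1 FLOPs over 4N + 2 bytes
    # (two input vectors plus a 2-byte result).
    return (2 * n - 1) / (4 * n + 2)

# Intensity climbs toward 0.5 but never reaches it.
for n in (10, 1_000, 1_000_000):
    print(n, round(dot_intensity(n), 6))
```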

$$\text{H100 Peak Arithmetic Intensity} = \frac{1979 \times 10^{12}\,\text{FLOPs/s}}{3.35 \times 10^{12}\,\text{Bytes/s}} = 590.7\,\text{FLOPs/Byte}$$

The arithmetic intensity of the dot product is far below the peak arithmetic intensity of the H100. This means the dot product is memory bound: its performance can be improved either by increasing the memory bandwidth or by increasing the arithmetic intensity of the algorithm.
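Plugging in a concrete size makes the imbalance stark. A sketch for a 1M-element bf16 dot product, reusing the illustrative H100 constants from above (the exact nanosecond figures depend on which peak and bandwidth values you assume):

```python
PEAK_FLOPS = 1979e12  # H100 bf16 dense peak, FLOPs/s (assumed from the text)
HBM_BW = 3.35e12      # HBM bandwidth, bytes/s

N = 1_000_000
t_math = (2 * N - 1) / PEAK_FLOPS  # ~1 ns of pure compute
t_comms = (4 * N + 2) / HBM_BW     # ~1.2 µs of pure data movement

# Even with perfect overlap the kernel runs at T_comms, over a thousand
# times longer than the compute alone: the dot product is firmly memory bound.
print(f"T_math = {t_math * 1e9:.2f} ns, T_comms = {t_comms * 1e6:.2f} µs")
```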

Let's plot both points on a single roofline: the dot product sits at 0.5 FLOPs/byte, while a matmul's intensity grows with N. Drag the slider to watch the matmul climb the memory slope and cross the ridge into the compute-bound regime.
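The matmul curve in that plot follows from the same accounting. For a square N×N bf16 matmul, 2N³ FLOPs move 6N² bytes (reading A and B and writing C at 2 bytes each), so its intensity grows as N/3. A sketch of where it crosses the ridge, with the same illustrative H100 constants as before:

```python
PEAK_FLOPS = 1979e12  # H100 bf16 dense peak, FLOPs/s
HBM_BW = 3.35e12      # HBM bandwidth, bytes/s

def matmul_intensity(n):
    # Square N x N bf16 matmul: 2N^3 FLOPs, 6N^2 bytes -> intensity N/3.
    return (2 * n**3) / (6 * n**2)

def attainable_flops(intensity, peak=PEAK_FLOPS, bw=HBM_BW):
    # The roofline itself: capped by the memory slope below the ridge,
    # by the compute roof above it.
    return min(peak, bw * intensity)

ridge = PEAK_FLOPS / HBM_BW  # ~591 FLOPs/byte
n_cross = 3 * ridge          # intensity N/3 reaches the ridge near N ~ 1772
print(f"N=64: {matmul_intensity(64):.1f} F/B, "
      f"{attainable_flops(matmul_intensity(64)) / 1e12:.0f} TF/s")
```

At N = 64 this gives about 21.3 FLOPs/byte and roughly 71 TF/s of attainable performance, still well below the ridge; only past N of roughly 1800 does the matmul become compute bound.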

[Interactive figure: Dot product vs. matmul on the same roofline. Same multiply-and-add, different data organization. Slide N and watch matmul cross the ridge (591 F/B) while dot stays stuck at 0.50 F/B. At N = 64 the matmul sits at 21.3 F/B → 71 TF/s; it crosses the ridge at N ≈ 1773.]

Time breakdown (assuming perfect overlap):

| Kernel | T_math | T_comms | Which dominates |
| --- | --- | --- | --- |
| dot (1M) | 1.06 ns | 1.25 µs | comms, by ~1200× |
| matmul (N=64) | 0.26 ns | 7.34 ns | comms, by 28× |

Matmul is still memory-bound at N = 64: T_comms (7.34 ns) dominates T_math (0.26 ns). Crank N higher to cross the ridge.
