Attention Mechanisms / 2026-01
Attention Mechanisms Under IO Pressure
The useful mental model for modern attention kernels is not the softmax equation; it is the path data takes through HBM, SRAM, registers, and warps.
10 min read
Attention, FlashAttention, CUDA
Outline
- Why materialized attention is a memory problem.
- Online softmax as the correctness boundary.
- Causal masking and tile scheduling.
- Benchmarking kernel work without fooling yourself.