Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

Attention Mechanisms / 2026-01

Attention Mechanisms Under IO Pressure

The useful mental model for modern attention kernels is not the softmax equation; it is the path data takes through HBM, SRAM, registers, and warps.

10 min read

Attention, FlashAttention, CUDA

Outline

  • Why materialized attention is a memory problem.
  • Online softmax as the correctness boundary.
  • Causal masking and tile scheduling.
  • Benchmarking kernel work without fooling yourself.

Equation

$$\mathrm{softmax}(QK^T)V$$
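To make the outline's "online softmax as the correctness boundary" concrete, here is a minimal sketch in plain Python rather than CUDA. It computes the expression exactly as written here, with no scaling factor. The names `naive_attention`, `tiled_attention`, and `tile` are illustrative assumptions, not taken from any library. The first function builds a full row of scores before normalizing. The second visits K and V one tile at a time and carries only a running max, a running denominator, and an unnormalized output per query row. That small state is what lets a kernel avoid writing the score matrix back to HBM.

```python
import math


def naive_attention(Q, K, V):
    """Reference: computes every score of a row before normalizing."""
    d_v = len(V[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)  # subtract the row max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(d_v)])
    return out


def tiled_attention(Q, K, V, tile=2):
    """Online softmax: one pass over K/V in tiles of `tile` rows.

    Per query row the only carried state is a running max `m`, a running
    denominator `l`, and an unnormalized output accumulator `acc`.
    """
    d_v = len(V[0])
    out = []
    for q in Q:
        m = -math.inf
        l = 0.0
        acc = [0.0] * d_v
        for start in range(0, len(K), tile):
            k_tile = K[start:start + tile]
            v_tile = V[start:start + tile]
            scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new max.
            # On the first tile m is -inf, so the factor is exp(-inf) = 0.0.
            scale = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = l * scale + sum(p)
            acc = [a * scale + sum(p_i * v[j] for p_i, v in zip(p, v_tile))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

For any tile size, `tiled_attention(Q, K, V, tile=t)` should agree with `naive_attention(Q, K, V)` up to floating-point rounding. That agreement is the property a fused kernel has to preserve when it reorders the work for SRAM and registers.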

References