Writing
Technical essays with equations, code, and citations
These notes are written for readers who already understand modern ML systems. They emphasize mechanisms, tradeoffs, and reproducible engineering observations.
Transformer Internals as a Systems Interface
The residual stream is the real systems interface of a transformer: training, inference, interpretability, and compression all negotiate with it.
Attention Mechanisms Under IO Pressure
The useful mental model for modern attention kernels is not the softmax equation; it is the path data takes through HBM, SRAM, registers, and warps.
CUDA Optimization Notes from an Attention Kernel
CUDA optimization is the discipline of making memory motion, register pressure, and occupancy legible enough to trade them deliberately.
Mamba Architectures in Hybrid LLM Training
Hybrid SSM-attention models are best treated as architectural experiments whose evaluation must cover long-context behavior, tokenizer behavior, and deployment cost together.
Sparse Models Fail Quietly Before They Fail Loudly
Sparse MoE systems can look healthy on loss curves while the router is already collapsing. Router entropy and per-expert load need to be tracked as first-class training metrics alongside the loss.
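As a rough illustration of what such metrics might look like, the sketch below computes a normalized entropy of the average routing distribution and a top-1 load ratio from a batch of per-token router probabilities. The function name `router_metrics`, the top-1 routing assumption, and the choice of these two particular metrics are illustrative assumptions, not a description of any specific system.

```python
import math

def router_metrics(router_probs):
    """Summarize router health for one batch (hypothetical helper).

    router_probs: list of per-token probability vectors over experts,
    each summing to 1. Assumes top-1 routing for the load statistic.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # Average routing distribution over the batch.
    mean_probs = [
        sum(p[e] for p in router_probs) / num_tokens for e in range(num_experts)
    ]

    # Entropy of the average distribution, normalized to [0, 1].
    # Values near 1 mean experts share traffic; values near 0 suggest collapse.
    entropy = -sum(q * math.log(q) for q in mean_probs if q > 0)
    normalized_entropy = entropy / math.log(num_experts)

    # Fraction of tokens whose top-1 choice is each expert.
    counts = [0] * num_experts
    for p in router_probs:
        counts[max(range(num_experts), key=p.__getitem__)] += 1
    load = [c / num_tokens for c in counts]

    # 1.0 is perfectly balanced; num_experts means one expert takes everything.
    max_load_ratio = max(load) * num_experts
    return normalized_entropy, load, max_load_ratio


balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed = [[0.97, 0.01, 0.01, 0.01]] * 4

print(router_metrics(balanced))
print(router_metrics(collapsed))
```

Both batches could sit under a similar loss value, yet the second sends every token to one expert, which is the kind of quiet failure a loss curve alone would not show.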
Inference Systems Are Acceptance-Rate Control Problems
Speculative decoding speedup is controlled by acceptance-rate dynamics, not merely by choosing a smaller draft model.
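A back-of-envelope sketch makes the dependence concrete. It assumes the usual simplified model in which each drafted token is accepted independently with probability `alpha`, the draft proposes `gamma` tokens per step, and one draft forward pass costs `cost_ratio` times a target forward pass. The function names and these parameters are assumptions for illustration; real acceptance rates vary by position and prompt.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target-model verification step.

    Under i.i.d. acceptance this is the geometric sum
    1 + alpha + ... + alpha**gamma.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Idealized wall-clock speedup over plain autoregressive decoding.

    Each step costs gamma draft passes plus one target pass,
    measured in units of one target pass.
    """
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


# Same cheap draft model, different acceptance rates.
for alpha in (0.3, 0.5, 0.8):
    print(alpha, round(expected_speedup(alpha, gamma=4, cost_ratio=0.05), 3))

# A draft that is not cheap enough and rarely accepted is a net slowdown.
print(expected_speedup(alpha=0.3, gamma=4, cost_ratio=0.3) < 1.0)
```

Under this model a small draft buys little if `alpha` is low, and pushing `gamma` higher only pays off when acceptance stays high, which is why the acceptance rate is the quantity to control.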
Mechanistic Interpretability Needs Negative Results
Failed grokking runs are not noise; they can expose representation capacity boundaries when paired with the right spectral and causal diagnostics.