Global research index
Search every project, paper, benchmark, note, experiment, and article
The local index works immediately. After static export, Pagefind indexes the generated HTML for production search on Vercel.
Grokking Beyond Addition: Circuit-Level Analysis of Algebraic Learning in Transformers
This work extends grokking analysis beyond modular addition to eight algebraic operations across abelian fields, a composite ring, and non-abelian groups. A controlled transformer setup isolates when memorized algorithms become reusable circuits and when representation complexity blocks generalization.
Adaptive Tensor-Network Compression of LLMs: An Extension of CompactifAI
This project reproduces and extends CompactifAI-style tensor-network compression on real open-weight LLMs. It profiles layer sensitivity, replaces uniform bond dimensions with adaptive schedules, and evaluates healing runs across standard language benchmarks.
FlashAttention-2 CUDA Kernel
Custom IO-aware GPU attention engine. Rebuild attention from the memory hierarchy upward and understand exactly where framework kernels spend bandwidth, registers, shared memory, and occupancy.
Project Chimera
700M-parameter hybrid Mamba-2 and Transformer LLM trained from scratch. Own the full path from tokenizer and architecture decisions through pretraining, reasoning fine-tuning, alignment, quantization, and local inference.
Speculative Decoding Runtime
OpenAI-compatible draft-verifier inference server. Increase tokens per second without changing the target model distribution, while keeping the serving API compatible with existing OpenAI-style clients.
vLLM: PR #38816
In disaggregated prefill, the pipeline hung when decode nodes missed KV cache tensors because request-ID formats did not match across the prefill-decode boundary. Implemented request-ID normalization at that boundary, refactored KV cache lookup semantics, and added targeted tests for matched and mismatched ID formats. This resolved indefinite hangs in distributed inference deployments and improved reliability for high-throughput disaggregated serving.
Speculative decoding
2.4x measured result in Mani Pal's research lab index.
FlashAttention-2 kernel
2.1x measured result in Mani Pal's research lab index.
Sparse MoE
2.3x measured result in Mani Pal's research lab index.
API p95 reduction
35% measured result in Mani Pal's research lab index.
MPO memory reduction
93% measured result in Mani Pal's research lab index.
Non-Abelian Grokking Capacity Ceiling
Can the same one-layer transformer that groks abelian operations grok non-abelian groups under longer training? Longer training alone did not cross the boundary; the failure is likely constrained by capacity or representation geometry rather than by optimizer patience.
Sparse MoE Scaling
Top-k routing, z-loss load balancing, and expert collapse analysis. Reproduce the practical failure modes of sparse MoE training, especially expert-utilization collapse, under a controlled matched-FLOP benchmark.
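As a rough illustration of the two router pieces named above, here is a minimal sketch of top-k gating and a router z-loss in plain Python. It assumes the commonly used z-loss form (mean squared log-sum-exp of the router logits) and renormalizes gate weights over the selected experts; the function names are hypothetical and this is not necessarily the configuration used in the benchmark.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def top_k_route(logits, k):
    """Pick the k experts with the largest logits for one token.

    Returns [(expert_index, gate_weight)], with gate weights given by a
    softmax restricted to the selected experts (an assumption of this sketch).
    """
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    norm = logsumexp([logits[i] for i in chosen])
    return [(i, math.exp(logits[i] - norm)) for i in chosen]


def router_z_loss(batch_logits):
    """Mean squared log-sum-exp of router logits; penalizes large logits."""
    return sum(logsumexp(row) ** 2 for row in batch_logits) / len(batch_logits)


# One token routed across four experts with k = 2.
route = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
print(route)
print(router_z_loss([[2.0, 0.5, 1.0, -1.0]]))
```

The z-loss only discourages large router logits; balancing load across experts usually needs a separate auxiliary term or capacity limit.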
VAANI
Hindi-first fully offline voice assistant. Build a local voice assistant that keeps speech, reasoning, and synthesis offline while preserving practical latency on consumer CPU hardware.
Credit Assignment in Spiking Neural Networks: Bridging Bioplausibility and Scalability
This investigation studies how scalable gradient methods and biologically plausible local learning rules diverge when training recurrent spiking neural networks on temporal tasks.
Uniform MPO Compression Collapse
Does a single global bond dimension preserve quality across all transformer layers? Layer sensitivity is too uneven for uniform MPO schedules to be the final compression policy.
Fixed-Gamma Speculative Decoding
Is a fixed draft lookahead window enough for CPU speculative decoding? Fixed gamma creates brittle prompt-family dependence.
Sparse MoE Router Entropy
Can routing entropy predict expert collapse before validation loss exposes it? Routing entropy is a leading diagnostic for MoE health.
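One simple way to turn this question into a number is the normalized Shannon entropy of the empirical expert-load distribution. The sketch below assumes hard top-1 assignments and a hypothetical `routing_entropy` helper; the experiment itself may measure entropy differently, for example over router probabilities.

```python
import math
from collections import Counter


def routing_entropy(expert_ids, num_experts):
    """Normalized entropy of expert load, in [0, 1].

    1.0 means tokens are spread evenly over all experts;
    0.0 means every token went to a single expert.
    """
    if num_experts < 2:
        raise ValueError("need at least two experts")
    if not expert_ids:
        raise ValueError("need at least one routed token")
    total = len(expert_ids)
    entropy = 0.0
    for count in Counter(expert_ids).values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy / math.log(num_experts)


balanced = routing_entropy([0, 1, 2, 3] * 8, num_experts=4)
collapsing = routing_entropy([0] * 31 + [1], num_experts=4)
print(balanced, collapsing)
```

Tracking this value per layer over training steps gives a curve that can be compared against validation loss.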
Activation Patching for Algebraic Circuits
Can causal patching separate memorized lookup behavior from algorithmic circuit behavior? This is the next causal validation layer for the grokking study.
KV Cache ID Normalization Tests
Can a unit-level reproduction catch prefill/decode request-ID mismatches before deployment hangs? Distributed inference bugs need contract tests around identifiers and transfer semantics.
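A contract test of this kind might look like the sketch below. Everything in it is hypothetical: the `prefill:` and `decode:` prefixes, `normalize_request_id`, and `KVTransferStore` are illustrative stand-ins, not vLLM's actual ID formats or APIs. The point is only that both sides normalize identifiers the same way and that a miss fails fast instead of waiting.

```python
def normalize_request_id(request_id):
    """Hypothetical normalizer: trim whitespace and drop a role prefix."""
    rid = request_id.strip()
    for prefix in ("prefill:", "decode:"):  # illustrative prefixes only
        if rid.startswith(prefix):
            rid = rid[len(prefix):]
            break
    if not rid:
        raise ValueError("empty request id")
    return rid


class KVTransferStore:
    """Toy stand-in for a KV cache handoff keyed by normalized request ID."""

    def __init__(self):
        self._blocks = {}

    def put(self, request_id, kv_block):
        self._blocks[normalize_request_id(request_id)] = kv_block

    def get(self, request_id):
        key = normalize_request_id(request_id)
        if key not in self._blocks:
            # Fail fast: a missing block should be an error, not a wait.
            raise KeyError(f"no KV block for request {key!r}")
        return self._blocks[key]


def test_mismatched_formats_still_match():
    store = KVTransferStore()
    store.put("prefill:req-42", "kv-block")
    assert store.get("decode:req-42") == "kv-block"
    assert store.get(" req-42 ") == "kv-block"


def test_unknown_id_fails_fast():
    store = KVTransferStore()
    try:
        store.get("decode:req-43")
    except KeyError:
        return
    raise AssertionError("expected KeyError for an unknown request id")


test_mismatched_formats_still_match()
test_unknown_id_fails_fast()
```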
Transformer Internals as a Systems Interface
The residual stream is the real systems interface of a transformer: training, inference, interpretability, and compression all negotiate with it.
Attention Mechanisms Under IO Pressure
The useful mental model for modern attention kernels is not the softmax equation; it is the path data takes through HBM, SRAM, registers, and warps.
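The data-path view can be made concrete with the online-softmax trick that tiled attention kernels rely on: keys and values are consumed one tile at a time while a running maximum and running sum keep the result exact. The sketch below is plain Python for a single query vector and omits the usual 1/sqrt(d) scaling; it illustrates the arithmetic only and does not model HBM, SRAM, or warps.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weight values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]


def tiled_attention(q, keys, values, tile_size):
    """Process keys/values in tiles, rescaling partial results as the max grows."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [dot(q, k) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first tile).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [2.0, -1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile_size=2))
```

Because only one tile of scores exists at a time, the full score row never has to be written out, which is what makes the memory traffic argument possible.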
CUDA Optimization Notes from an Attention Kernel
CUDA optimization is the discipline of making memory motion, register pressure, and occupancy legible enough to trade them deliberately.
Mamba Architectures in Hybrid LLM Training
Hybrid SSM-attention models are best treated as architectural experiments whose evaluation must cover long-context behavior, tokenizer behavior, and deployment cost together.
Sparse Models Fail Quietly Before They Fail Loudly
Sparse MoE systems can look healthy on loss curves while the router is already collapsing. Entropy and load metrics need to be first-class.
Inference Systems Are Acceptance-Rate Control Problems
Speculative decoding speedup is controlled by acceptance-rate dynamics, not merely by choosing a smaller draft model.
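A back-of-the-envelope model shows why acceptance rate dominates. Under the simplifying assumption that each drafted token is accepted independently with probability alpha, a draft window of gamma tokens yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens per verification step on average. The sketch below uses hypothetical helper names, and `draft_cost_ratio` (draft cost relative to one target forward pass) is an assumed parameter; real acceptance rates vary by prompt and position.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target-model verification step."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, gamma, draft_cost_ratio):
    """Idealized speedup over plain decoding, ignoring overheads."""
    step_cost = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


# Same draft model and window, different acceptance rates.
for alpha in (0.4, 0.6, 0.8):
    print(alpha, round(estimated_speedup(alpha, gamma=4, draft_cost_ratio=0.1), 2))
```

In this model a cheaper draft only lowers `draft_cost_ratio`; if it also lowers alpha, the estimate can fall rather than rise.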
Mechanistic Interpretability Needs Negative Results
Failed grokking runs are not noise; they can expose representation capacity boundaries when paired with the right spectral and causal diagnostics.