LLM systems · CUDA kernels · Inference

LLM inference, made measurably faster.

I work where the model meets the metal: attention kernels, speculative decoding, KV-cache behavior, compression, and the serving stack around them. Every claim on this site links to code, profiling traces, and benchmarks.

Explore the systems · Hire me for a sprint

Attention throughput: 2.1×
Decoding speedup: 2.4×
Memory reduction: 93%
LLM from scratch: 700M

fig. 01 — tiled causal attention sweep

Br=64 · Bc=64 · SRAM-resident · HBM traffic O(N²) → O(N)

read the kernel write-up →
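
The tiling idea in the figure can be sketched in plain Python. This is a minimal illustration, not the CUDA kernel itself: it assumes the standard online-softmax formulation, where each block of `br` query rows keeps a running max, denominator, and output accumulator while key/value tiles of width `bc` stream past, so the full N×N score matrix is never stored. The function names are hypothetical.

```python
import math

def naive_causal_attention(q, k, v):
    """Reference: builds every full score row (the O(N^2) intermediate)."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        s = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in range(i + 1)]
        m = max(s)
        p = [math.exp(x - m) for x in s]
        z = sum(p)
        out.append([sum(p[j] * v[j][c] for j in range(i + 1)) / z for c in range(len(v[0]))])
    return out

def tiled_causal_attention(q, k, v, br=64, bc=64):
    """Same result, visiting keys/values one (br x bc) tile at a time."""
    n, d, dv = len(q), len(q[0]), len(v[0])
    scale = 1.0 / math.sqrt(d)
    out = [None] * n
    for r0 in range(0, n, br):
        rows = range(r0, min(r0 + br, n))
        m = {i: -math.inf for i in rows}      # running row max
        l = {i: 0.0 for i in rows}            # running softmax denominator
        acc = {i: [0.0] * dv for i in rows}   # running unnormalised output
        for c0 in range(0, n, bc):
            if c0 > rows[-1]:
                break                         # tile lies entirely in the causal future
            for i in rows:
                hi = min(c0 + bc, n, i + 1)   # causal mask: keys j <= i only
                if hi <= c0:
                    continue
                s = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in range(c0, hi)]
                m_new = max(m[i], max(s))
                corr = math.exp(m[i] - m_new)  # rescale earlier tiles to the new max
                p = [math.exp(x - m_new) for x in s]
                l[i] = l[i] * corr + sum(p)
                acc[i] = [acc[i][c] * corr + sum(p[t] * v[c0 + t][c] for t in range(len(p)))
                          for c in range(dv)]
                m[i] = m_new
        for i in rows:
            out[i] = [a / l[i] for a in acc[i]]
    return out
```

Only the per-row statistics and one tile of scores are live at any time, which is what lets a real kernel keep the working set in SRAM.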

45-second signals

Six things worth your scroll

700M LLM from scratch

Hybrid Mamba-2 and Transformer model with GRPO, DPO, GGUF, and CPU inference logs.

FlashAttention-2 CUDA kernels

Tiled IO-aware attention kernel profiled at 2.1x over PyTorch SDPA on A100.

2.4x speculative decoding

Draft-verifier runtime with temperature-corrected rejection sampling and adaptive gamma.
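
A minimal sketch of the verification step, assuming the standard speculative sampling rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalised residual max(0, p − q). "Temperature-corrected" is read here as applying the same temperature to both distributions before the test, which is an assumption; the function names are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """softmax(logits / T), computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    e = [math.exp(x - m) for x in scaled]
    z = sum(e)
    return [x / z for x in e]

def verify_draft_token(token, p_target, q_draft, rng):
    """Return (emitted_token, accepted) for one drafted token."""
    # Accept with probability min(1, p/q); q > 0 because the token was drawn from q.
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token, True
    # On rejection, resample from the part of p that q under-covers.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        return token, True  # p == q, nothing to correct
    return rng.choices(range(len(residual)), weights=residual)[0], False
```

Emitted tokens then follow the target distribution exactly, whatever the draft model proposes, which is why the speedup costs no output quality.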

CompactifAI extension

Adaptive tensor-network compression with layer sensitivity profiling and policy search.

vLLM contribution

Fixed a request-ID bug in disaggregated prefill KV cache transfer, with production reliability impact.

Published interpretability research

Circuit-level grokking study published on Zenodo and prepared for arXiv submission.

Published Research

Mechanistic interpretability, tensor-network compression, and SNN credit assignment.

Open Source Contributions

vLLM

Distributed inference reliability patch in disaggregated prefill KV transfer.

Systems Built

LLM training, CUDA kernels, inference runtimes, sparse MoE, and offline voice AI.

Benchmarks

2.4x speculative decoding speedup, 2.1x attention throughput, 93% memory reduction.

Current Investigations

Open questions across compression, routing collapse, grokking, and online learning.

Timeline

Milestones

Research, systems, benchmarks, and open-source work across the lab record.

2023

System

Production RAG and model adaptation

Delivered legal-tech RAG and QLoRA fine-tuning work before focusing fully on LLM systems.

2024

System

Project Chimera begins

Designed a 700M hybrid Mamba-2 and Transformer LLM, tokenizer, training schedule, and evaluation path.

System

VAANI offline assistant

Built a Hindi-first offline voice stack with wake word, ASR, local LLM, and TTS under 800ms.

2025

Benchmark

Speculative decoding reaches 2.4x

Implemented draft-verifier inference with adaptive lookahead and distribution-preserving sampling.
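
One simple way to adapt the lookahead is sketched below. The policy is hypothetical and not taken from the runtime described here: grow gamma after a fully accepted draft, shrink it when fewer than half the drafted tokens survive, and otherwise hold.

```python
def next_gamma(gamma, num_accepted, gamma_min=1, gamma_max=8):
    """Pick the next draft lookahead from the last verification result.

    gamma         -- number of tokens drafted in the last step
    num_accepted  -- how many of those the verifier accepted (0..gamma)
    """
    if num_accepted == gamma:
        return min(gamma + 1, gamma_max)   # draft is tracking the target: look further
    if num_accepted * 2 < gamma:
        return max(gamma - 1, gamma_min)   # mostly wasted draft work: look less far
    return gamma
```

The bounds `gamma_min` and `gamma_max` are illustrative defaults. A real controller would likely smooth over several steps rather than react to one.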

Open Source

vLLM disaggregated prefill patch

Fixed request-ID mismatch across prefill and decode nodes causing KV cache transfer hangs.
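
The failure mode can be illustrated with a toy model. This is not vLLM code: the buffer class, the `#stage` suffix, and `canonical_id` are all hypothetical, chosen only to show why a key mismatch between the two sides turns into a wait that never completes.

```python
class KVTransferBuffer:
    """Toy stand-in for a KV hand-off keyed by request ID."""

    def __init__(self):
        self._blocks = {}

    def publish(self, request_id, kv_blocks):
        # Prefill side: store the finished KV blocks under its request ID.
        self._blocks[request_id] = kv_blocks

    def poll(self, request_id, max_polls=3):
        # Decode side: look the blocks up under *its* request ID.
        for _ in range(max_polls):
            if request_id in self._blocks:
                return self._blocks.pop(request_id)
        return None  # a real decode worker would keep waiting here: the hang

def canonical_id(request_id):
    """Hypothetical normaliser: drop any per-stage suffix so both sides agree."""
    return request_id.split("#", 1)[0]
```

With mismatched keys the decode side never finds the blocks. Deriving both keys from one canonical ID makes the hand-off succeed.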

Benchmark

Sparse MoE scaling run

Implemented top-k routing, z-loss balancing, entropy logging, and matched-FLOP dense baselines.
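
A minimal sketch of two of these pieces for a single router, under stated assumptions: gate weights are renormalised over the selected experts (one common convention, not necessarily the one used in this run), and the z-loss is the mean squared log-sum-exp of the router logits, which penalises large logits. Function names are hypothetical.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def route_top_k(logits, k):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    lse = logsumexp([logits[e] for e in top])
    # Softmax restricted to the selected experts, so the weights sum to 1.
    return [(e, math.exp(logits[e] - lse)) for e in top]

def router_z_loss(batch_logits):
    """Mean over tokens of logsumexp(router logits) squared."""
    return sum(logsumexp(row) ** 2 for row in batch_logits) / len(batch_logits)
```

For example, `route_top_k([0.0, 2.0, 1.0, -1.0], 2)` selects experts 1 and 2, with the larger gate weight on expert 1.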

2026

Research

Grokking Beyond Addition published

Circuit-level study across abelian and non-abelian algebraic operations published on Zenodo.

Benchmark

FlashAttention-2 kernel profiled

CUDA C++ tiled attention kernel reached 2.1x throughput over PyTorch SDPA at sequence length 4096.

Research

CompactifAI extension completed

Extended tensor-network compression with adaptive bond-dimension scheduling and model healing.
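
To make the role of the bond dimension concrete, here is a small parameter-count sketch. It assumes a two-core MPO factorisation of a weight matrix of shape (m1·m2) × (n1·n2) into cores of shape (m1, n1, χ) and (χ, m2, n2). The layer shapes and bond dimensions below are illustrative, not figures from this project.

```python
def dense_params(m1, m2, n1, n2):
    """Parameters in the original (m1*m2) x (n1*n2) weight matrix."""
    return m1 * m2 * n1 * n2

def mpo_params(m1, m2, n1, n2, bond_dim):
    """Parameters in two MPO cores joined by a bond of size bond_dim."""
    return bond_dim * (m1 * n1 + m2 * n2)

def memory_reduction(m1, m2, n1, n2, bond_dim):
    """Fraction of parameters removed by the factorisation."""
    return 1.0 - mpo_params(m1, m2, n1, n2, bond_dim) / dense_params(m1, m2, n1, n2)

def total_mpo_params(bond_dims, m1=64, m2=64, n1=64, n2=64):
    """Total over layers, one bond dimension per layer."""
    return sum(mpo_params(m1, m2, n1, n2, chi) for chi in bond_dims)

# Same parameter budget, spent uniformly or according to a per-layer schedule.
uniform = total_mpo_params([64, 64, 64, 64])
adaptive = total_mpo_params([128, 64, 32, 32])
```

Because the count is linear in χ, a per-layer schedule can move capacity toward sensitive layers without changing the total budget.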

Benchmarks

Measured systems signals

Speculative decoding: 2.4x

FlashAttention-2 kernel: 2.1x

Sparse MoE: 2.3x

API p95 reduction: 35%

MPO memory reduction: 93%

Systems map

Training, compression, kernels, serving, evaluation

View case studies

Current investigations

Internal notebook surface

Open notebook

Failed Experiments · failed

Non-Abelian Grokking Capacity Ceiling

Can the same one-layer transformer that groks abelian operations grok non-abelian groups under longer training?
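
For readers unfamiliar with the distinction, the sketch below builds the two kinds of task as operation tables: addition mod n (abelian) and composition in the symmetric group S_n (non-abelian for n ≥ 3). The specific groups and the dataset layout are assumptions for illustration, not the study's exact setup.

```python
from itertools import permutations

def cyclic_table(n):
    """Cayley table of addition mod n."""
    return [[(a + b) % n for b in range(n)] for a in range(n)]

def symmetric_table(n):
    """Cayley table of S_n, with permutations indexed in lexicographic order."""
    elems = list(permutations(range(n)))
    index = {p: i for i, p in enumerate(elems)}
    # (p o q)(x) = p[q[x]]: apply q first, then p.
    return [[index[tuple(p[q[x]] for x in range(n))] for q in elems] for p in elems]

def is_abelian(table):
    """True when a*b == b*a for every pair of elements."""
    size = len(table)
    return all(table[a][b] == table[b][a] for a in range(size) for b in range(size))

def make_dataset(table):
    """All (a, b, a*b) triples, the usual form of a grokking task."""
    size = len(table)
    return [(a, b, table[a][b]) for a in range(size) for b in range(size)]
```

Here `is_abelian(cyclic_table(7))` holds while `is_abelian(symmetric_table(3))` does not, so a model cannot exploit the symmetry a∘b = b∘a on the second task.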

Compression Studies · reproduced

Uniform MPO Compression Collapse

Does a single global bond dimension preserve quality across all transformer layers?

Benchmark Logs · archived

Fixed-Gamma Speculative Decoding

Is a fixed draft lookahead window enough for CPU speculative decoding?
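
The question can be framed with the standard speculative decoding analysis. The sketch assumes each drafted token is accepted independently with probability `alpha` and that one draft step costs `cost_ratio` of a target step; both are modelling simplifications, and the function names are hypothetical.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens emitted per verification step with gamma drafted tokens."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def expected_speedup(alpha, gamma, cost_ratio):
    """Expected tokens per step divided by the relative cost of that step."""
    return expected_tokens(alpha, gamma) / (gamma * cost_ratio + 1.0)

def best_gamma(alpha, cost_ratio, gamma_max=16):
    """Lookahead in 1..gamma_max that maximises the expected speedup."""
    return max(range(1, gamma_max + 1),
               key=lambda g: expected_speedup(alpha, g, cost_ratio))
```

Under this model the best lookahead moves with the acceptance rate: at `cost_ratio=0.1`, `best_gamma(0.3, 0.1)` is 1 while `best_gamma(0.9, 0.1)` is considerably larger, so a single fixed gamma is tuned for only one regime.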

Scaling Studies · reproduced

Sparse MoE Router Entropy

Can routing entropy predict expert collapse before validation loss exposes it?
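
One candidate signal is the entropy of the expert load distribution, sketched below. The normalisation by log(num_experts) and the function name are choices made for this illustration; the study may log a different entropy, such as per-token router entropy.

```python
import math
from collections import Counter

def normalized_load_entropy(assignments, num_experts):
    """Entropy of the expert-load histogram, scaled to [0, 1].

    assignments  -- expert index chosen for each routed token
    num_experts  -- total experts in the layer (must be > 1)
    Returns 1.0 for perfectly even load and 0.0 when one expert takes everything.
    """
    counts = Counter(assignments)
    total = len(assignments)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_experts)
```

A value drifting toward 0 over training steps would indicate load concentrating on a few experts. Whether that drift shows up before validation loss degrades is the open question.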

Contract engagements

Serving models in production and the GPU bill hurts?

Fixed-scope inference optimization sprints with a benchmark target agreed up front — and a guarantee behind it.

See engagements

Mani Pal
