Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

Global research index

Search every project, paper, benchmark, note, experiment, and article

The local index works immediately. After static export, Pagefind indexes the generated HTML for production search on Vercel.

Research 2026

Grokking Beyond Addition: Circuit-Level Analysis of Algebraic Learning in Transformers

This work extends grokking analysis beyond modular addition to eight algebraic operations across abelian fields, a composite ring, and non-abelian groups. A controlled transformer setup isolates when memorized algorithms become reusable circuits and when representation complexity blocks generalization.

Mechanistic Interpretability, Transformers, Grokking, Representation Geometry
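To make the abelian versus non-abelian split concrete, here is a minimal sketch that builds the operation table for one commutative operation and one non-commutative group. The choices of addition mod 7 and the symmetric group S_3 are illustrative assumptions; the summary above does not name the eight operations, and the function names are hypothetical.

```python
from itertools import permutations


def modular_addition_table(p):
    """Return {(a, b): (a + b) mod p} for all residues a, b."""
    return {(a, b): (a + b) % p for a in range(p) for b in range(p)}


def symmetric_group_table(n):
    """Return the composition table of S_n, with permutations stored as tuples."""
    elements = list(permutations(range(n)))
    # (f o g)(i) = f[g[i]]: apply g first, then f.
    return {
        (f, g): tuple(f[g[i]] for i in range(n))
        for f in elements
        for g in elements
    }


def is_commutative(table):
    """True when table[(a, b)] == table[(b, a)] for every pair in the table."""
    return all(table[(a, b)] == table[(b, a)] for (a, b) in table)


# Assumed examples: p = 7 and S_3 are illustrative, not taken from the study.
add_table = modular_addition_table(7)
s3_table = symmetric_group_table(3)
print(len(add_table), is_commutative(add_table))  # 49 True
print(len(s3_table), is_commutative(s3_table))    # 36 False
```

Each table entry is one (input pair, label) training example in the usual grokking setup, so the two tables differ in structure rather than in dataset format.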
Research 2026

Adaptive Tensor-Network Compression of LLMs: An Extension of CompactifAI

This project reproduces and extends CompactifAI-style tensor-network compression on real open-weight LLMs. It profiles layer sensitivity, replaces uniform bond dimensions with adaptive schedules, and evaluates healing runs across standard language benchmarks.

Model Compression, Tensor Networks, MPO, Quantization, LLM Evaluation
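As a rough illustration of what a bond-dimension schedule controls, the sketch below counts parameters for a weight matrix factored into a chain of MPO cores. The layer shape, the two-core split, and both schedules are assumptions chosen for easy arithmetic; they are not the project's configuration and the counts are not its results.

```python
def dense_params(rows, cols):
    """Parameter count of an uncompressed weight matrix."""
    return rows * cols


def mpo_params(row_factors, col_factors, bond_dim):
    """Parameter count of an MPO with one core per (row, col) factor pair.

    Core k has shape (left_bond, row_factors[k], col_factors[k], right_bond),
    where the two boundary bonds have dimension 1.
    """
    cores = len(row_factors)
    total = 0
    for k in range(cores):
        left = 1 if k == 0 else bond_dim
        right = 1 if k == cores - 1 else bond_dim
        total += left * row_factors[k] * col_factors[k] * right
    return total


def schedule_params(row_factors, col_factors, bond_schedule):
    """Total MPO parameters across layers, one bond dimension per layer."""
    return sum(mpo_params(row_factors, col_factors, chi) for chi in bond_schedule)


# Illustrative only: four 4096 x 4096 layers, each split as (64*64) x (64*64).
factors = [64, 64]
uniform = schedule_params(factors, factors, [32, 32, 32, 32])
adaptive = schedule_params(factors, factors, [64, 40, 16, 8])
print(dense_params(4096, 4096) * 4, uniform, adaptive)
```

In this toy setting the two schedules spend the same parameter budget; the adaptive one simply moves capacity toward the layers assumed to be more sensitive.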
System 2026

FlashAttention-2 CUDA Kernel

Custom IO-aware GPU attention engine. Rebuild attention from the memory hierarchy upward and understand exactly where framework kernels spend bandwidth, registers, shared memory, and occupancy.

CUDA, Attention, Kernel Engineering, Inference Optimization
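The core trick that lets an IO-aware attention kernel avoid materializing the full score row is the online softmax: keys and values are processed in blocks while a running maximum, running normalizer, and running output are rescaled. The following is a plain-Python sketch of that recurrence for a single query, not the CUDA kernel itself; the function names and block size are assumptions.

```python
import math


def attention_reference(q, keys, values):
    """Standard softmax attention for one query, using the full score row."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_online(q, keys, values, block_size=2):
    """Same result, computed block by block with running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        block_k = keys[start:start + block_size]
        block_v = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in block_k]
        new_max = max(running_max, max(scores))
        # Rescale previous partial results to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [
            acc[d] * correction + sum(w * v[d] for w, v in zip(weights, block_v))
            for d in range(dim)
        ]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_reference(q, keys, values))
print(attention_online(q, keys, values, block_size=2))
```

On a GPU each block would sit in shared memory while the running statistics stay in registers, which is where the bandwidth savings come from.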
System 2024-2025

Project Chimera

700M hybrid Mamba-2 and Transformer LLM trained from scratch. Own the full path from tokenizer and architecture decisions through pretraining, reasoning fine-tuning, alignment, quantization, and local inference.

LLM Training, Mamba, Transformers, Alignment, Tokenizer
System 2025

Speculative Decoding Runtime

OpenAI-compatible draft-verifier inference server. Increase tokens per second without changing the target model distribution, while keeping the serving API compatible with existing OpenAI-style clients.

Inference Systems, Speculative Decoding, CPU Inference, Serving
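The claim that speculation leaves the target distribution unchanged rests on the standard speculative-sampling acceptance rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. Below is a minimal single-token sketch of that rule; it is not this runtime's code, and the function names and toy distributions are assumptions.

```python
import random


def verify_draft_token(token, target_probs, draft_probs, rng):
    """Return (token, accepted) after applying the speculative-sampling rule."""
    p = target_probs[token]
    q = draft_probs[token]  # q > 0 because the draft model sampled this token
    if rng.random() < min(1.0, p / q):
        return token, True
    # Rejected: resample from the residual distribution max(0, p - q), renormalized.
    residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs, draft_probs)]
    replacement = rng.choices(range(len(residual)), weights=residual)[0]
    return replacement, False


def sample_with_speculation(target_probs, draft_probs, rng):
    """Draft one token from the draft distribution, then verify it."""
    proposal = rng.choices(range(len(draft_probs)), weights=draft_probs)[0]
    token, _accepted = verify_draft_token(proposal, target_probs, draft_probs, rng)
    return token


# Toy distributions over a three-token vocabulary (illustrative values).
target = [0.6, 0.3, 0.1]
draft = [0.2, 0.5, 0.3]
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(20000):
    counts[sample_with_speculation(target, draft, rng)] += 1
print([c / 20000 for c in counts])  # empirical frequencies, close to `target`
```

A real server applies this check to several drafted positions per verifier pass and stops at the first rejection.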
Open Source 2025

vLLM: PR #38816

Fixed a hang in the disaggregated prefill pipeline in which decode nodes failed to find KV cache tensors. Implemented request-ID normalization at the prefill-decode boundary, refactored KV cache lookup semantics, and added targeted tests for matched and mismatched ID formats. The fix resolved indefinite hangs in distributed inference deployments and improved reliability for high-throughput disaggregated serving.

vLLM, Distributed Inference, KV Cache, Reliability
Benchmark

Speculative decoding

2.4x measured result in Mani Pal's research lab index.

Benchmark, Performance
Benchmark

FlashAttention-2 kernel

2.1x measured result in Mani Pal's research lab index.

Benchmark, Performance
Benchmark

Sparse MoE

2.3x measured result in Mani Pal's research lab index.

Benchmark, Performance
Benchmark

API p95 reduction

35% measured result in Mani Pal's research lab index.

Benchmark, Performance
Benchmark

MPO memory reduction

93% measured result in Mani Pal's research lab index.

Benchmark, Performance
Experiment 2026-03

Non-Abelian Grokking Capacity Ceiling

Can the same one-layer transformer that groks abelian operations grok non-abelian groups under longer training? Longer training alone did not cross the boundary; the failure is likely constrained by capacity or representation geometry rather than by optimizer patience.

Grokking, Interpretability, Capacity
System 2025

Sparse MoE Scaling

Top-k routing, z-loss load balancing, and expert collapse analysis. Reproduce the practical failure modes of sparse MoE training, especially expert-utilization collapse, under a controlled matched-FLOP benchmark.

MoE, Distributed Training, Routing, Scaling Studies
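For readers unfamiliar with the two router pieces named above, here is a small sketch of top-k gating and the router z-loss (the mean squared log-sum-exp of the router logits). It is a generic formulation in plain Python, not this project's training code; the function names and example logits are assumptions.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(x)))."""
    top = max(logits)
    return top + math.log(sum(math.exp(x - top) for x in logits))


def top_k_route(logits, k):
    """Return [(expert_index, gate_weight)] with softmax renormalized over the top k."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    norm = logsumexp([logits[i] for i in chosen])
    return [(i, math.exp(logits[i] - norm)) for i in chosen]


def router_z_loss(batch_logits):
    """Mean over tokens of logsumexp(router logits) squared."""
    return sum(logsumexp(row) ** 2 for row in batch_logits) / len(batch_logits)


# Two tokens routed over four experts (illustrative logits).
batch = [[1.0, 3.0, 2.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
for row in batch:
    print(top_k_route(row, k=2))
print(router_z_loss(batch))
```

The z-loss penalizes large router logits, which keeps the gating softmax from saturating; expert utilization still has to be monitored separately.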
System 2024

VAANI

Hindi-first fully offline voice assistant. Build a local voice assistant that keeps speech, reasoning, and synthesis offline while preserving practical latency on consumer CPU hardware.

Offline AI, Voice Systems, Hindi, Edge Inference
Research 2026

Credit Assignment in Spiking Neural Networks: Bridging Bioplausibility and Scalability

This investigation studies how scalable gradient methods and biologically plausible local learning rules diverge when training recurrent spiking neural networks on temporal tasks.

Spiking Neural Networks, Credit Assignment, Surrogate Gradients, Online Learning
Experiment 2026-02

Uniform MPO Compression Collapse

Does a single global bond dimension preserve quality across all transformer layers? Layer sensitivity is too uneven for uniform MPO schedules to be the final compression policy.

Tensor Networks, Compression, MMLU
Experiment 2025-11

Fixed-Gamma Speculative Decoding

Is a fixed draft lookahead window enough for CPU speculative decoding? Fixed gamma creates brittle prompt-family dependence.

Speculative Decoding, Inference, Benchmarking
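One way to see why a fixed lookahead is brittle is the standard speculative-decoding estimate: if each drafted token is accepted independently with probability alpha, a window of gamma drafts yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens per verifier pass. The sketch below uses that estimate with a simple cost model; the independence assumption, the cost ratio, and the function names are all assumptions, not measurements from this experiment.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per verifier pass under i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, gamma, cost_ratio):
    """Tokens per step divided by step cost, in units of one target forward pass.

    cost_ratio is the assumed cost of one draft step relative to one target step.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


def best_gamma(alpha, cost_ratio, max_gamma=16):
    """Lookahead window that maximizes the estimated speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: estimated_speedup(alpha, g, cost_ratio))


# Two hypothetical prompt families with different acceptance rates.
for alpha in (0.5, 0.9):
    g = best_gamma(alpha, cost_ratio=0.1)
    print(alpha, g, round(estimated_speedup(alpha, g, 0.1), 3))
```

Under these assumptions the preferred window moves substantially with the acceptance rate, so any single fixed gamma is tuned for only one kind of prompt.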
Experiment 2025-09

Sparse MoE Router Entropy

Can routing entropy predict expert collapse before validation loss exposes it? Routing entropy is a leading diagnostic for MoE health.

MoE, Routing, z-loss
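A routing-entropy diagnostic can be as simple as the normalized entropy of the expert load histogram over a batch. This sketch assumes top-1 assignments and a hypothetical function name; the experiment's exact metric is not specified above.

```python
import math
from collections import Counter


def routing_entropy(assignments, num_experts):
    """Normalized entropy of expert load: 1.0 is perfectly balanced, 0.0 is collapsed."""
    counts = Counter(assignments)
    total = len(assignments)
    entropy = 0.0
    for expert in range(num_experts):
        frac = counts.get(expert, 0) / total
        if frac > 0.0:
            entropy -= frac * math.log(frac)
    return entropy / math.log(num_experts)


balanced = [0, 1, 2, 3] * 25          # every expert receives 25 tokens
skewed = [0] * 85 + [1, 2, 3] * 5     # expert 0 receives 85 of 100 tokens
collapsed = [0] * 100                 # a single expert receives everything
for name, batch in (("balanced", balanced), ("skewed", skewed), ("collapsed", collapsed)):
    print(name, round(routing_entropy(batch, num_experts=4), 3))
```

Logging this value per layer and per step makes a drift toward collapse visible as a falling curve, independent of what the loss is doing.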
Experiment 2026-04

Activation Patching for Algebraic Circuits

Can causal patching separate memorized lookup behavior from algorithmic circuit behavior? This is the next causal validation layer for the grokking study.

Activation Patching, Grokking, Mechanistic Interpretability
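The mechanics of activation patching can be shown on a toy additive model: run a clean and a corrupted input, copy one component's clean activation into the corrupted run, and measure how much of the output gap it restores. Everything here is a hypothetical stand-in (two scalar "components" summed into an output), not the transformer or the circuits from the grokking study.

```python
def run(components, x, patches=None):
    """Sum component outputs; `patches` overrides named components' activations."""
    patches = patches or {}
    cache = {}
    for name, fn in components.items():
        cache[name] = patches[name] if name in patches else fn(x)
    return sum(cache.values()), cache


def patching_effect(components, clean_x, corrupted_x):
    """Fraction of the clean-vs-corrupted output gap restored by patching each component."""
    clean_out, clean_cache = run(components, clean_x)
    corrupted_out, _ = run(components, corrupted_x)
    gap = clean_out - corrupted_out
    effects = {}
    for name in components:
        patched_out, _ = run(components, corrupted_x, {name: clean_cache[name]})
        effects[name] = (patched_out - corrupted_out) / gap
    return effects


# Hypothetical components: one depends on the input, one ignores it.
toy_model = {
    "input_dependent": lambda x: 3.0 * x,
    "input_independent": lambda x: 1.0,
}
print(patching_effect(toy_model, clean_x=2.0, corrupted_x=0.0))
```

In a real transformer the components would be attention heads or MLP outputs read from the residual stream, and the effect would be measured on a logit difference or loss.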
Experiment 2025-08

KV Cache ID Normalization Tests

Can a unit-level reproduction catch prefill/decode request-ID mismatches before deployment hangs? Distributed inference bugs need contract tests around identifiers and transfer semantics.

vLLM, KV Cache, Distributed Serving
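A contract test of this kind can be sketched without any serving stack: store KV blocks under a normalized key on the prefill side and look them up with a differently formatted ID on the decode side. The prefixes, the store class, and every name below are hypothetical illustrations; none of this is vLLM's API or the actual ID format involved in the fix.

```python
def normalize_request_id(request_id, prefixes=("chatcmpl-", "cmpl-")):
    """Strip an assumed client-facing prefix so both sides share one bare ID."""
    for prefix in prefixes:
        if request_id.startswith(prefix):
            return request_id[len(prefix):]
    return request_id


class KVTransferStore:
    """Toy stand-in for the prefill-to-decode KV handoff, keyed by normalized ID."""

    def __init__(self):
        self._blocks = {}

    def put(self, request_id, blocks):
        self._blocks[normalize_request_id(request_id)] = blocks

    def get(self, request_id):
        # Returning None lets the caller fail fast instead of waiting indefinitely.
        return self._blocks.get(normalize_request_id(request_id))


def test_matched_and_mismatched_formats():
    store = KVTransferStore()
    store.put("cmpl-abc123", ["block0", "block1"])
    assert store.get("cmpl-abc123") == ["block0", "block1"]  # matched format
    assert store.get("abc123") == ["block0", "block1"]       # bare ID on decode side
    assert store.get("cmpl-zzz999") is None                  # unknown request


test_matched_and_mismatched_formats()
print("contract test passed")
```

The point of the test is the contract, not the prefix list: whatever formats the two sides use, a put followed by a get for the same logical request must succeed.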
Writing 2026-01

Transformer Internals as a Systems Interface

The residual stream is the real systems interface of a transformer: training, inference, interpretability, and compression all negotiate with it.

Transformers, Residual Stream, Interpretability
Writing 2026-01

Attention Mechanisms Under IO Pressure

The useful mental model for modern attention kernels is not the softmax equation; it is the path data takes through HBM, SRAM, registers, and warps.

Attention, FlashAttention, CUDA
Writing 2026-02

CUDA Optimization Notes from an Attention Kernel

CUDA optimization is the discipline of making memory motion, register pressure, and occupancy legible enough to trade them deliberately.

CUDA, Nsight, Kernel Engineering
Writing 2025-12

Mamba Architectures in Hybrid LLM Training

Hybrid SSM-attention models are best treated as architectural experiments whose evaluation must cover long-context behavior, tokenizer behavior, and deployment cost together.

Mamba, SSM, LLM Training
Writing 2025-10

Sparse Models Fail Quietly Before They Fail Loudly

Sparse MoE systems can look healthy on loss curves while the router is already collapsing. Entropy and load metrics need to be first-class.

MoE, Routing Entropy, Scaling
Writing 2025-11

Inference Systems Are Acceptance-Rate Control Problems

Speculative decoding speedup is controlled by acceptance-rate dynamics, not merely by choosing a smaller draft model.

Speculative Decoding, Serving, Latency
Writing 2026-03

Mechanistic Interpretability Needs Negative Results

Failed grokking runs are not noise; they can expose representation capacity boundaries when paired with the right spectral and causal diagnostics.

Interpretability, Grokking, Negative Results