Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

Research dashboard

Living research lab for frontier AI systems

Mani Pal is an engineer-researcher working across LLM systems, inference optimization, CUDA kernel engineering, mechanistic interpretability, model compression, and distributed AI infrastructure.

01

700M LLM from scratch

Hybrid Mamba-2 and Transformer model with GRPO, DPO, GGUF, and CPU inference logs.

02

FlashAttention-2 CUDA kernels

Tiled IO-aware attention kernel profiled at 2.1x over PyTorch SDPA on A100.

03

2.4x speculative decoding

Draft-verifier runtime with temperature-corrected rejection sampling and adaptive gamma.

04

CompactifAI extension

Adaptive tensor-network compression with layer sensitivity profiling and policy search.

05

vLLM contribution

Fixed a request-ID bug in disaggregated-prefill KV cache transfer, with production reliability impact.

06

Published interpretability research

Circuit-level grokking study published on Zenodo and prepared for arXiv submission.

Published Research

3

Mechanistic interpretability, tensor-network compression, and SNN credit assignment.

Open Source Contributions

vLLM

Distributed inference reliability patch in disaggregated prefill KV transfer.

Systems Built

5

LLM training, CUDA kernels, inference runtimes, sparse MoE, and offline voice AI.

Benchmarks

2.4x

Speculative decoding speedup, 2.1x attention throughput, 93% memory reduction.

Current Investigations

7

Open questions across compression, routing collapse, grokking, and online learning.

Timeline

Milestones

Research, systems, benchmarks, and open-source work across the lab record.

2023

System

Production RAG and model adaptation

Delivered legal-tech RAG and QLoRA fine-tuning work before focusing fully on LLM systems.

2024

System

Project Chimera begins

Designed a 700M hybrid Mamba-2 and Transformer LLM, tokenizer, training schedule, and evaluation path.

System

VAANI offline assistant

Built a Hindi-first offline voice stack with wake word, ASR, local LLM, and TTS under 800 ms.

2025

Benchmark

Speculative decoding reaches 2.4x

Implemented draft-verifier inference with adaptive lookahead and distribution-preserving sampling.

Open Source

vLLM disaggregated prefill patch

Fixed request-ID mismatch across prefill and decode nodes causing KV cache transfer hangs.

Benchmark

Sparse MoE scaling run

Implemented top-k routing, z-loss balancing, entropy logging, and matched-FLOP dense baselines.

2026

Research

Grokking Beyond Addition published

Circuit-level study across abelian and non-abelian algebraic operations published on Zenodo.

Benchmark

FlashAttention-2 kernel profiled

CUDA C++ tiled attention kernel reached 2.1x throughput over PyTorch SDPA at sequence length 4096.

Research

CompactifAI extension completed

Extended tensor-network compression with adaptive bond-dimension scheduling and model healing.
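
The 2025 speculative decoding milestone above mentions distribution-preserving sampling. As a rough illustration only (not the runtime's actual code), the standard accept-or-resample rule from the speculative sampling literature can be sketched in plain Python. The function name `accept_or_resample` and the list-based probability inputs are hypothetical choices made for this sketch; temperature handling and adaptive lookahead are not shown.

```python
import random


def accept_or_resample(draft_token, p_target, q_draft, rng=random):
    """Decide the fate of one drafted token (hypothetical helper).

    p_target and q_draft are probability lists over the same vocabulary:
    the verifier's and the draft model's distributions at this position.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the drafted token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q), renormalized
    # implicitly by passing unnormalized weights to rng.choices.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Degenerate case: both distributions match, so use the target.
        residual = list(p_target)
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False
```

When drafted tokens are sampled from `q_draft` and passed through this rule, the emitted tokens follow `p_target`, which is why the speedup does not change the output distribution.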

Benchmarks

Measured systems signals

Speculative decoding: 2.4x
FlashAttention-2 kernel: 2.1x
Sparse MoE: 2.3x
API p95 reduction: 35%
MPO memory reduction: 93%

Systems map

Training, compression, kernels, serving, evaluation


Current investigations

Internal notebook surface

Failed Experiments (failed)

Non-Abelian Grokking Capacity Ceiling

Can the same one-layer transformer that groks abelian operations grok non-abelian groups under longer training?

Compression Studies (reproduced)

Uniform MPO Compression Collapse

Does a single global bond dimension preserve quality across all transformer layers?

Benchmark Logs (archived)

Fixed-Gamma Speculative Decoding

Is a fixed draft lookahead window enough for CPU speculative decoding?

Scaling Studies (reproduced)

Sparse MoE Router Entropy

Can routing entropy predict expert collapse before validation loss exposes it?
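
To make the router-entropy question concrete, here is a minimal sketch of the kind of signal that could be logged per batch. It is an assumption-laden illustration, not the lab's implementation: `router_stats` is a hypothetical name, entropy is measured over the fraction of top-k assignments each expert receives (the page does not say which entropy is logged), and the z-loss term uses one common formulation, the squared log-sum-exp of the router logits averaged over tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_stats(batch_logits, k=2):
    """Summarize routing for a batch (hypothetical helper).

    batch_logits: one list of router logits per token, one logit per expert.
    """
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    load = [0.0] * n_experts
    z_loss = 0.0
    for logits in batch_logits:
        probs = softmax(logits)
        # Top-k routing: keep the k experts with the highest probability.
        top = sorted(range(n_experts), key=lambda e: probs[e], reverse=True)[:k]
        for e in top:
            load[e] += 1.0 / (k * n_tokens)
        # Router z-loss term: squared log-sum-exp of the raw logits.
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        z_loss += lse ** 2 / n_tokens
    # Entropy of the expert load; it falls toward 0 as routing collapses.
    entropy = -sum(f * math.log(f) for f in load if f > 0)
    return {
        "load": load,
        "load_entropy": entropy,
        "max_entropy": math.log(n_experts),
        "z_loss": z_loss,
    }
```

Tracking `load_entropy` against `max_entropy` over training steps gives a ratio near 1 for balanced routing and near 0 when a few experts absorb most tokens, which is the signal the question above asks about.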