System
Production RAG and model adaptation
Delivered legal-tech RAG and QLoRA fine-tuning work before focusing fully on LLM systems.
Engineer-researcher
LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.
Research dashboard
Mani Pal is an engineer-researcher working across LLM systems, inference optimization, CUDA kernel engineering, mechanistic interpretability, model compression, and distributed AI infrastructure.
Hybrid Mamba-2 and Transformer model with GRPO, DPO, GGUF, and CPU inference logs.
Tiled IO-aware attention kernel profiled at 2.1x over PyTorch SDPA on A100.
Draft-verifier runtime with temperature-corrected rejection sampling and adaptive gamma.
Adaptive tensor-network compression with layer sensitivity profiling and policy search.
Request-ID bug fix for disaggregated-prefill KV cache transfer, with production reliability impact.
Circuit-level grokking study published on Zenodo and prepared for arXiv submission.
Published Research
3
Mechanistic interpretability, tensor-network compression, and SNN credit assignment.
Open Source Contributions
vLLM
Distributed inference reliability patch in disaggregated prefill KV transfer.
Systems Built
5
LLM training, CUDA kernels, inference runtimes, sparse MoE, and offline voice AI.
Benchmarks
2.4x
Speculative decoding speedup, 2.1x attention throughput, 93% memory reduction.
Current Investigations
7
Open questions across compression, routing collapse, grokking, and online learning.
Timeline
Research, systems, benchmarks, and open-source work across the lab record.
2023
System
Delivered legal-tech RAG and QLoRA fine-tuning work before focusing fully on LLM systems.
2024
System
Designed a 700M hybrid Mamba-2 and Transformer LLM, tokenizer, training schedule, and evaluation path.
System
Built a Hindi-first offline voice stack with wake word, ASR, local LLM, and TTS under 800 ms.
2025
Benchmark
Implemented draft-verifier inference with adaptive lookahead and distribution-preserving sampling.
Open Source
Fixed request-ID mismatch across prefill and decode nodes causing KV cache transfer hangs.
Benchmark
Implemented top-k routing, z-loss balancing, entropy logging, and matched-FLOP dense baselines.
2026
Research
Circuit-level study across abelian and non-abelian algebraic operations published on Zenodo.
Benchmark
CUDA C++ tiled attention kernel reached 2.1x throughput over PyTorch SDPA at sequence length 4096.
Research
Extended tensor-network compression with adaptive bond-dimension scheduling and model healing.
Benchmarks
Systems map
Current investigations
Can the same one-layer transformer that groks abelian operations grok non-abelian groups under longer training?
Does a single global bond dimension preserve quality across all transformer layers?
Is a fixed draft lookahead window enough for CPU speculative decoding?
Can routing entropy predict expert collapse before validation loss exposes it?
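As an illustration of the last question, here is a minimal sketch of how routing entropy could be tracked from per-expert token counts. It is not taken from the project code: the function names, the 0.5 threshold, and the patience of 3 steps are all assumptions chosen for the example.

```python
import math

def routing_entropy(expert_counts):
    """Shannon entropy (in nats) of the empirical expert-load distribution."""
    total = sum(expert_counts)
    if total == 0:
        raise ValueError("no tokens were routed")
    # Experts that received no tokens contribute zero to the entropy.
    probs = [c / total for c in expert_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def normalized_entropy(expert_counts):
    """Entropy divided by log(num_experts): 1.0 is uniform load, 0.0 is full collapse."""
    n = len(expert_counts)
    if n < 2:
        return 0.0
    return routing_entropy(expert_counts) / math.log(n)

def collapse_warning(history, threshold=0.5, patience=3):
    """Hypothetical early-warning rule (threshold and patience are assumed values):
    flag when normalized entropy stays below `threshold` for `patience` steps in a row."""
    recent = history[-patience:]
    return len(recent) == patience and all(h < threshold for h in recent)

if __name__ == "__main__":
    # Token counts per expert at successive logging steps (made-up numbers).
    steps = [[25, 25, 25, 25], [70, 10, 10, 10], [90, 5, 3, 2], [97, 1, 1, 1], [99, 1, 0, 0]]
    history = [normalized_entropy(counts) for counts in steps]
    print([round(h, 3) for h in history])
    print("collapse warning:", collapse_warning(history))
```

Whether a signal like this leads validation loss is exactly the open question. The sketch only shows one way the entropy side of that comparison could be logged.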