Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

Systems

Engineering case studies, not project blurbs

Each system is structured around motivation, design constraints, architecture, bottlenecks, optimization decisions, benchmark methodology, results, and lessons learned.

active · 2026

FlashAttention-2 CUDA Kernel

Custom IO-aware GPU attention engine

Rebuild attention from the memory hierarchy upward and understand exactly where framework kernels spend bandwidth, registers, shared memory, and occupancy.

Throughput

2.1x

Versus PyTorch SDPA on an A100 at sequence length 4096.

Tile

64x64

Q block and K/V block shape.

Memory

O(N)

Streaming residency rather than O(N^2) attention storage.
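
The O(N) memory claim rests on the online-softmax recurrence: K/V are visited one block at a time, and only a running maximum, a running normalizer, and a running output are kept per query row. The sketch below is a plain-Python illustration of that recurrence for a single query, not the CUDA kernel itself; the function names, the block size default, and the random test data are assumptions made for the example.

```python
import math
import random

def naive_attention(q, keys, values):
    """Reference: materialize every score for one query row, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def streaming_attention(q, keys, values, block=64):
    """Visit K/V in blocks, keeping only a running max, sum, and output."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Hypothetical test data: 300 keys, so the last block is a partial one.
rng = random.Random(0)
n, d = 300, 8
q = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

ref = naive_attention(q, keys, values)
out = streaming_attention(q, keys, values, block=64)
max_diff = max(abs(a - b) for a, b in zip(ref, out))
```

The streaming version never holds more than one block of scores at a time, which is why the full N x N score matrix does not need to be stored.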

CUDA · Attention · Kernel Engineering · Inference Optimization

active · 2024-2025

Project Chimera

700M hybrid Mamba-2 and Transformer LLM trained from scratch

Own the full path from tokenizer and architecture decisions through pretraining, reasoning fine-tuning, alignment, quantization, and local inference.

Parameters

700M

Hybrid Mamba-2 and Transformer model.

Artifact

4.2GB

INT4 GGUF quantized checkpoint.

Latency

<3s

First token on CPU-only inference.

LLM Training · Mamba · Transformers · Alignment · Tokenizer

active · 2025

Speculative Decoding Runtime

OpenAI-compatible draft-verifier inference server

Increase tokens per second without changing the target model distribution, while keeping the serving API compatible with existing OpenAI-style clients.

Speedup

2.4x

Mean CPU tokens per second versus verifier-only generation.

Draft

0.5B

Qwen2.5 draft model.

Verifier

3B

Qwen2.5 verifier model.
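
Keeping the target distribution unchanged is usually achieved with the standard speculative-sampling acceptance rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. The sketch below shows that rule for a single token over a toy vocabulary and checks analytically that the output distribution equals the verifier's. The probabilities and function names are illustrative assumptions, not taken from the runtime.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q), sampled from when a draft token is rejected."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:  # p == q, so a rejection can never happen
        return list(p)
    return [x / total for x in diff]

def speculative_step(p, q, rng):
    """Propose one token from the draft q, then accept or resample against p."""
    draft = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft
    residual = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=residual)[0]

def output_distribution(p, q):
    """Exact distribution of speculative_step, computed without sampling."""
    # q(x) * min(1, p(x) / q(x)) simplifies to min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    residual = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accepted, residual)]

# Hypothetical next-token distributions over a 3-token vocabulary.
verifier_p = [0.5, 0.3, 0.2]
draft_q = [0.2, 0.5, 0.3]

result = output_distribution(verifier_p, draft_q)
sample = speculative_step(verifier_p, draft_q, random.Random(0))
```

Here `result` matches `verifier_p` up to floating-point error, which is the property that lets a draft model speed up generation without altering what the verifier would have produced.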

Inference Systems · Speculative Decoding · CPU Inference · Serving

reproduced · 2025

Sparse MoE Scaling

Top-k routing, z-loss load balancing, and expert collapse analysis

Reproduce the practical failure modes of sparse MoE training, especially expert-utilization collapse, under a controlled matched-FLOP benchmark.

Throughput

2.3x

Sparse versus dense matched-FLOP baseline.

Experts

8

Top-2 sparse routing.

Active params

125M

Active parameter budget per token.
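
As a rough illustration of the routing terms named above, the sketch below implements top-2 routing over 8 experts, a router z-loss (mean squared log-sum-exp of the router logits), and a Switch-Transformer-style load-balancing loss. The exact loss forms, the toy logits, and all function names are assumptions for the example and may differ from the ones used in this study.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k=2):
    """Return the k highest-scoring experts and their renormalized gates."""
    probs = softmax(logits)
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / mass for i in chosen]

def router_z_loss(batch_logits):
    """Mean squared log-sum-exp of router logits; discourages large logits."""
    total = 0.0
    for logits in batch_logits:
        top = max(logits)
        lse = top + math.log(sum(math.exp(x - top) for x in logits))
        total += lse * lse
    return total / len(batch_logits)

def load_balance_loss(batch_logits, k=2):
    """num_experts * sum_i f_i * P_i; equals 1.0 under perfectly uniform routing."""
    num_experts = len(batch_logits[0])
    counts = [0] * num_experts
    mean_prob = [0.0] * num_experts
    for logits in batch_logits:
        chosen, _ = top_k_route(logits, k)
        for i in chosen:
            counts[i] += 1
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / len(batch_logits)
    slots = k * len(batch_logits)
    frac = [c / slots for c in counts]  # f_i: share of routed token slots
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

def toy_logits(first, second, num_experts=8):
    """Hypothetical router output preferring two experts."""
    row = [0.0] * num_experts
    row[first], row[second] = 2.0, 1.0
    return row

# Balanced: each token prefers a different pair of experts.
balanced = [toy_logits(t, (t + 1) % 8) for t in range(8)]
# Collapsed: every token prefers experts 0 and 1.
collapsed = [toy_logits(0, 1) for _ in range(8)]

balanced_loss = load_balance_loss(balanced)
collapsed_loss = load_balance_loss(collapsed)
z_loss = router_z_loss(balanced)
```

The collapsed batch scores a higher balancing loss than the balanced one, which is the signal such an auxiliary term uses to push the router away from expert collapse.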

MoE · Distributed Training · Routing · Scaling Studies

active · 2024

VAANI

Hindi-first fully offline voice assistant

Build a local voice assistant that keeps speech, reasoning, and synthesis offline while preserving practical latency on consumer CPU hardware.

Latency

<800ms

Wake to spoken response on CPU.

Network

0

No internet dependency.

Plugins

8

Layered extension architecture.

Offline AI · Voice Systems · Hindi · Edge Inference