Systems
Engineering case studies, not project blurbs
Each system is structured around motivation, design constraints, architecture, bottlenecks, optimization decisions, benchmark methodology, results, and lessons learned.
FlashAttention-2 CUDA Kernel
Custom IO-aware GPU attention engine
Rebuild attention from the memory hierarchy upward to see exactly where framework kernels spend bandwidth, registers, and shared memory, and where they give up occupancy.
Throughput
2.1x
Versus PyTorch SDPA on an A100 at sequence length 4096.
Tile
64x64
Q block and K/V block shape.
Memory
O(N)
K/V tiles are streamed through on-chip memory, so the O(N^2) attention matrix is never stored.
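The O(N) memory figure comes from the online-softmax trick: each query row keeps only a running maximum, a running normalizer, and an output accumulator while K/V tiles stream past. The sketch below illustrates that idea in plain Python and is not the CUDA kernel itself. The function names and the per-row loop are assumptions made for readability; the kernel described above also tiles Q into blocks and keeps tiles in shared memory.

```python
import math

def naive_attention(Q, K, V):
    """Reference version: builds the full score row for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=64):
    """Streams K/V in tiles of `block` rows using an online softmax.

    Per query row, only a running max, a running sum, and an accumulator
    are kept, so no N x N score matrix is ever materialized.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf          # running max of scores seen so far
        l = 0.0                # running sum of exp(score - m)
        acc = [0.0] * dv       # running weighted sum of V rows
        for start in range(0, len(K), block):
            k_tile = K[start:start + block]
            v_tile = V[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
            m_new = max(m, max(scores))
            # Rescale previous partial results to the new max
            # (exp(-inf) is 0.0, which handles the first tile).
            correction = math.exp(m - m_new)
            l *= correction
            acc = [a * correction for a in acc]
            for s, v in zip(scores, v_tile):
                p = math.exp(s - m_new)
                l += p
                for j in range(dv):
                    acc[j] += p * v[j]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Both functions should agree to floating-point tolerance for any tile size, which makes the naive version a convenient correctness oracle when benchmarking a real kernel.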
Project Chimera
700M hybrid Mamba-2 and Transformer LLM trained from scratch
Own the full path from tokenizer and architecture decisions through pretraining, reasoning fine-tuning, alignment, quantization, and local inference.
Parameters
700M
Hybrid Mamba-2 and Transformer model.
Artifact
4.2GB
INT4 GGUF quantized checkpoint.
Latency
<3s
Time to first token with CPU-only inference.
Speculative Decoding Runtime
OpenAI-compatible draft-verifier inference server
Increase tokens per second without changing the target model distribution, while keeping the serving API compatible with existing OpenAI-style clients.
Speedup
2.4x
Mean CPU tokens per second versus verifier-only generation.
Draft
0.5B
Qwen2.5 draft model.
Verifier
3B
Qwen2.5 verifier model.
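The goal of leaving the target model's distribution unchanged is usually met with the standard speculative-sampling acceptance rule: accept a draft token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized residual max(0, p - q). The sketch below shows that rule for a single token position. It assumes this runtime uses the standard rule, which the text above does not confirm; the function names and toy distributions are hypothetical.

```python
import random

def speculative_accept(p_target, q_draft, rng):
    """Sample one token using the draft distribution plus a verifier check.

    Returns (token, accepted). The returned token is distributed according
    to p_target, regardless of how good q_draft is.
    """
    tokens = list(range(len(q_draft)))
    x = rng.choices(tokens, weights=q_draft)[0]       # draft proposes x ~ q
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True                                 # verifier accepts
    # Rejected: resample from the part of p that q under-covers.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(tokens, weights=residual)[0], False

def exact_output_distribution(p_target, q_draft):
    """Closed-form distribution of the token returned by speculative_accept."""
    accepted = [min(p, q) for p, q in zip(p_target, q_draft)]
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    reject_mass = 1.0 - sum(accepted)
    res_total = sum(residual)
    if res_total == 0.0:        # p == q: every draft token is accepted
        return accepted
    return [a + reject_mass * r / res_total for a, r in zip(accepted, residual)]
```

A poor draft model lowers the acceptance rate, and with it the speedup, but the output distribution stays equal to the verifier's.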
Sparse MoE Scaling
Top-k routing, z-loss and load-balancing losses, and expert-collapse analysis
Reproduce the practical failure modes of sparse MoE training, especially expert-utilization collapse, under a controlled matched-FLOP benchmark.
Throughput
2.3x
Sparse versus dense matched-FLOP baseline.
Experts
8
Top-2 sparse routing.
Active params
125M
Active parameter budget per token.
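To make the routing terms concrete, the sketch below implements top-2 routing over 8 experts together with two commonly used auxiliary terms: a router z-loss (mean squared log-sum-exp of the router logits) and a Switch-style load-balancing loss. These formulations are assumptions, since the exact losses used in this benchmark are not stated above, and all function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [x / s for x in e]

def route_top_k(logits, k=2):
    """Pick the k most probable experts; gate weights are renormalized to sum to 1."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def router_z_loss(batch_logits):
    """Mean squared log-sum-exp of router logits; penalizes large logits."""
    total = 0.0
    for logits in batch_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(batch_logits)

def load_balance_loss(batch_logits, k=2):
    """n_experts * sum_i f_i * P_i.

    f_i is the fraction of routing slots sent to expert i and P_i is the
    mean router probability for expert i. Perfectly uniform routing gives
    1.0; collapse onto a few experts pushes the value higher.
    """
    n_experts = len(batch_logits[0])
    n_tokens = len(batch_logits)
    frac = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in batch_logits:
        for i, _gate in route_top_k(logits, k):
            frac[i] += 1.0 / (n_tokens * k)
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))
```

Tracking the load-balancing value and the per-expert slot fractions over training is one simple way to see utilization collapse as it happens: the loss drifts away from 1.0 as a few experts absorb most tokens.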
VAANI
Hindi-first fully offline voice assistant
Build a local voice assistant that keeps speech, reasoning, and synthesis offline while preserving practical latency on consumer CPU hardware.
Latency
<800ms
Wake word to spoken response on CPU.
Network
0
No internet dependency.
Plugins
8
Layered extension architecture.