Systems

Engineering case studies, not project blurbs

Each system is structured around motivation, design constraints, architecture, bottlenecks, optimization decisions, benchmark methodology, results, and lessons learned.

active · 2026

FlashAttention-2 CUDA Kernel

Custom IO-aware GPU attention engine

Rebuild attention from the memory hierarchy upward and understand exactly where framework kernels spend bandwidth, registers, shared memory, and occupancy.

Throughput

2.1x

Versus PyTorch SDPA on an A100 at sequence length 4096.

Tile

64x64

Q block and K/V block shape.

Memory

O(N)

Working memory grows linearly with sequence length: K/V tiles are streamed through on-chip memory, so the O(N^2) attention matrix is never stored.
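The O(N) memory figure rests on the online-softmax recurrence, which can be illustrated on the CPU. The sketch below is not the CUDA kernel itself; it is a minimal pure-Python illustration, and the function names (`naive_attention`, `streaming_attention`) and the `block` parameter are hypothetical. It assumes a single head, no masking, and finite inputs. Each query row keeps only a running max, a running denominator, and an output accumulator while K/V blocks stream past.

```python
import math


def naive_attention(q, k, v):
    """Reference version: builds the full row of N scores per query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def streaming_attention(q, k, v, block=64):
    """Processes K/V one block at a time with an online softmax.

    Per query row the state is a running max, a running softmax
    denominator, and an output accumulator, so no N x N matrix exists.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        m = float("-inf")  # running max of the scores seen so far
        l = 0.0            # running softmax denominator
        acc = [0.0] * dv   # unnormalized output accumulator
        for start in range(0, len(k), block):
            kb = k[start:start + block]
            vb = v[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in kb]
            m_new = max(m, max(scores))
            # Rescale earlier partial sums to the new max (0.0 on the first block).
            correction = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pj * vj[c] for pj, vj in zip(p, vb))
                   for c, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Both functions should agree to floating-point tolerance for any block size. A GPU kernel applies the same recurrence to a tile of query rows at once, with the K/V block held in shared memory.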

CUDA, Attention, Kernel Engineering, Inference Optimization

active · 2024-2025

Project Chimera

700M hybrid Mamba-2 and Transformer LLM trained from scratch

Own the full path from tokenizer and architecture decisions through pretraining, reasoning fine-tuning, alignment, quantization, and local inference.

Parameters

700M

Hybrid Mamba-2 and Transformer model.

Artifact

4.2GB

INT4 GGUF quantized checkpoint.

Latency

<3s

First token on CPU-only inference.

LLM Training, Mamba, Transformers, Alignment, Tokenizer

active · 2025

Speculative Decoding Runtime

OpenAI-compatible draft-verifier inference server

Increase tokens per second without changing the target model distribution, while keeping the serving API compatible with existing OpenAI-style clients.

Speedup

2.4x

Mean CPU tokens per second versus verifier-only generation.

Draft

0.5B

Qwen2.5 draft model.

Verifier

Qwen2.5 verifier model.
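The claim that speed improves without changing the target model distribution is usually backed by the standard speculative-sampling accept/reject rule. The sketch below assumes the runtime uses that rule, which the text above does not confirm. The function names and the toy three-token distributions are hypothetical. It covers a single token position only; a real runtime drafts several tokens and verifies them in one batched pass.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """Draw one token whose distribution matches p_target.

    A token x is proposed from the draft distribution q and accepted with
    probability min(1, p[x] / q[x]). On rejection, a token is resampled
    from the residual max(p - q, 0), which random.choices normalizes.
    """
    tokens = list(range(len(q_draft)))
    x = rng.choices(tokens, weights=q_draft)[0]
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    return rng.choices(tokens, weights=residual)[0], False


def empirical_distribution(p_target, q_draft, samples, seed=0):
    """Estimate the output distribution and the draft acceptance rate."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    accepted = 0
    for _ in range(samples):
        token, ok = speculative_step(p_target, q_draft, rng)
        counts[token] += 1
        accepted += ok
    return [c / samples for c in counts], accepted / samples
```

With, for example, `p_target = [0.6, 0.3, 0.1]` and `q_draft = [0.2, 0.5, 0.3]`, the estimated distribution should approach `p_target` as the sample count grows, even though every proposal comes from the draft. The acceptance rate is what drives the speedup: the closer the draft is to the verifier, the more drafted tokens survive each verification pass.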

Inference Systems, Speculative Decoding, CPU Inference, Serving

reproduced · 2025