Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

System case study / 2025

Speculative Decoding Runtime

OpenAI-compatible draft-verifier inference server

active

Python / PyTorch / llama.cpp / GGUF / FastAPI

Inference Systems / Speculative Decoding / CPU Inference / Serving

Speedup

2.4x

Mean CPU tokens per second versus verifier-only generation.

Draft

0.5B

Qwen2.5 draft model.

Verifier

3B

Qwen2.5 verifier model.

Motivation

Increase tokens per second without changing the target model distribution, while keeping the serving API compatible with existing OpenAI-style clients.

Design Constraints

  • Keep the output distribution mathematically identical to verifier-only sampling on both the acceptance and the rejection (resampling) paths.
  • Pair a small draft model with a stronger verifier.
  • Work on CPU-friendly quantized GGUF inference.
  • Expose a /v1/completions-compatible FastAPI endpoint.
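
As an illustration of the compatibility constraint, the sketch below builds a response body in the shape OpenAI-style completions clients expect. It uses only the standard library; the function name `completion_response` and the model label are hypothetical, and the actual server presumably returns an equivalent structure from its FastAPI handler.

```python
import time
import uuid


def completion_response(model, text, prompt_tokens, completion_tokens,
                        finish_reason="stop"):
    """Build a completions-style response body (hypothetical helper)."""
    return {
        "id": f"cmpl-{uuid.uuid4().hex}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "text": text,
                "index": 0,
                "logprobs": None,
                "finish_reason": finish_reason,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }


if __name__ == "__main__":
    # "qwen2.5-3b-verifier" is an illustrative label, not a name from the project.
    print(completion_response("qwen2.5-3b-verifier", "hello", 3, 1))
```

Nothing about draft tokens, gamma, or acceptance rate appears in the body, so an existing client should see it as an ordinary completion.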

System Architecture

  • Qwen2.5-0.5B draft model proposes a gamma-length continuation.
  • Qwen2.5-3B verifier scores the proposal tokens.
  • Temperature-corrected rejection sampling accepts or repairs the draft path.
  • Adaptive gamma scheduling adjusts lookahead from observed acceptance rate.
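
The accept-or-repair step can be sketched as follows. This is a minimal version of standard speculative sampling, not the project's code. It assumes the draft distribution q and the verifier distribution p at each position are already computed at the same sampling temperature. All function names are hypothetical.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p equals q, so a rejection cannot happen; fall back to p defensively.
        return list(p)
    return [d / total for d in diff]


def sample(dist, rng):
    """Draw one token id from a discrete distribution."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def verify_draft(draft_tokens, draft_probs, verifier_probs, rng):
    """Accept a prefix of the draft, then emit one verifier-side token.

    draft_tokens:   the gamma proposed token ids.
    draft_probs:    draft_probs[i] is the distribution q_i that produced
                    draft_tokens[i].
    verifier_probs: gamma + 1 distributions p_i; the last one is used for
                    the bonus token when every draft token is accepted.
    Returns (emitted_tokens, number_of_accepted_draft_tokens).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = verifier_probs[i], draft_probs[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # Repair: replace the rejected token and stop at this position.
            out.append(sample(residual_distribution(p, q), rng))
            return out, i
    # Every draft token accepted: take one extra token from the verifier.
    out.append(sample(verifier_probs[len(draft_tokens)], rng))
    return out, len(draft_tokens)
```

Each call emits at least one token, so generation always makes progress even when the first draft token is rejected.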

Performance Bottlenecks

  • Verifier calls dominate latency when acceptance rate falls.
  • High-entropy text reduces the value of long draft lookahead.
  • CPU quantized inference requires careful batching and memory reuse.
  • OpenAI API compatibility limits how much runtime state can be exposed in the client-facing contract.
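
The first two bottlenecks can be made concrete with a simple cost model. The sketch below assumes each draft token is accepted independently with the same probability `alpha`, which is a simplification, and that one draft step costs a fixed fraction `draft_cost_ratio` of one verifier pass. Both parameter names are illustrative and the numbers are not measurements from this system.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verifier pass.

    With i.i.d. acceptance probability alpha and lookahead gamma, this is
    the geometric sum 1 + alpha + ... + alpha**gamma.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def modeled_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Modeled speedup over verifier-only decoding.

    One round costs gamma draft steps plus one verifier pass, measured in
    units of a verifier pass.
    """
    round_cost = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, gamma) / round_cost


if __name__ == "__main__":
    for alpha in (0.3, 0.6, 0.9):
        row = [round(modeled_speedup(alpha, g, 0.2), 2) for g in (1, 2, 4, 8)]
        print(alpha, row)
```

Under this model a long lookahead pays off only when `alpha` is high. With a low acceptance rate the extra draft steps are mostly wasted and the modeled speedup can drop below 1.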

Optimization Decisions

  • Use adaptive gamma instead of a fixed lookahead window.
  • Specialize for low-entropy outputs such as code and JSON while degrading gracefully.
  • Keep sampling correction explicit and testable.
  • Record acceptance traces per prompt family.
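
One possible shape for an adaptive gamma controller is sketched below. It smooths the observed acceptance rate with an exponential moving average and moves the lookahead up or down by one step. The class name, thresholds, and bounds are assumptions chosen for illustration, not the values the runtime uses.

```python
from dataclasses import dataclass


@dataclass
class GammaController:
    """Hypothetical controller that adjusts lookahead from acceptance rate."""

    gamma: int = 4
    min_gamma: int = 1
    max_gamma: int = 8
    ema: float = 0.5           # smoothed acceptance rate
    smoothing: float = 0.2     # weight given to the newest observation
    grow_above: float = 0.8    # lengthen lookahead above this rate
    shrink_below: float = 0.4  # shorten lookahead below this rate

    def update(self, accepted: int, proposed: int) -> int:
        """Record one verification round and return the next gamma."""
        if proposed <= 0:
            return self.gamma
        rate = accepted / proposed
        self.ema = (1.0 - self.smoothing) * self.ema + self.smoothing * rate
        if self.ema > self.grow_above:
            self.gamma = min(self.max_gamma, self.gamma + 1)
        elif self.ema < self.shrink_below:
            self.gamma = max(self.min_gamma, self.gamma - 1)
        return self.gamma


if __name__ == "__main__":
    ctl = GammaController()
    # Simulated rounds: (accepted, proposed) pairs, not real traces.
    for accepted, proposed in [(4, 4), (4, 4), (5, 5), (1, 5), (0, 4)]:
        print(ctl.update(accepted, proposed), round(ctl.ema, 3))
```

The gap between the two thresholds keeps gamma from oscillating when the acceptance rate sits near a single cutoff.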

Benchmark Methodology

  • Measured mean tokens per second over code, JSON, and natural-language prompts.
  • Compared against verifier-only generation.
  • Validated distribution preservation by checking rejection-sampling paths.
  • Tracked acceptance rate against selected gamma windows.
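
One way to check distribution preservation empirically is a Monte Carlo test on a toy vocabulary. It samples many tokens through the accept-or-resample rule and compares the empirical frequencies with the verifier distribution. This is an illustrative sketch with hypothetical names and made-up distributions, not the project's test suite.

```python
import random
from collections import Counter


def speculative_token(p, q, rng):
    """Sample one token: propose from q, accept or resample toward p."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q, k=1)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # A rejection implies p[x] < q[x], so the residual has positive mass.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual, k=1)[0]


def empirical_distribution(p, q, n, seed=0):
    """Frequencies of tokens produced by n speculative samples."""
    rng = random.Random(seed)
    counts = Counter(speculative_token(p, q, rng) for _ in range(n))
    return [counts[i] / n for i in range(len(p))]


def total_variation(a, b):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    verifier = [0.6, 0.3, 0.1]  # toy target distribution p
    draft = [0.2, 0.5, 0.3]     # toy draft distribution q
    estimate = empirical_distribution(verifier, draft, 100_000)
    print(estimate, total_variation(estimate, verifier))
```

A test can then assert that the total variation distance stays below a tolerance chosen for the sample size.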

Results

  • Achieved 2.4x mean tokens-per-second speedup on CPU.
  • Maintained an output distribution identical to sampling from the verifier model alone.
  • Delivered an OpenAI-compatible completions endpoint.
  • Showed graceful degradation on high-entropy text.

Lessons Learned

  • Speculative decoding is a control problem around acceptance rate, not just a two-model trick.
  • Low-entropy workloads are where adaptive gamma produces the most predictable returns.
  • Distribution-correct rejection sampling should be visible in tests rather than implied.