Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

Inference Systems / 2025-11

Inference Systems Are Acceptance-Rate Control Problems

Speculative decoding speedup is controlled by acceptance-rate dynamics, not merely by choosing a smaller draft model.

13 min

Speculative Decoding, Serving, Latency

Outline

  • Draft-verifier architecture.
  • Distribution-correct rejection sampling.
  • Adaptive gamma scheduling.
  • Prompt families and speedup variance.
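To make the acceptance-rate framing concrete, here is a minimal sketch of the usual back-of-the-envelope model. It assumes each drafted token is accepted independently with a fixed probability alpha, which real prompts only approximate. The function names, the cost_ratio parameter (draft step time divided by verifier step time), and the max_gamma cap are illustrative assumptions, not taken from the post.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verifier pass.

    Assumes each of the gamma drafted tokens is accepted independently
    with probability alpha; the verifier always contributes one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the verifier's own token.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Speedup over plain autoregressive decoding under the same assumption.

    One speculative step costs gamma draft steps plus one verifier pass,
    measured in units of a single verifier pass.
    """
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, gamma) / step_cost


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    """Draft length in [0, max_gamma] that maximizes the estimated speedup."""
    return max(
        range(max_gamma + 1),
        key=lambda g: estimated_speedup(alpha, g, cost_ratio),
    )


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        g = best_gamma(alpha, cost_ratio=0.05)
        print(alpha, g, round(estimated_speedup(alpha, g, 0.05), 2))
```

Under this model the best draft length shifts with alpha even when the draft model's cost is held fixed, which is one way to read the case for scheduling gamma adaptively rather than fixing it once.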

References