Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

System case study / 2024

VAANI

Hindi-first fully offline voice assistant

Status: active
Python / Whisper / Qwen2.5-3B / Piper TTS / XTTS v2 / openWakeWord
Offline AI / Voice Systems / Hindi / Edge Inference

Latency

<800ms

Wake to spoken response on CPU.

Network

0

No internet dependency.

Plugins

8

Layered extension architecture.

Motivation

Build a local voice assistant that keeps speech, reasoning, and synthesis offline while preserving practical latency on consumer CPU hardware.

Design Constraints

  • Zero internet dependency.
  • Hindi-first interaction loop.
  • Consumer CPU latency target below one second.
  • Modular plugin architecture without changing the core inference loop.

System Architecture

  • openWakeWord detection triggers the pipeline.
  • Whisper-small performs local ASR.
  • Qwen2.5-3B-Instruct (Q4_K_M quantization) handles local reasoning within the model's 32K-token context window.
  • Piper TTS and fine-tuned XTTS v2 produce speech output.
  • Eight-layer plugin architecture isolates tools, memory, routing, and generation.
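
The stage order listed above can be sketched as a small orchestration class. This is a minimal illustration, not the project's actual code: `VoicePipeline` and its four callables are hypothetical names, and the stub lambdas stand in for openWakeWord, Whisper-small, Qwen2.5-3B-Instruct, and Piper/XTTS v2 so the example runs without any model installed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoicePipeline:
    """Hypothetical wiring of the four stages; each field is a plain callable."""
    detect_wake: Callable[[bytes], bool]   # wake-word stage
    transcribe: Callable[[bytes], str]     # ASR stage
    generate: Callable[[str], str]         # LLM stage
    synthesize: Callable[[str], bytes]     # TTS stage

    def handle(self, wake_audio: bytes, utterance_audio: bytes) -> Optional[bytes]:
        """Run one wake-to-response turn; return None if no wake word fired."""
        if not self.detect_wake(wake_audio):
            return None
        text = self.transcribe(utterance_audio)
        reply = self.generate(text)
        return self.synthesize(reply)


# Stub stages (assumptions for illustration only, not real model calls).
pipeline = VoicePipeline(
    detect_wake=lambda audio: audio.startswith(b"WAKE"),
    transcribe=lambda audio: audio.decode("utf-8"),
    generate=lambda text: f"echo: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)

speech = pipeline.handle(b"WAKE", "नमस्ते".encode("utf-8"))
```

Because every stage is an injected callable, any one of them can be swapped or profiled without touching `handle`, which is one way to keep the core loop unchanged as the system grows.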

Performance Bottlenecks

  • ASR and TTS latency under CPU-only constraints.
  • Hindi corpus quality for voice persona fine-tuning.
  • Context management with local quantized model memory.
  • Tool plugin boundaries in an offline runtime.
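
One common way to handle the context-management bottleneck is to keep only the most recent turns that fit a token budget. The sketch below is an assumption about how that could look, not a description of VAANI's memory layer: `trim_history` and `whitespace_tokens` are hypothetical helpers, and a real deployment would count tokens with the model's own tokenizer rather than by whitespace.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


def trim_history(
    turns: List[Turn],
    budget: int,
    count_tokens: Callable[[str], int],
) -> List[Turn]:
    """Keep the newest contiguous turns whose total token count fits the budget."""
    kept: List[Turn] = []
    used = 0
    for role, text in reversed(turns):      # walk from newest to oldest
        cost = count_tokens(text)
        if used + cost > budget:
            break                            # older turns are dropped
        kept.append((role, text))
        used += cost
    kept.reverse()                           # restore chronological order
    return kept


def whitespace_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer (assumption for illustration)."""
    return len(text.split())


history = [
    ("user", "a b c"),
    ("assistant", "d e"),
    ("user", "f g h i"),
]
recent = trim_history(history, budget=6, count_tokens=whitespace_tokens)
```

With a budget of 6 whitespace tokens, the oldest turn is dropped and the last two are kept in their original order.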

Optimization Decisions

  • Use quantized local model execution.
  • Fine-tune XTTS v2 on AI4Bharat Hindi corpus.
  • Keep plugin interfaces thin and deterministic.
  • Optimize each stage independently before end-to-end latency tuning.
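
A thin, deterministic plugin interface might look like the following. All names here (`Plugin`, `PluginRegistry`, `WordCountPlugin`) are hypothetical and chosen for illustration; the source does not describe the real interface. Determinism in this sketch comes from dispatching in sorted name order, so the same input always reaches the same plugin regardless of registration order.

```python
from typing import Dict, Optional, Protocol


class Plugin(Protocol):
    """Hypothetical two-method plugin contract."""
    name: str

    def can_handle(self, text: str) -> bool: ...
    def run(self, text: str) -> str: ...


class PluginRegistry:
    def __init__(self) -> None:
        self._plugins: Dict[str, Plugin] = {}

    def register(self, plugin: Plugin) -> None:
        if plugin.name in self._plugins:
            raise ValueError(f"duplicate plugin name: {plugin.name}")
        self._plugins[plugin.name] = plugin

    def dispatch(self, text: str) -> Optional[str]:
        """Return the first matching plugin's result, or None to fall through."""
        for name in sorted(self._plugins):   # fixed order, independent of registration
            plugin = self._plugins[name]
            if plugin.can_handle(text):
                return plugin.run(text)
        return None


class WordCountPlugin:
    """Toy plugin: answers 'count <words...>' with the number of words."""
    name = "word_count"

    def can_handle(self, text: str) -> bool:
        return text.startswith("count ")

    def run(self, text: str) -> str:
        return str(len(text.split()) - 1)


registry = PluginRegistry()
registry.register(WordCountPlugin())
result = registry.dispatch("count एक दो तीन")
```

A `None` return lets the core loop fall back to the language model, so adding a plugin means one `register` call and no change to the loop itself.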

Benchmark Methodology

  • Measured wake-to-response end-to-end latency.
  • Profiled ASR, LLM, and TTS stages separately.
  • Tested offline operation with no network dependency.
  • Validated new plugin integration without modifying core runtime.
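
Per-stage and end-to-end timing of the kind described above can be collected with a small context manager built on `time.perf_counter`. This is a sketch under stated assumptions: `StageTimer` is a hypothetical helper, and the `time.sleep` calls are placeholders for the real ASR, LLM, and TTS stages, so the numbers it produces say nothing about the actual system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import median


class StageTimer:
    """Collects wall-clock samples (in milliseconds) per named stage."""

    def __init__(self) -> None:
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples[name].append(elapsed_ms)

    def median_ms(self, name: str) -> float:
        return median(self.samples[name])


timer = StageTimer()
for _ in range(3):                      # repeat to smooth out jitter
    with timer.stage("end_to_end"):
        with timer.stage("asr"):
            time.sleep(0.002)           # placeholder for Whisper inference
        with timer.stage("llm"):
            time.sleep(0.003)           # placeholder for LLM generation
        with timer.stage("tts"):
            time.sleep(0.002)           # placeholder for speech synthesis

for name in ("asr", "llm", "tts", "end_to_end"):
    print(f"{name}: {timer.median_ms(name):.1f} ms (median of {len(timer.samples[name])})")
```

Nesting the stage timers inside the end-to-end timer makes it easy to see how much of the total is orchestration overhead rather than model time.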

Results

  • Achieved wake-to-response latency under 800 ms on a consumer CPU.
  • Kept the entire voice pipeline offline.
  • Supported modular plugin extension with stable core interfaces.
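
One way to check an offline claim like this in a test is to make any outbound connection attempt fail loudly while the pipeline runs. The sketch below is an assumption about how such a check could be written, not the method used in this project: `network_blocked` and `NetworkAccessError` are hypothetical names. It patches `socket.socket.connect` only, so it catches TCP-style connects made from Python (including loopback) but not connectionless sends or native code that bypasses the `socket` module.

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when code under test tries to open a connection."""


@contextmanager
def network_blocked():
    """Temporarily replace socket.connect so any connection attempt raises."""
    original_connect = socket.socket.connect

    def refuse(self, address):
        raise NetworkAccessError(f"blocked connection attempt to {address!r}")

    socket.socket.connect = refuse
    try:
        yield
    finally:
        socket.socket.connect = original_connect   # always restore


def tries_to_connect() -> bool:
    """Return True if a connect call was blocked inside the guard."""
    with network_blocked():
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect(("127.0.0.1", 9))
            return False
        except NetworkAccessError:
            return True
        finally:
            sock.close()


blocked = tries_to_connect()
```

Wrapping a full wake-to-response turn in `network_blocked()` would turn any hidden download or telemetry call into a test failure instead of a silent dependency.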

Lessons Learned

  • Offline assistants are latency orchestration problems as much as model problems.
  • Language-first UX changes tokenizer, ASR, TTS, and memory decisions.
  • Local privacy constraints make deterministic system boundaries valuable.