Engineer-researcher

Mani Pal

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

Available for contract work

palmani2410@gmail.com

Delhi, India

System case study / 2024

VAANI

Hindi-first fully offline voice assistant

active

Python / Whisper / Qwen2.5-3B / Piper TTS / XTTS v2 / openWakeWord

Offline AI / Voice Systems / Hindi / Edge Inference

Latency

<800ms

Wake to spoken response on CPU.

Network

0

No internet dependency.

Plugins

8

Layered extension architecture.

Motivation

Build a local voice assistant that keeps speech, reasoning, and synthesis offline while preserving practical latency on consumer CPU hardware.

Design Constraints

Zero internet dependency.
Hindi-first interaction loop.
Consumer CPU latency target below one second.
Modular plugin architecture without changing the core inference loop.
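
The zero-internet constraint can be checked mechanically in tests. The following is a minimal sketch, not the project's actual test harness: the names `no_network`, `run_offline`, and `NetworkAccessError` are hypothetical, and the guard only catches sockets opened through Python's `socket` module.

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when guarded code tries to open a socket (hypothetical name)."""


@contextmanager
def no_network():
    """Temporarily replace socket.socket so any new socket raises."""
    original = socket.socket

    def blocked(*args, **kwargs):
        raise NetworkAccessError("network access attempted in offline mode")

    socket.socket = blocked
    try:
        yield
    finally:
        # Always restore the real class, even if the guarded code failed.
        socket.socket = original


def run_offline(fn, *args, **kwargs):
    """Run fn with socket creation blocked and return its result."""
    with no_network():
        return fn(*args, **kwargs)
```

Wrapping a full wake-to-response call in `run_offline` would surface any stage that silently reaches for the network. Native libraries that open sockets outside Python are not covered by this guard, so it complements rather than replaces testing with the network interface disabled.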

System Architecture

openWakeWord detection triggers the pipeline.
Whisper-small performs local ASR.
Qwen2.5-3B-Instruct (Q4_K_M quantization) handles local reasoning; the 3B model supports a context window of up to 32K tokens.
Piper TTS and fine-tuned XTTS v2 produce speech output.
Eight-layer plugin architecture isolates tools, memory, routing, and generation.
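
The stage order above can be sketched as a chain of four callables. This is an illustrative outline under stated assumptions, not the VAANI source: `VoicePipeline` and its field names are hypothetical, and the lambdas are stand-ins for the real openWakeWord, Whisper, Qwen, and Piper/XTTS calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoicePipeline:
    """Hypothetical wiring of the four stages; each field is a plain callable."""
    detect_wake: Callable[[bytes], bool]   # wake-word stage
    transcribe: Callable[[bytes], str]     # ASR stage
    generate: Callable[[str], str]         # LLM stage
    synthesize: Callable[[str], bytes]     # TTS stage

    def handle(self, audio: bytes) -> Optional[bytes]:
        """Return synthesized audio, or None if no wake word was detected."""
        if not self.detect_wake(audio):
            return None
        text = self.transcribe(audio)
        reply = self.generate(text)
        return self.synthesize(reply)


# Stand-in stages so the sketch runs without any model installed.
pipeline = VoicePipeline(
    detect_wake=lambda audio: audio.startswith(b"WAKE"),
    transcribe=lambda audio: "samay kya hai",
    generate=lambda text: f"echo: {text}",
    synthesize=lambda reply: reply.encode("utf-8"),
)
```

Keeping every stage behind a plain callable means each model can be swapped or profiled on its own without touching the control flow in `handle`.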

Performance Bottlenecks

ASR and TTS latency under CPU-only constraints.
Hindi corpus quality for voice persona fine-tuning.
Context management within the memory limits of a local quantized model.
Tool plugin boundaries in an offline runtime.
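
One common way to handle the context-management bottleneck is to keep only the most recent turns that fit a token budget. The sketch below assumes that strategy; the source does not say which one VAANI uses. `trim_history` and `whitespace_tokens` are hypothetical names, and the whitespace counter is a crude stand-in for the model's real tokenizer.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


def whitespace_tokens(text: str) -> int:
    """Crude token estimate; a real system would use the model tokenizer."""
    return len(text.split())


def trim_history(
    turns: List[Turn],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_tokens,
) -> List[Turn]:
    """Keep the newest contiguous turns whose total cost fits max_tokens."""
    kept: List[Turn] = []
    used = 0
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for role, text in reversed(turns):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break
        kept.append((role, text))
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

A smaller prompt also shortens prompt processing on CPU, so the budget acts as a latency control as well as a memory control.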

Optimization Decisions

Use quantized local model execution.
Fine-tune XTTS v2 on AI4Bharat Hindi corpus.
Keep plugin interfaces thin and deterministic.
Optimize each stage independently before end-to-end latency tuning.
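
A thin, deterministic plugin interface could look like the registry below. This is a sketch under assumptions rather than the project's real API: `PluginRegistry`, `register`, and `dispatch` are hypothetical names, and a plugin is modeled as a function from utterance text to reply text.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str], str]  # utterance text in, reply text out


class PluginRegistry:
    """Maps an intent name to exactly one handler (hypothetical design)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, intent: str, handler: Handler) -> None:
        """Add a plugin; reject duplicates so dispatch stays deterministic."""
        if intent in self._handlers:
            raise ValueError(f"intent already registered: {intent}")
        self._handlers[intent] = handler

    def dispatch(self, intent: str, utterance: str) -> Optional[str]:
        """Call the handler for intent, or return None if none is registered."""
        handler = self._handlers.get(intent)
        if handler is None:
            return None
        return handler(utterance)


registry = PluginRegistry()
registry.register("shout", lambda utterance: utterance.upper())
```

Because the core loop only ever calls `dispatch`, adding a plugin is a single `register` call and needs no change to the runtime itself.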

Benchmark Methodology

Measured wake-to-response end-to-end latency.
Profiled ASR, LLM, and TTS stages separately.
Tested offline operation with no network dependency.
Validated new plugin integration without modifying core runtime.
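
Per-stage profiling of this kind can be done with a small timer built on `time.perf_counter`. The sketch below is illustrative and not the benchmark harness behind the reported numbers; `StageTimer` and its methods are hypothetical names.

```python
import time
from contextlib import contextmanager
from typing import Dict


class StageTimer:
    """Accumulates wall-clock time per named stage, in milliseconds."""

    def __init__(self) -> None:
        self.durations_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        """Time the enclosed block and add it to the named stage."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self) -> float:
        """Sum of all recorded stage times."""
        return sum(self.durations_ms.values())

    def within_budget(self, budget_ms: float) -> bool:
        """True if the summed stage time does not exceed the budget."""
        return self.total_ms() <= budget_ms
```

Each stage would be wrapped as `with timer.stage("asr"): ...`, and `timer.within_budget(800)` compared against the latency target. The summed stage time leaves out any gaps between stages, so a separate wake-to-response measurement is still needed for the end-to-end figure.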

Results

Achieved wake-to-response latency under 800 ms on a consumer CPU.
Kept the entire voice pipeline offline.
Supported modular plugin extension with stable core interfaces.

Lessons Learned

Offline assistants are latency orchestration problems as much as model problems.
Language-first UX changes tokenizer, ASR, TTS, and memory decisions.
Local privacy constraints make deterministic system boundaries valuable.