At Praxium Labs we build Nepali voice assistants for businesses every month; this guide is the field-tested version. Voice is the dream interface for older Nepali users and for hands-busy contexts (drivers, field workers). The technology is finally ready for some use cases and clearly not ready for others.
The four-stage pipeline
1. Voice Activity Detection (VAD): detect when the user starts and stops talking. Silero VAD works fine for Nepali.
2. Speech-to-text (ASR): convert audio to Nepali / English text
3. LLM brain: understand intent, generate response
4. Text-to-speech (TTS): synthesise the response back as audio
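The four stages can be wired together as plain functions passed into one turn handler. The sketch below is a minimal illustration, not a production design: `VoiceTurn`, `run_turn` and the four `stub_*` functions are hypothetical names, and the stubs stand in for a real VAD, ASR model, LLM and TTS engine.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoiceTurn:
    """Result of one voice-in / voice-out turn."""
    transcript: str
    reply_text: str
    reply_audio: bytes


def run_turn(
    audio: bytes,
    detect_speech: Callable[[bytes], Optional[bytes]],
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesise: Callable[[str], bytes],
) -> Optional[VoiceTurn]:
    """Run VAD -> ASR -> LLM -> TTS once; return None if no speech was found."""
    speech = detect_speech(audio)          # stage 1: VAD
    if speech is None:
        return None
    transcript = transcribe(speech)        # stage 2: ASR
    reply_text = respond(transcript)       # stage 3: LLM
    reply_audio = synthesise(reply_text)   # stage 4: TTS
    return VoiceTurn(transcript, reply_text, reply_audio)


# Stub stages (assumptions) so the sketch runs without any model installed.
def stub_vad(audio: bytes) -> Optional[bytes]:
    # Treat all-zero audio as silence; a real VAD would also trim the clip.
    return audio if audio.strip(b"\x00") else None


def stub_asr(speech: bytes) -> str:
    return "मेरो ब्यालेन्स कति छ"


def stub_llm(text: str) -> str:
    return f"You asked: {text}"


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Keeping each stage behind a plain callable makes it easy to swap one ASR or TTS option for another without touching the rest of the turn logic.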
ASR options in May 2026
- OpenAI Whisper Large v3: WER 12–18% on clean Nepali audio; open weights, free to self-host, roughly 3 seconds of processing per minute of audio on a single GPU. The current default for most production Nepali ASR.
- AssemblyAI: commercial API, WER ~13% on Nepali, multilingual, real-time streaming. Pay-per-second, ~NPR 80/hour of audio.
- Google Cloud Speech-to-Text: supports Nepali, WER ~15%, real-time streaming. Pay-per-second.
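- Local Indic-specific models: AI4Bharat's IndicConformer, or other models deployed through the sherpa-ncnn runtime (a k2-fsa project, not an AI4Bharat one), run on edge devices. WER on Nepali is higher than Whisper's, but there is no cloud dependency.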
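The WER figures above are word error rates: the number of word substitutions, deletions and insertions needed to turn the ASR output into the reference transcript, divided by the number of reference words. A minimal sketch for measuring it on your own recordings follows; `word_error_rate` is a hypothetical helper, and it assumes whitespace tokenisation with no text normalisation, which real evaluations usually add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, if the reference is "मेरो खाता को ब्यालेन्स कति छ" and the ASR drops the word "को", one deletion over six reference words gives a WER of about 17%.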
TTS options and the accent problem
- ElevenLabs Multilingual v2: supports Nepali; voice quality is high but accent is audibly non-native
- Coqui XTTS-v2: open-source, Nepali via fine-tuning, fair quality
- Google Cloud TTS: Nepali voices (Standard quality), robotic-sounding
- Indic-TTS / AI4Bharat: Nepali voice models from IIT Madras research; weaker quality but natural intonation
For customer-facing voice in 2026 we typically combine two engines: ElevenLabs for English replies (when the user spoke English) and Indic-TTS for Nepali replies (when the user spoke Nepali). This accent compromise feels less jarring than a single voice that handles both languages badly.
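One simple way to route between the two engines is to look at the script of the ASR transcript. The sketch below is an assumption about how such routing could work, not a description of any vendor API: `pick_tts_engine` and the engine labels are hypothetical, and it assumes the ASR writes Nepali in Devanagari (romanised Nepali would be routed to the English voice).

```python
def has_devanagari(word: str) -> bool:
    """True if the word contains any character from the Devanagari block."""
    return any("\u0900" <= ch <= "\u097f" for ch in word)


def pick_tts_engine(transcript: str, threshold: float = 0.5,
                    default: str = "indic_tts") -> str:
    """Choose a TTS engine label from the share of Devanagari words.

    The labels "indic_tts" and "elevenlabs" are placeholders for whatever
    client objects the application actually uses.
    """
    words = transcript.split()
    if not words:
        return default  # nothing to go on; the default is an arbitrary choice
    nepali_share = sum(has_devanagari(w) for w in words) / len(words)
    return "indic_tts" if nepali_share >= threshold else "elevenlabs"
```

Counting words rather than characters keeps code-switched utterances such as "मेरो balance कति छ" on the Nepali voice, since three of the four words are Devanagari.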
Where voice works today (Nepal)
- IVR replacement for hotlines: instead of a "Press 1 for English" menu, the caller says what they need and is routed to the right agent
- Banking voice OTP / balance inquiry: controlled scripts, predictable phrasing
- Field-worker reporting: agricultural extension workers reporting visit notes by voice; ASR transcribes, LLM structures into a form
- Voice-driven local-information kiosks: tourism info booths, government office signposting
- Hands-busy logistics (driver dispatch): "where is order #1234" voice-queried, voice-answered
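For the field-worker reporting case, the LLM is typically asked to return the transcript as JSON, and the application validates that JSON before it touches the form. The sketch below illustrates only the validation step; the `VisitNote` fields (`village`, `crop`, `issue`) are hypothetical examples, not a real extension-service schema.

```python
import json
from dataclasses import dataclass

# Hypothetical form fields; replace with the fields of the real form.
REQUIRED_FIELDS = ("village", "crop", "issue")


@dataclass
class VisitNote:
    village: str
    crop: str
    issue: str


def parse_visit_note(llm_output: str) -> VisitNote:
    """Parse and validate the JSON the LLM was asked to return.

    Raises ValueError on malformed JSON or on missing or empty fields, so
    the caller can re-prompt the LLM or ask the worker to repeat the note.
    """
    data = json.loads(llm_output)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [
        field for field in REQUIRED_FIELDS
        if not isinstance(data.get(field), str) or not data[field].strip()
    ]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return VisitNote(**{field: data[field].strip() for field in REQUIRED_FIELDS})
```

Rejecting incomplete output at this point keeps ASR or LLM mistakes from silently ending up in the stored record.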
Where voice still does not work
- General-purpose consumer voice assistants: Siri-quality Nepali experience is not there yet
- Long-form transcription with mixed accents: WER spikes when, for example, a meeting mixes Eastern Nepal and Kathmandu accents
- Children's voices: ASR struggles with high-pitched non-adult speech in any language
- Noisy backgrounds: WER on a busy Kathmandu street rises to 30% or more
- Regional languages: Bhojpuri, Maithili, Tamang, Newari and Tharu are separate languages rather than dialects of Nepali, and are mostly out of scope
Costs
- ASR (Whisper self-host): small GPU server ~NPR 8,000–15,000/month
- ASR (commercial): ~NPR 80–150 per hour of audio
- LLM brain: NPR 1.4–3.2 per conversation (see pricing breakdown)
- TTS: ElevenLabs ~NPR 22 per minute of generated audio; Google Cloud cheaper, Indic-TTS free
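The ASR figures above imply a break-even volume between self-hosting Whisper and paying a commercial API per hour. The arithmetic can be sketched as follows; `break_even_hours` is a hypothetical helper, and it assumes the server cost is the only self-hosting cost, ignoring engineering and maintenance time.

```python
def break_even_hours(server_npr_per_month: float, api_npr_per_hour: float) -> float:
    """Hours of audio per month at which self-hosting costs the same as the API."""
    if api_npr_per_hour <= 0:
        raise ValueError("API price per hour must be positive")
    return server_npr_per_month / api_npr_per_hour


# Cheapest server against the most expensive API, and the reverse.
low = break_even_hours(8_000, 150)
high = break_even_hours(15_000, 80)
print(f"Break-even between {low:.0f} and {high:.1f} hours of audio per month")
```

With the ranges quoted above, self-hosting starts to pay off somewhere between roughly 53 and 187.5 hours of audio per month, depending on which server and which API you compare.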
Frequently asked questions
Can I run all this on a phone?
Edge deployment is possible but constrained. AI4Bharat's on-device models work for ASR; TTS on-device is harder but possible with Coqui XTTS exported to mobile. LLM on-device is the bottleneck — even quantised 7B models stretch a mid-range phone. Most production Nepali voice apps in 2026 still call cloud APIs.
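The on-device LLM bottleneck is easy to see from the weights alone: memory is roughly parameter count times bits per weight, divided by eight. The sketch below covers only that arithmetic; `weight_memory_gb` is a hypothetical helper, it uses decimal gigabytes, and it ignores the KV cache and runtime overhead, which add to the total.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: {weight_memory_gb(7, bits):.1f} GB")
```

A 7B model quantised to 4 bits still needs about 3.5 GB for weights alone, which has to fit alongside the operating system and every other app on the phone.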
What's the latency of a full voice-in / voice-out turn?
Best-case streaming: ~1.5 seconds to first audio (300 ms ASR + 500 ms LLM first token + 700 ms TTS first chunk). Non-streaming baseline: 4–7 seconds. For phone-conversation quality you need streaming end-to-end, and you need to start TTS playback before the LLM has finished generating.
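Starting TTS before the LLM finishes usually means cutting the token stream into sentences and sending each one to the TTS engine as soon as it is complete. The sketch below shows one naive way to do that; `sentences_from_tokens` is a hypothetical helper, the token list stands in for a real streaming LLM response, and the splitter will also break on decimal points such as "1.5".

```python
from typing import Iterable, Iterator

# Best-case streaming budget quoted above, in milliseconds.
STREAMING_BUDGET_MS = {"asr": 300, "llm_first_token": 500, "tts_first_chunk": 700}

# Sentence terminators: the Devanagari danda plus common Latin punctuation.
SENTENCE_END = ("।", ".", "?", "!")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its terminator arrives in the stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            positions = [buffer.find(mark) for mark in SENTENCE_END]
            cut = min((p for p in positions if p != -1), default=-1)
            if cut == -1:
                break  # no complete sentence buffered yet
            sentence, buffer = buffer[: cut + 1].strip(), buffer[cut + 1:]
            if sentence:
                yield sentence
    if buffer.strip():  # flush trailing text that has no terminator
        yield buffer.strip()


if __name__ == "__main__":
    print(sum(STREAMING_BUDGET_MS.values()), "ms to first audio")
    fake_stream = ["नम", "स्ते।", " तपाईंको", " ब्यालेन्स", " ५०० छ।"]
    for sentence in sentences_from_tokens(fake_stream):
        print("send to TTS:", sentence)
```

Because the first sentence is released while later tokens are still arriving, playback can begin well before the full reply has been generated.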
Can it handle Nepali dialects beyond standard Kathmandu Nepali?
Whisper handles standard Nepali well, regional dialects poorly. Most Nepalis can code-switch to standard Nepali for a voice interface; for genuinely dialect-only speakers you need targeted ASR fine-tuning, which is a separate project.
Is voice + LLM safe for customer support in a Nepali bank?
Yes for read-only intents (balance inquiry, branch finder) after voice biometric / PIN authentication. No for transaction execution — keep transfers and payments to GUI / typed channels where mistakes are reversible.
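The rule above can be enforced as a small gate between intent detection and any backend call. This is a minimal sketch under stated assumptions: the intent names, the decision labels and the `voice_channel_decision` function are all hypothetical, and sending unrecognised intents to a human agent is our own conservative default rather than something the rule itself requires.

```python
# Hypothetical intent labels produced by the LLM's intent classifier.
READ_ONLY_INTENTS = frozenset({"balance_inquiry", "branch_finder"})
TRANSACTION_INTENTS = frozenset({"transfer", "payment"})


def voice_channel_decision(intent: str, authenticated: bool) -> str:
    """Decide what the voice channel may do with a detected intent."""
    if intent in TRANSACTION_INTENTS:
        # Never execute money movement by voice, even when authenticated.
        return "redirect_to_app"
    if intent in READ_ONLY_INTENTS:
        return "answer" if authenticated else "authenticate_first"
    # Anything unrecognised goes to a human (assumed default).
    return "handoff_to_agent"
```

Keeping the allow-list in code rather than in the prompt means a misbehaving LLM cannot talk its way into executing a transfer.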
When will Nepali voice quality match English?
Conservative estimate: 2027–2028. Frontier models continue to improve multilingual TTS quality; an Indic-specific tier-1 commercial voice would close the gap quickly. For most Nepali use cases that need voice today, the current quality is "acceptable", not "delightful".
Who can build this in Nepal?
Praxium Labs, Nepal's AI and automation consultancy based in Lalitpur, designs and builds the systems described in this guide for Nepali businesses and for international teams hiring from Nepal.