Building a Nepali Voice Assistant in 2026: Real Architecture and Limitations

From Praxium Labs — Nepal's AI and automation consultancy in Lalitpur. We design and build the systems described in this guide for Nepali businesses and for international teams operating from Nepal.

At Praxium Labs we build this for Nepali businesses every month; this is the field-tested version. Voice is the dream interface for older Nepali users and for hands-busy contexts (drivers, field workers). The technology is finally ready for some use cases and clearly not ready for others.

The four-stage pipeline

1. Voice Activity Detection (VAD): detect when the user starts and stops talking — Silero VAD works fine for Nepali
2. Speech-to-text (ASR): convert audio to Nepali / English text
3. LLM brain: understand intent, generate response
4. Text-to-speech (TTS): synthesise the response back as audio
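
The four stages above can be wired together as plain functions. The following is a minimal sketch rather than a production design: `VoicePipeline`, `run_turn`, and the stub stages are hypothetical names, and each stub stands in for a real engine (for example Silero VAD, Whisper, an LLM API, and a TTS engine).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stage signatures: audio is raw bytes, text is str.
Vad = Callable[[bytes], Optional[bytes]]  # returns the speech segment, or None for silence
Asr = Callable[[bytes], str]
Llm = Callable[[str], str]
Tts = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    vad: Vad
    asr: Asr
    llm: Llm
    tts: Tts

    def run_turn(self, audio: bytes) -> Optional[bytes]:
        """Run one voice-in / voice-out turn; return None if no speech was detected."""
        speech = self.vad(audio)
        if speech is None:
            return None
        transcript = self.asr(speech)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stub stages so the sketch runs without any model installed.
def stub_vad(audio: bytes) -> Optional[bytes]:
    # Treat all-zero audio as silence (an assumption for illustration only).
    return audio if audio.strip(b"\x00") else None


def stub_asr(speech: bytes) -> str:
    return speech.decode("utf-8")


def stub_llm(text: str) -> str:
    return f"You said: {text}"


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stub_vad, stub_asr, stub_llm, stub_tts)
```

Keeping each stage behind a simple callable makes it easy to swap one ASR or TTS engine for another without touching the rest of the turn logic.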

ASR options in May 2026

OpenAI Whisper Large v3: word error rate (WER) of 12–18% on clean Nepali audio; free and self-hostable, with roughly 3 seconds of processing per minute of audio on a single GPU. The current default for most production Nepali ASR.
AssemblyAI: commercial API, WER ~13% on Nepali, multilingual, real-time streaming. Pay-per-second, ~NPR 80/hour of audio.
Google Cloud Speech-to-Text: supports Nepali, WER ~15%, real-time streaming. Pay-per-second.
Local Indic-specific models: AI4Bharat's IndicConformer, or compact models served through the k2-fsa sherpa-ncnn runtime, run on edge devices. WER on Nepali is higher than Whisper's, but there is no cloud dependency.
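
The WER figures quoted above are only comparable if they are measured the same way on your own audio. As a minimal sketch, WER can be computed as the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The function name `word_error_rate` is ours, and the sketch assumes whitespace tokenisation with no text normalisation (punctuation, Unicode forms, and numerals are left untouched).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
example_wer = word_error_rate("मेरो खातामा कति पैसा छ", "मेरो खाता कति पैसा छ")
```

Running this over a few hundred of your own recordings, rather than relying on published numbers, is usually the quickest way to compare engines for a specific accent mix and microphone setup.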

TTS options and the accent problem

ElevenLabs Multilingual v2: supports Nepali; voice quality is high but accent is audibly non-native
Coqui XTTS-v2: open-source, Nepali via fine-tuning, fair quality
Google Cloud TTS: Nepali voices (Standard quality), robotic-sounding
Indic-TTS / AI4Bharat: Nepali voice models from IIT Madras research; weaker quality but natural intonation

For customer-facing voice in 2026 we typically combine two engines: ElevenLabs for English replies (when the user spoke English) and Indic-TTS for Nepali replies (when the user spoke Nepali). Switching voices between languages feels less jarring than a single voice that handles both badly.
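
One simple way to implement that routing is to look at the script of the ASR transcript. The sketch below is an assumption-laden illustration: it presumes the ASR emits Nepali in Devanagari and English in Latin script (romanised Nepali would be misrouted), the 0.5 threshold is arbitrary, and the returned strings are just labels, not calls to either vendor's API.

```python
def is_devanagari(ch: str) -> bool:
    # Devanagari Unicode block, including vowel signs and the virama.
    return "\u0900" <= ch <= "\u097F"


def is_latin_letter(ch: str) -> bool:
    return ch.isascii() and ch.isalpha()


def devanagari_ratio(text: str) -> float:
    """Share of Devanagari characters among Devanagari and Latin characters."""
    devanagari = sum(1 for ch in text if is_devanagari(ch))
    latin = sum(1 for ch in text if is_latin_letter(ch))
    total = devanagari + latin
    return devanagari / total if total else 0.0


def choose_tts_engine(user_transcript: str, threshold: float = 0.5) -> str:
    """Return a label for the TTS engine matching the language the user spoke."""
    if devanagari_ratio(user_transcript) >= threshold:
        return "indic-tts"
    return "elevenlabs"
```

For example, `choose_tts_engine("मेरो ब्यालेन्स कति छ")` selects the Nepali engine and `choose_tts_engine("What is my balance")` selects the English one. Code-switched sentences fall to whichever script dominates.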

Where voice works today (Nepal)

IVR replacement for hotlines: instead of "Press 1 for English", ask the caller what they need by voice and route them to the right agent
Banking voice OTP / balance inquiry: controlled scripts, predictable phrasing
Field-worker reporting: agricultural extension workers reporting visit notes by voice; ASR transcribes, LLM structures into a form
Voice-driven local-information kiosks: tourism info booths, government office signposting
Hands-busy logistics (driver dispatch): "where is order #1234" voice-queried, voice-answered
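
For the field-worker reporting case above, the LLM's structured output should be validated before it is written to a form. The following is a sketch under stated assumptions: the LLM has been prompted to return a JSON object, and the `VisitNote` fields and the sample values in the usage note are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class VisitNote:
    farmer_name: str
    village: str
    crop: str
    issue: str


REQUIRED_FIELDS = ("farmer_name", "village", "crop", "issue")


def parse_visit_note(llm_output: str) -> VisitNote:
    """Parse and validate the JSON the LLM produced from a spoken visit note."""
    data = json.loads(llm_output)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    # A field counts as missing if it is absent, not a string, or blank.
    missing = [
        name
        for name in REQUIRED_FIELDS
        if not isinstance(data.get(name), str) or not data[name].strip()
    ]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")

    return VisitNote(**{name: data[name].strip() for name in REQUIRED_FIELDS})
```

Calling `parse_visit_note('{"farmer_name": "Sita", "village": "Dhulikhel", "crop": "maize", "issue": "armyworm"}')` returns a populated `VisitNote`, while an incomplete object raises `ValueError`, which the app can turn into a spoken follow-up question for the missing detail.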

Where voice still does not work

General-purpose consumer voice assistants: Siri-quality Nepali experience is not there yet
Long-form transcription with mixed accents: in a meeting that mixes Eastern Nepal and Kathmandu accents, WER spikes
Children's voices: ASR struggles with high-pitched non-adult speech in any language
Noisy backgrounds: on a busy Kathmandu street, WER rises above 30%
Other languages of Nepal: Bhojpuri, Maithili, Tamang, Newari, and Tharu are separate languages rather than dialects of Nepali, and are mostly out of scope

Costs

ASR (Whisper self-host): small GPU server ~NPR 8,000–15,000/month
ASR (commercial): ~NPR 80–150 per hour of audio
LLM brain: NPR 1.4–3.2 per conversation (see pricing breakdown)
TTS: ElevenLabs ~NPR 22 per minute of generated audio; Google Cloud cheaper, Indic-TTS free
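
The ASR figures above imply a break-even point between self-hosting and a commercial API. The arithmetic below is a rough sketch that uses only the price ranges quoted in this section. It ignores engineering time, idle GPU capacity, and accuracy differences, and the function name is ours.

```python
def break_even_hours(self_host_monthly_npr: float, commercial_npr_per_hour: float) -> float:
    """Hours of audio per month at which self-hosting costs the same as the API."""
    return self_host_monthly_npr / commercial_npr_per_hour


# Cheapest server against the priciest API: self-hosting pays off soonest.
low = break_even_hours(8_000, 150)
# Priciest server against the cheapest API: self-hosting pays off latest.
high = break_even_hours(15_000, 80)

print(f"Break-even: {low:.1f} to {high:.1f} hours of audio per month")
```

On these figures, self-hosting starts to pay off somewhere between roughly 53 and 188 hours of audio per month. Below that volume, the commercial API is the cheaper option on raw infrastructure cost.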

Frequently asked questions

Can I run all this on a phone?

Edge deployment is possible but constrained. AI4Bharat's on-device models work for ASR; TTS on-device is harder but possible with Coqui XTTS exported to mobile. LLM on-device is the bottleneck — even quantised 7B models stretch a mid-range phone. Most production Nepali voice apps in 2026 still call cloud APIs.
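
A back-of-the-envelope estimate shows why the LLM is the bottleneck. The sketch below counts model weights only; the KV cache, the runtime, and everything else on the phone need memory on top of this. The 4-bit quantisation level is an assumption for illustration, and the function name is ours.

```python
def weights_gib(params_billions: float, bits_per_weight: int) -> float:
    """Memory needed for the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


quantised_4bit = weights_gib(7, 4)   # about 3.26 GiB
half_precision = weights_gib(7, 16)  # about 13.04 GiB
```

Even at 4 bits per weight, a 7B model needs over 3 GiB for weights before any audio, OS, or app memory is counted.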

What's the latency of a full voice-in / voice-out turn?

Best-case streaming: ~1.5 seconds (300ms ASR + 500ms LLM first-token + 700ms TTS first-chunk). Non-streaming baseline: 4–7 seconds. For phone-conversation quality you need streaming end-to-end and you need to start TTS playback before the LLM has finished generating.
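
Starting TTS before the LLM has finished usually means cutting the token stream into sentences and sending each one to TTS as soon as it is complete. The sketch below shows one way to do that, together with the latency budget from the answer above. `sentences_from_tokens` is a hypothetical helper; it splits on `.`, `?`, `!`, and the Devanagari danda `।`, so it would wrongly split decimals such as "3.5" and needs refinement for real traffic.

```python
from typing import Iterable, Iterator

# Latency budget from the text, in milliseconds.
BUDGET_MS = {"asr": 300, "llm_first_token": 500, "tts_first_chunk": 700}
total_ms = sum(BUDGET_MS.values())

SENTENCE_END = {".", "?", "!", "।"}


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of LLM tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            # Position of the first sentence-ending character, or -1 if none yet.
            idx = next((i for i, ch in enumerate(buffer) if ch in SENTENCE_END), -1)
            if idx == -1:
                break
            sentence, buffer = buffer[: idx + 1].strip(), buffer[idx + 1 :]
            if sentence:
                yield sentence
    # Flush whatever is left when the LLM stops generating.
    tail = buffer.strip()
    if tail:
        yield tail


stream = ["तपाईंको ", "ब्यालेन्स ", "५०० ", "रुपैयाँ ", "छ। ", "अरू ", "केही", "?"]
chunks = list(sentences_from_tokens(stream))
```

With this approach the first sentence can be synthesised and played while the LLM is still producing the second, which is what keeps the perceived delay close to the first-chunk budget rather than the full generation time.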

Can it handle Nepali dialects beyond standard Kathmandu Nepali?

Whisper handles standard Nepali well, regional dialects poorly. Most Nepalis can code-switch to standard Nepali for a voice interface; for genuinely dialect-only speakers you need targeted ASR fine-tuning, which is a separate project.

Is voice + LLM safe for customer support in a Nepali bank?

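Yes for read-only intents (balance inquiry, branch finder) after voice biometric / PIN authentication. No for transaction execution: keep transfers and payments on GUI / typed channels, where the customer can review the details on screen before confirming and a misheard amount or recipient is caught before it is committed.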
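
That policy can be enforced in code with an explicit allow-list, so the LLM never decides on its own whether an action is permitted. This is a minimal sketch: the intent names and the returned action labels are hypothetical, and a real deployment would sit behind the bank's own authentication and audit systems.

```python
READ_ONLY_INTENTS = {"balance_inquiry", "branch_finder"}
TRANSACTION_INTENTS = {"transfer", "payment"}


def route_intent(intent: str, authenticated: bool) -> str:
    """Decide what the voice channel may do with a classified intent."""
    if intent in TRANSACTION_INTENTS:
        # Never execute transactions by voice, even for authenticated callers.
        return "redirect_to_app"
    if intent in READ_ONLY_INTENTS:
        return "answer_by_voice" if authenticated else "request_authentication"
    # Anything unrecognised goes to a human rather than being guessed at.
    return "handoff_to_agent"
```

Because unknown intents fall through to a human agent, a misclassification degrades to a slower experience rather than an unintended action.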

When will Nepali voice quality match English?

Conservative estimate: 2027–2028. Frontier models continue to improve multilingual TTS quality; an Indic-specific tier-1 commercial voice would close the gap quickly. For most Nepali use cases that need voice today, the current quality is "acceptable", not "delightful".

About Praxium Labs

Praxium Labs is Nepal's AI and automation consultancy, based in Lalitpur, Nepal. We help Nepali businesses — and international teams operating from Nepal — ship AI chatbots, n8n workflow automations, machine-learning systems, web and mobile applications, cloud infrastructure, and DevOps pipelines that work in Nepal's real conditions: NPR pricing, eSewa / Khalti / Fonepay integrations, NRB / IRD / SSF compliance, Devanagari language handling, and the network and talent realities most international playbooks miss.

This guide was written by the Praxium Labs engineering team from direct production experience deploying systems for Nepali banks, e-commerce, hospitality, healthcare, NGOs, and startups. If you need this implemented for your team, talk to us for a free 30-minute scoping call — or browse our full services.