RAG for Nepali Knowledge Bases (2026)

From Praxium Labs — Nepal's AI and automation consultancy in Lalitpur. We design and build the systems described in this guide for Nepali businesses and for international teams operating from Nepal.

RAG is the single highest-leverage technique for any Nepali AI application that has to be factual: chatbots, search, document Q&A. English-centric tutorials online get most things right and one important thing wrong: chunking and retrieval tuning differ for Nepali.

What RAG actually is

Retrieval-Augmented Generation: instead of hoping the LLM knows the answer from training, you (1) embed your knowledge base into vectors at index time, (2) at query time embed the user question, (3) retrieve the top-k most similar chunks, (4) pass them as context to the LLM along with the question. The LLM's answer is grounded in your actual content, not its memory.
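
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the three-dimensional vectors are hand-written stand-ins for real embedding-model output, and the helper names top_k_chunks and build_prompt are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query_vec, index, k=3):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

def build_prompt(question, chunks):
    """Assemble the retrieved chunks and the question into one grounded prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Step 1 (index time): toy vectors standing in for real embeddings.
index = [
    {"text": "Branch opening hours are 10:00 to 17:00.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Refunds are processed within 7 days.", "vector": [0.1, 0.9, 0.1]},
    {"text": "Delivery inside the valley takes 2 days.", "vector": [0.0, 0.2, 0.9]},
]
query_vec = [0.2, 0.95, 0.05]                    # step 2: stand-in for the embedded question
chunks = top_k_chunks(query_vec, index, k=2)     # step 3: retrieve the most similar chunks
prompt = build_prompt("How long do refunds take?", chunks)  # step 4: context for the LLM
```

In a real deployment the vectors would come from one of the embedding models discussed in this guide, and prompt would be sent to the LLM of your choice.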

The five-stage pipeline

Document loading: PDF, Word, HTML, Markdown, Notion, Sheets → raw text
Chunking: split documents into semantic chunks ~300–800 tokens each
Embedding: each chunk → a vector (typically 768 to 3,072 dimensions, depending on the model) via an embedding model
Storage: vectors + metadata in a vector database (pgvector / Qdrant / Pinecone)
Retrieval + generation: at query time, embed the query, find top-k similar chunks, send to LLM

Embedding model selection for Nepali

OpenAI text-embedding-3-large: handles Devanagari well; 3,072-dim; $0.13 / M tokens
OpenAI text-embedding-3-small: still good Nepali, cheaper; 1,536-dim; $0.02 / M tokens
Cohere embed-multilingual-v3: excellent on Indic languages, good code-switch handling; 1,024-dim
Jina embeddings v3: open weights, strong multilingual, runs on a single GPU
Voyage AI voyage-multilingual-2: high quality, slightly more expensive

Avoid English-centric models such as all-MiniLM and the older ada-002. They silently underperform on Nepali: embeddings cluster almost at random and your retrieval becomes guesswork.

Chunking Devanagari content

Off-the-shelf chunkers split on the English full-stop (.) and miss the Nepali full-stop (।, danda). Result: a Nepali passage is treated as one long sentence and gets cut at arbitrary points, which hurts retrieval. Two fixes:

Configure your splitter to recognise । and ॥ as sentence delimiters
For mixed Devanagari + Romanised content, use a "sentence-aware" splitter (LangChain's RecursiveCharacterTextSplitter with custom separators works)
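
As a rough illustration of the first fix, here is a standard-library sketch of a sentence splitter that treats । and ॥ as delimiters alongside Latin punctuation. The function name split_sentences and the sample text are assumptions made for this example. The pattern is deliberately simple and will mis-split abbreviations such as "Rs. 500".

```python
import re

# Split on whitespace that follows a danda (।), double danda (॥),
# or Latin sentence punctuation. The delimiter stays with its sentence.
SENTENCE_END = re.compile(r"(?<=[।॥.!?])\s+")

def split_sentences(text):
    """Split mixed Devanagari / Latin text into sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

sample = "हाम्रो कार्यालय ललितपुरमा छ। Opening hours are 10 to 5. फिर्ता ७ दिनभित्र हुन्छ।"
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

The same delimiters can be passed as custom separators to a library splitter if you would rather not maintain your own.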

Chunk size sweet spot

For Nepali content, chunks of 400–600 Devanagari characters (~100–150 words) work well. Smaller chunks lose context; larger chunks dilute relevance. Always overlap chunks by ~20% to avoid losing sentences split at chunk boundaries.
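
One possible way to apply these numbers is to pack whole sentences into chunks up to a character limit and carry the trailing sentences of each chunk into the next. The sketch below assumes the text has already been split into sentences. chunk_sentences and its parameters are hypothetical names, and the defaults simply mirror the 600-character and 20% figures above.

```python
def chunk_sentences(sentences, max_chars=600, overlap_ratio=0.2):
    """Greedily pack sentences into chunks of at most max_chars characters.

    Each new chunk starts with whole trailing sentences of the previous chunk,
    up to about overlap_ratio * max_chars characters of carry-over.
    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], []
    overlap_budget = int(max_chars * overlap_ratio)
    for sentence in sentences:
        # Length of current + sentence once joined with single spaces.
        candidate_len = sum(len(s) for s in current) + len(current) + len(sentence)
        if current and candidate_len > max_chars:
            chunks.append(" ".join(current))
            # Carry over whole trailing sentences that fit in the overlap budget.
            carried, used = [], 0
            for prev in reversed(current):
                if used + len(prev) > overlap_budget:
                    break
                carried.insert(0, prev)
                used += len(prev) + 1
            # Drop the overlap if it would push the next chunk over the limit.
            if sum(len(s) for s in carried) + len(carried) + len(sentence) > max_chars:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Ten short dummy sentences, with a small limit so the overlap is visible.
demo = [f"s{i}" + "x" * 8 + "।" for i in range(10)]
chunks = chunk_sentences(demo, max_chars=60, overlap_ratio=0.2)
```

Overlapping on sentence boundaries rather than raw character offsets means no chunk begins or ends mid-sentence.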

Vector database choice for Nepal

pgvector (Postgres extension): our default for SME-scale deployments. Free, runs alongside your existing Postgres, fast enough up to ~5M vectors
Qdrant: self-hostable open source, more performant at scale, JSON-based filter language
Pinecone: managed service, scales far beyond SME needs, ~$70/month entry tier. Useful when you do not want to run a database
Weaviate: open source, GraphQL native, more complex setup

Hybrid retrieval (the upgrade most teams skip)

Pure vector search misses queries that hinge on rare words (a SKU code, a person's name, an account number). Hybrid retrieval runs vector similarity and keyword (BM25-style) search side by side and merges the two result sets. The quality bump is real: typically a 15–25% improvement in retrieval recall in our deployments. pgvector plus Postgres full-text search makes this easy (note that Postgres ranks with ts_rank rather than true BM25); Qdrant has a built-in hybrid mode.
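
One common way to merge the vector and keyword result lists is reciprocal rank fusion (RRF). It is shown here as an illustration and is not necessarily the merging method used in the deployments mentioned above. The chunk IDs are made up, and k=60 is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one fused ranking.

    Each list contributes 1 / (k + rank) per chunk (rank starts at 1),
    so chunks found by more than one retriever accumulate a higher score.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

vector_hits = ["c7", "c2", "c9"]    # best-first results from vector similarity
keyword_hits = ["c4", "c7", "c1"]   # best-first results from keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because RRF works on ranks rather than raw scores, it avoids having to normalise cosine similarities against keyword scores.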

Evaluation: do not skip

Build a test set of 50–100 question-and-correct-answer pairs from your actual domain. After every model / chunking / threshold change, run the test set and record retrieval recall@5 (did the right chunk make it into the top 5) and answer accuracy. Without this, you "improve" by vibes and silently regress. For related context, see our Building AI Chatbots for Nepali Customer Support (2026 Engineering Guide) post.
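
Retrieval recall@5 is simple to compute once the test set exists. The sketch below assumes each test case records the ID of the chunk that contains the correct answer. The field names, the retrieve callable, and the canned results are hypothetical placeholders for your own retriever.

```python
def recall_at_k(test_set, retrieve, k=5):
    """Fraction of questions whose expected chunk appears in the top-k results.

    test_set: list of {"question": str, "expected_chunk_id": str}
    retrieve: callable(question, k) -> list of chunk IDs, best first
    """
    if not test_set:
        return 0.0
    hits = sum(
        1 for case in test_set
        if case["expected_chunk_id"] in retrieve(case["question"], k)[:k]
    )
    return hits / len(test_set)

def fake_retrieve(question, k):
    """Stand-in retriever returning canned results for the demo."""
    canned = {
        "How long do refunds take?": ["c2", "c9", "c4"],
        "What are the opening hours?": ["c5", "c8"],
    }
    return canned.get(question, [])[:k]

test_set = [
    {"question": "How long do refunds take?", "expected_chunk_id": "c9"},
    {"question": "What are the opening hours?", "expected_chunk_id": "c1"},
]
score = recall_at_k(test_set, fake_retrieve, k=5)
```

Logging this number alongside answer accuracy after each change makes regressions visible immediately.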

Frequently asked questions

Can I do RAG without a vector database?

For a small knowledge base (<1,000 chunks) yes — embed once, store in a JSON file, brute-force cosine similarity at query time. Above that, use pgvector or Qdrant; brute force gets slow.

How often do I re-embed?

Re-embed any document that changes. Schedule a nightly job that diffs your knowledge source and re-embeds only changed documents. Full re-embed only when you change the embedding model.
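
A nightly diff can be as simple as comparing content hashes. The following is a sketch under the assumption that you persist one hash per document between runs. plan_reembedding, the document IDs, and the storage format are all hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reembedding(documents, stored_hashes):
    """Decide which documents need work since the last run.

    documents: {doc_id: current text}
    stored_hashes: {doc_id: hash recorded at the previous run}
    Returns (to_embed, to_delete, new_hashes).
    """
    new_hashes = {doc_id: content_hash(text) for doc_id, text in documents.items()}
    # New or changed documents: hash is missing or differs from the stored one.
    to_embed = sorted(d for d, h in new_hashes.items() if stored_hashes.get(d) != h)
    # Documents that disappeared from the source: their vectors should be removed.
    to_delete = sorted(d for d in stored_hashes if d not in new_hashes)
    return to_embed, to_delete, new_hashes

previous = {
    "faq": content_hash("v1"),
    "pricing": content_hash("old price"),
    "retired-page": content_hash("x"),
}
current_docs = {"faq": "v1", "pricing": "new price", "terms": "new doc"}
to_embed, to_delete, new_hashes = plan_reembedding(current_docs, previous)
```

After the embed and delete steps succeed, new_hashes replaces the stored hashes for the next run.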

Does RAG help with multilingual queries?

Multilingual embedding models cluster semantically similar concepts across languages — a Devanagari query can retrieve relevant English chunks and vice-versa. Useful for cases where your knowledge base is partially in English (technical specs) and customers ask in Nepali.

What's the cost?

Embedding 1 million words of Nepali content via OpenAI text-embedding-3-small: ~NPR 200. Storage in pgvector: free with your Postgres. Query-time cost: ~NPR 0.02 per query (negligible). The expensive part is the LLM call after retrieval, not retrieval itself.

When does RAG fail?

Three common failure modes: (1) the right chunk exists but retrieval misses it (fix: hybrid retrieval, better chunking), (2) right chunk retrieved but LLM ignores it (fix: stronger grounding in system prompt), (3) right answer needs synthesis across many chunks (fix: agentic multi-step retrieval). Always evaluate end-to-end, not just retrieval.

About Praxium Labs

Praxium Labs is Nepal's AI and automation consultancy, based in Lalitpur, Nepal. We help Nepali businesses — and international teams operating from Nepal — ship AI chatbots, n8n workflow automations, machine-learning systems, web and mobile applications, cloud infrastructure, and DevOps pipelines that work in Nepal's real conditions: NPR pricing, eSewa / Khalti / Fonepay integrations, NRB / IRD / SSF compliance, Devanagari language handling, and the network and talent realities most international playbooks miss.

This guide was written by the Praxium Labs engineering team from direct production experience deploying systems for Nepali banks, e-commerce, hospitality, healthcare, NGOs, and startups. If you need this implemented for your team, talk to us for a free 30-minute scoping call — or browse our full services.
