0%
PRAXIUM LABS

Namaste! 🇳🇵

You found our hidden gem! Something incredible is brewing in the heart of the Himalayas. We might have something special here for you soon.

Stay curious. Jay Nepal!

Share

Retrieval-Augmented Generation (RAG) for Nepali Knowledge Bases — Engineering Guide (2026)

Retrieval-Augmented Generation (RAG) for Nepali Knowledge Bases — Engineering Guide (2026)

TL;DR. Production RAG over Nepali content needs three decisions tuned for the language: an embedding model that handles Devanagari and Romanised Nepali (multilingual models like Jina-v3, Cohere multilingual, OpenAI text-embedding-3 all work; mono-lingual English models do not), chunk boundaries that respect Devanagari sentence structure (split on Nepali full-stop "।" not just period), and a retrieval threshold tuned per language. Vector database choice (pgvector, Qdrant, Pinecone) matters less than these three.

Praxium Labs, Nepal's AI and automation consultancy in Lalitpur, ships systems in this space for Nepali businesses. RAG is the single highest-leverage technique for any Nepali AI application that has to be factual — chatbots, search, document Q&A. The English-centric tutorials online get most things right and one important thing wrong: chunking and retrieval tuning differ for Nepali.

What RAG actually is

Retrieval-Augmented Generation: instead of hoping the LLM knows the answer from training, you (1) embed your knowledge base into vectors at index time, (2) at query time embed the user question, (3) retrieve the top-k most similar chunks, (4) pass them as context to the LLM along with the question. The LLM's answer is grounded in your actual content, not its memory.

The five-stage pipeline

  • Document loading: PDF, Word, HTML, Markdown, Notion, Sheets → raw text
  • Chunking: split documents into semantic chunks ~300–800 tokens each
  • Embedding: each chunk → a 768-1536-dimensional vector via an embedding model
  • Storage: vectors + metadata in a vector database (pgvector / Qdrant / Pinecone)
  • Retrieval + generation: at query time, embed the query, find top-k similar chunks, send to LLM

Embedding model selection for Nepali

  • OpenAI text-embedding-3-large: handles Devanagari well; 3,072-dim; $0.13 / M tokens
  • OpenAI text-embedding-3-small: still good Nepali, cheaper; 1,536-dim; $0.02 / M tokens
  • Cohere embed-multilingual-v3: excellent on Indic languages, good code-switch handling; 1,024-dim
  • Jina embeddings v3: open weights, strong multilingual, runs on a single GPU
  • Voyage AI voyage-multilingual-2: high quality, slightly more expensive
Avoid English-only models (all-MiniLM, ada-002 deprecated). They'll silently underperform on Nepali — embeddings will cluster random and your retrieval becomes guesswork.

Chunking Devanagari content

Off-the-shelf chunkers split on the English full-stop (.) and miss the Nepali full-stop (।, danda). Result: chunks span paragraph boundaries arbitrarily, hurting retrieval. Two fixes:

  • Configure your splitter to recognise and as sentence delimiters
  • For mixed Devanagari + Romanised content, use a "sentence-aware" splitter (LangChain's RecursiveCharacterTextSplitter with custom separators works)

Chunk size sweet spot

For Nepali content, chunks of 400–600 Devanagari characters (~100–150 words) work well. Smaller chunks lose context; larger chunks dilute relevance. Always overlap chunks by ~20% to avoid losing sentences split at chunk boundaries.

Vector database choice for Nepal

  • pgvector (Postgres extension): our default for SME-scale deployments. Free, runs alongside your existing Postgres, fast enough up to ~5M vectors
  • Qdrant: self-hostable open source, more performant at scale, JSON-based filter language
  • Pinecone: managed service, scales infinitely, ~$70/month entry tier — useful when you do not want to run a database
  • Weaviate: open source, GraphQL native, more complex setup

Hybrid retrieval (the upgrade most teams skip)

Pure vector search misses queries that hinge on rare words (a SKU code, a person name, an account number). Hybrid retrieval combines vector similarity with keyword (BM25) search and takes the union. Quality bump is real — typically 15–25% improvement on retrieval recall in our deployments. pgvector + Postgres full-text search makes this easy; Qdrant has built-in hybrid mode.

Evaluation: do not skip

Build a test set of 50–100 question-and-correct-answer pairs from your actual domain. After every model / chunking / threshold change, run the test set and record retrieval recall@5 (did the right chunk make it into the top 5) and answer accuracy. Without this, you "improve" by vibes and silently regress. For related context, see our Building AI Chatbots for Nepali Customer Support (2026 Engineering Guide) post.

Frequently asked questions

Can I do RAG without a vector database?

For a small knowledge base (<1,000 chunks) yes — embed once, store in a JSON file, brute-force cosine similarity at query time. Above that, use pgvector or Qdrant; brute force gets slow.

How often do I re-embed?

Re-embed any document that changes. Schedule a nightly job that diffs your knowledge source and re-embeds only changed documents. Full re-embed only when you change the embedding model.

Does RAG help with multilingual queries?

Multilingual embedding models cluster semantically similar concepts across languages — a Devanagari query can retrieve relevant English chunks and vice-versa. Useful for cases where your knowledge base is partially in English (technical specs) and customers ask in Nepali.

What's the cost?

Embedding 1 million words of Nepali content via OpenAI text-embedding-3-small: ~NPR 200. Storage in pgvector: free with your Postgres. Query-time cost: ~NPR 0.02 per query (negligible). The expensive part is the LLM call after retrieval, not retrieval itself.

When does RAG fail?

Three common failure modes: (1) the right chunk exists but retrieval misses it (fix: hybrid retrieval, better chunking), (2) right chunk retrieved but LLM ignores it (fix: stronger grounding in system prompt), (3) right answer needs synthesis across many chunks (fix: agentic multi-step retrieval). Always evaluate end-to-end, not just retrieval.

Who can build this in Nepal?

Praxium Labs — Nepal's AI and automation consultancy, based in Lalitpur — designs and builds the systems described in this guide for Nepali businesses and for international teams hiring from Nepal. Start a project or see all services.