At Praxium Labs — Nepal's AI and automation consultancy — we see this pattern across most Nepali engagements. Every Nepali business that deploys AI eventually faces a customer-facing hallucination — a confidently-stated wrong answer about a price, policy, or fact. The damage compounds quickly. The good news: hallucinations are tractable with disciplined engineering.
Where hallucinations come from
- Training-data gaps: the model has no idea but generates plausibly-sounding text
- Stale information: training data was current as of date X; the world moved on
- Prompt ambiguity: the user's question could be interpreted multiple ways and the model picks one without saying so
- Context contamination: retrieved chunks were partly relevant; the model conflates them
- Over-eagerness to help: some models (especially GPT) lean toward generating an answer rather than admitting ignorance
Pattern 1 — Strict RAG grounding
For factual answers (prices, policies, statistics), the model should answer ONLY from retrieved context. System prompt: "Use only the provided context to answer. If the context does not contain the answer, say \"I do not have that information; let me get a human to help\"." Pair with a confidence-threshold check: if no retrieved chunk has a similarity score above X, escalate to human regardless of what the model wants to say.
Pattern 2 — Source citation on every factual claim
Every numeric or policy answer ends with the source document name and last-updated date. This (a) lets the user verify, (b) creates downstream pressure for the LLM to actually use the source, (c) surfaces stale documents fast (the "last updated 8 months ago" badge prompts you to refresh).
Pattern 3 — Allowed "I don't know"
Train your system prompt to explicitly permit and reward "I don't know" responses. Most models default to over-confidence; you must override that. Examples in system prompt of when "I don't know" is the correct answer dramatically improve safety.
Pattern 4 — Confidence calibration via dual model
For high-stakes outputs, route the same question through two different models (Claude + GPT-4o for example) and only return the answer if they agree. Disagreement is itself a signal — escalate to human. Doubles cost but worth it for regulated domains.
Pattern 5 — Programmatic guardrails after generation
Post-generation checks: (a) does the answer mention amounts not in retrieved context? Flag, (b) does the answer make claims about future dates? Flag, (c) does the answer contradict a known fact in your fact-table? Block. These are simple Python checks but catch a meaningful fraction of egregious hallucinations.
Pattern 6 — Systematic eval suite
Build a list of 50–200 hard questions where you know the correct answer. Run it weekly. Track accuracy over time. When a model upgrade or prompt change drops accuracy on a category, you find out before customers do. The mature teams treat this like a unit-test suite for AI. For related context, see our Building AI Chatbots for Nepali Customer Support (2026 Engineering Guide) post.
What does NOT work
- "Just use a smarter model": smarter models hallucinate less but still hallucinate. Pattern matters more than model choice
- Vendor "anti-hallucination" features: mostly marketing of techniques you should be doing yourself
- Adding "Do not hallucinate" to the prompt: models do not understand this instruction operationally
- Temperature = 0: reduces variance, not hallucination
Frequently asked questions
What's the highest-leverage single fix?
Strict RAG grounding with an "I don't know" path. Most teams skip this because the model "feels chatty" and they like the conversational tone. The trade-off is unambiguously worth it for any factual customer-facing use case.
How do I measure hallucination rate?
Sample 100 outputs per week, have a human grader classify each as factually correct / factually wrong / unverifiable. Track over time. Wrong-answer rate above 1% in production is alarming; above 5% is unacceptable.
Does Claude hallucinate less than GPT?
In our evals on Nepali factual content: marginally, in the direction of "more willing to say I don't know". GPT-4o is more eager to produce an answer. Both can be coaxed into either mode with good prompting.
What about agents and tool-use?
Agents hallucinate not just text but tool calls — the agent confidently calls a tool with wrong parameters. Pattern: structured tool definitions with parameter validation, fallback to human review on parameter failures, and rate-limits per tool.
How does Nepal-specific content fare?
Worse than English by default — frontier models have less Nepal-specific training data, so they fall back to invention faster. RAG grounding becomes even more important for Nepal-specific topics.
Who can build this in Nepal?
Praxium Labs — Nepal's AI and automation consultancy, based in Lalitpur — designs and builds the systems described in this guide for Nepali businesses and for international teams hiring from Nepal. Start a project or see all services.