Lesson 9: When RAG Works (and When It Doesn’t)
The Honest Conversation
RAG is powerful. But it’s not magic, and pretending otherwise leads to disappointment and failed projects. In this lesson, we’ll have an honest conversation about when RAG shines, when it struggles, and how to tell whether your RAG system is actually working well. Understanding limitations is what separates effective practitioners from people who blame the technology when things go wrong.
Core Concepts
When RAG Shines
RAG is at its best in specific conditions. Recognizing these helps you identify good use cases:
- You have good source material: RAG retrieves and uses what’s in your knowledge base. If your documents are accurate, well-written, and comprehensive, RAG can surface that quality. If they’re not, no retrieval system will save you.
- Questions have answers in the documents: RAG finds and presents existing information. It doesn’t create new knowledge. If the answer to a question is somewhere in your knowledge base, RAG can find it. If it isn’t, RAG can only acknowledge the gap.
- Users ask relatively specific questions: “What’s the refund policy?” works great. “Tell me everything about your company” is too broad; RAG won’t know what to retrieve.
- Accuracy and grounding matter: When you need responses tied to authoritative sources rather than general AI knowledge, RAG provides that grounding. The ability to cite sources is a genuine advantage.
- Information is relatively stable: If your knowledge base doesn’t change every hour, RAG can provide consistent, reliable answers based on that stable foundation.
When RAG Struggles
RAG has real limitations. Knowing them helps you avoid frustration:
- Knowledge bases are messy or low-quality: Garbage in, garbage out. If your documents are outdated, contradictory, poorly organized, or incomplete, RAG will retrieve and use that problematic content.
- Questions require reasoning beyond the documents: RAG finds information; it doesn’t perform novel analysis. “Based on these market trends, what should our strategy be?” requires reasoning that goes beyond retrieval.
- Questions are too vague or too broad: “Help me with my project” gives retrieval nothing to work with. Vague questions lead to vague retrieval, which leads to vague answers.
- The answer requires combining many sources: If answering a question requires synthesizing information from 20 different documents in a complex way, retrieval might not surface all the pieces, and generation might struggle with the synthesis.
- Real-time or rapidly changing information: RAG knowledge bases need to be updated. If information changes by the minute (stock prices, live events), RAG may not keep up.
- Questions are adversarial or tricky: Users trying to manipulate the system or asking deliberately confusing questions can sometimes get RAG to behave unexpectedly.
The Needle in a Haystack Problem
Here’s a specific limitation worth understanding: as knowledge bases get very large, finding the exact right information becomes harder. Imagine searching for one specific fact in a million documents. Even with good embeddings and semantic search, the right chunk might not surface in the top results. It’s there, but it’s lost in a sea of somewhat-related content. Implications:
- Smaller, focused knowledge bases often outperform massive ones
- Curation matters; don’t include everything just because you can
- For very large knowledge bases, additional organization (categories, metadata filtering) becomes important
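The metadata-filtering idea above can be sketched in a few lines. This is a minimal illustration with a hypothetical `search` function and a toy word-overlap score standing in for embedding similarity; real systems would use a vector store’s built-in filter support. The point is the order of operations: shrink the haystack with metadata first, then rank only the survivors.

```python
# Minimal sketch: narrow a large knowledge base with a metadata filter
# before ranking. All names and the toy scoring are hypothetical;
# a real system would filter in the vector store and score by embedding
# similarity rather than shared words.

def search(chunks, query_terms, category=None, top_k=3):
    """Filter by metadata first, then rank the surviving candidates."""
    # Step 1: the metadata filter shrinks the haystack.
    candidates = [c for c in chunks if category is None or c["category"] == category]
    # Step 2: score only the candidates (stand-in for semantic similarity).
    scored = [(sum(t in c["text"].lower() for t in query_terms), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]

chunks = [
    {"category": "hr", "text": "Employees accrue 15 vacation days per year."},
    {"category": "billing", "text": "Refunds are issued within 30 days."},
    {"category": "hr", "text": "Remote work requires manager approval."},
]

results = search(chunks, ["vacation", "days"], category="hr")
```

With the `category="hr"` filter in place, the billing chunk never competes for a top-k slot, which is exactly how categories and metadata keep large knowledge bases searchable.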
Hallucinations Aren’t Gone
RAG reduces hallucinations but doesn’t eliminate them. Why hallucinations can still occur:
- The AI misinterprets the context: The retrieved chunk says one thing; the AI understands it differently.
- The AI fills gaps: If the context is incomplete, the AI might fill in plausible-sounding but inaccurate details.
- The AI over-generalizes: The context mentions one example; the AI treats it as a universal rule.
- The retrieval itself fails: If retrieval returns irrelevant chunks, the AI has bad material to work with.
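Because hallucinations can survive retrieval, many teams add an automated grounding check. Here is a deliberately simple sketch that flags response sentences with little word overlap against the retrieved context; the threshold, tokenization, and function name are all illustrative, and production systems typically use an LLM judge or entailment model instead of raw overlap.

```python
# Minimal sketch of a grounding check: flag response sentences whose
# words are mostly untraceable to the retrieved context. The 0.5
# threshold and word-overlap scoring are illustrative assumptions.
import re

def unsupported_sentences(response, context, threshold=0.5):
    context_words = set(re.findall(r"[a-z']+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < threshold:  # too few words traceable to the context
            flagged.append(sentence)
    return flagged

context = "Full-time employees receive 15 vacation days per year."
response = "Employees receive 15 vacation days. Unused days roll over indefinitely."
flagged = unsupported_sentences(response, context)
```

Here the second sentence is flagged: “roll over indefinitely” appears nowhere in the context, so it is exactly the kind of gap-filling hallucination described above.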
Evaluating RAG Quality
How do you know if your RAG system is actually working well? You need to evaluate it systematically.
Retrieval evaluation:
- Are the retrieved chunks actually relevant to the query?
- Is the most relevant information making it into the top results?
- Are important chunks being missed?
Generation evaluation:
- Are responses faithful to the retrieved context?
- Does the AI admit uncertainty when appropriate?
- Are responses well-formatted and useful?
End-to-end evaluation:
- Are users getting their questions answered?
- How often do users find the responses helpful?
- What percentage of queries lead to good outcomes?
Useful dimensions to score:
- Relevance: Do retrieved chunks relate to the question?
- Faithfulness: Does the response accurately reflect the retrieved information?
- Helpfulness: Did the response actually help the user?
- Groundedness: Is every claim in the response supported by retrieved content?
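Retrieval evaluation in particular is easy to automate once you have labeled test queries. The sketch below measures hit rate at k: the fraction of queries whose expected chunk appears in the top-k results. The corpus, retriever, and chunk ids are hypothetical, and the word-overlap retriever is a stand-in for your real one.

```python
# Minimal sketch of a retrieval evaluation harness. Given labeled test
# cases (query -> id of the chunk that should answer it), measure how
# often the expected chunk lands in the top-k results. Corpus, ids, and
# the toy retriever are illustrative assumptions.

def hit_rate_at_k(test_cases, retrieve, k=3):
    """Fraction of queries whose expected chunk id appears in the top-k."""
    hits = 0
    for query, expected_id in test_cases:
        retrieved_ids = [chunk_id for chunk_id, _ in retrieve(query)[:k]]
        hits += expected_id in retrieved_ids
    return hits / len(test_cases)

# Toy retriever: rank chunks by words shared with the query.
corpus = {
    "pol-1": "Our refund policy allows returns within 30 days.",
    "pol-2": "Employees get 15 vacation days per year.",
    "pol-3": "Support is available on weekdays from 9 to 5.",
}

def retrieve(query):
    q = set(query.lower().split())
    scored = [(cid, len(q & set(text.lower().split()))) for cid, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

tests = [
    ("what is the refund policy", "pol-1"),
    ("how many vacation days do employees get", "pol-2"),
]
score = hit_rate_at_k(tests, retrieve, k=1)
```

A harness like this gives you a number to track as you change chunking, embeddings, or filters; the generation and end-to-end dimensions still need human or model-based review.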
When to Consider Alternatives
Sometimes RAG isn’t the right tool. Consider alternatives when:
- You need pure reasoning: Math problems, logic puzzles, or analysis that doesn’t depend on specific documents might be better served by the AI’s base capabilities.
- You need real-time data: For live information (current prices, weather, sports scores), APIs that fetch real-time data are more appropriate than RAG.
- The task is creative: Writing fiction, brainstorming ideas, or generating creative content typically doesn’t benefit from retrieval.
- Simple lookups suffice: If users just need to look up a single fact from a structured database, a simple search or database query might be simpler than RAG.
- The domain is too specialized: If your domain is so specialized that general embedding models don’t understand the terminology, you might need custom models or different approaches.
RAG is one tool in the AI toolkit. Knowing when to use it, and when not to, is a sign of expertise.
Try It Yourself
Exercise 1: Diagnose the Problem
For each scenario, identify whether the problem is with retrieval, generation, or the underlying knowledge base:
- The system gives an answer, but it’s from a policy that was updated last month. The old policy is still in the system.
- The system retrieves a relevant document about vacation policy, but the response says employees get 20 days when the document says 15.
- A user asks about a product feature, but the feature isn’t documented anywhere. The system confidently describes a feature that doesn’t exist.
- The system retrieves chunks about three different products when the user asked about one specific product.
Exercise 2: Design an Evaluation
You’re deploying a RAG system for customer support. Design an evaluation approach:
- What sample queries would you use for testing?
- How would you measure retrieval quality?
- How would you measure response quality?
- What would “good enough” look like for launch?
- How would you continue monitoring after launch?
Exercise 3: Right Tool for the Job
For each task, decide whether RAG is the right approach or suggest an alternative:
- Answering questions about your company’s HR policies
- Generating creative marketing slogans
- Calculating compound interest
- Finding relevant research papers on a topic
- Getting the current temperature in a city
- Explaining how your software product works
Common Pitfalls
Pitfall 1: Blaming RAG for Content Problems
When the system gives outdated or wrong answers, the knee-jerk reaction is to blame the technology. But often the real problem is that the knowledge base has outdated or wrong content. The fix: When problems occur, check the source content first. Is the right information in the knowledge base?
Pitfall 2: Not Testing with Real Queries
Developers test with questions they think users will ask. Users ask questions developers never imagined. Real-world performance often differs from test performance. The fix: Collect real user queries (even before launch, through research) and evaluate against them. Let real usage guide improvements.
Pitfall 3: Ignoring Failure Cases
It’s tempting to focus on the times RAG works great and ignore the times it fails. But failure cases reveal the most about how to improve. The fix: Systematically review failures. Look for patterns. Address root causes rather than just celebrating successes.
Pitfall 4: Over-trusting Metrics
Metrics can be gamed or misleading. High retrieval relevance scores don’t guarantee users are happy. Good-looking evaluations don’t mean the system works in practice. The fix: Combine quantitative metrics with qualitative review. Read actual responses. Talk to actual users.
Level Up
Here’s a diagnostic challenge.
Scenario: You’ve deployed a RAG system for technical support. Users report that answers are “usually okay but sometimes completely wrong.” Leadership wants to know if RAG is the right approach or if you should abandon it.
Your task: Design a systematic investigation.
- How would you gather data on where the system succeeds and fails?
- What categories of failure would you look for?
- How would you determine if failures are fixable (better retrieval, better prompts, better content) or fundamental limitations?
- What would you recommend to leadership?

