Lesson 9: When RAG Works (and When It Doesn’t)
The Honest Conversation
RAG is powerful. But it’s not magic, and pretending otherwise leads to disappointment and failed projects. In this lesson, we’ll have an honest conversation about when RAG shines, when it struggles, and how to tell whether your RAG system is actually working well. Understanding limitations is what separates effective practitioners from people who blame the technology when things go wrong.
Core Concepts
When RAG Shines
RAG is at its best in specific conditions. Recognizing these helps you identify good use cases:
- You have good source material: RAG retrieves and uses what’s in your knowledge base. If your documents are accurate, well-written, and comprehensive, RAG can surface that quality. If they’re not, no retrieval system will save you.
- Questions have answers in the documents: RAG finds and presents existing information. It doesn’t create new knowledge. If the answer to a question is somewhere in your knowledge base, RAG can find it. If it isn’t, RAG can only acknowledge the gap.
- Users ask relatively specific questions: “What’s the refund policy?” works great. “Tell me everything about your company” is too broad; RAG won’t know what to retrieve.
- Accuracy and grounding matter: When you need responses tied to authoritative sources rather than general AI knowledge, RAG provides that grounding. The ability to cite sources is a genuine advantage.
- Information is relatively stable: If your knowledge base doesn’t change every hour, RAG can provide consistent, reliable answers based on that stable foundation.
When RAG Struggles
RAG has real limitations. Knowing them helps you avoid frustration:
- Knowledge bases are messy or low-quality: Garbage in, garbage out. If your documents are outdated, contradictory, poorly organized, or incomplete, RAG will retrieve and use that problematic content.
- Questions require reasoning beyond the documents: RAG finds information; it doesn’t perform novel analysis. “Based on these market trends, what should our strategy be?” requires reasoning that goes beyond retrieval.
- Questions are too vague or too broad: “Help me with my project” gives retrieval nothing to work with. Vague questions lead to vague retrieval, which leads to vague answers.
- The answer requires combining many sources: If answering a question requires synthesizing information from 20 different documents in a complex way, retrieval might not surface all the pieces, and generation might struggle with the synthesis.
- Real-time or rapidly changing information: RAG knowledge bases need to be updated. If information changes by the minute (stock prices, live events), RAG may not keep up.
- Questions are adversarial or tricky: Users trying to manipulate the system or asking deliberately confusing questions can sometimes get RAG to behave unexpectedly.
The Needle in a Haystack Problem
Here’s a specific limitation worth understanding: as knowledge bases get very large, finding the exact right information becomes harder. Imagine searching for one specific fact in a million documents. Even with good embeddings and semantic search, the right chunk might not surface in the top results. It’s there, but it’s lost in a sea of somewhat-related content. Implications:
- Smaller, focused knowledge bases often outperform massive ones
- Curation matters; don’t include everything just because you can
- For very large knowledge bases, additional organization (categories, metadata filtering) becomes important
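The metadata-filtering idea above can be sketched in a few lines. This is a minimal illustration with a hypothetical `search` function and a toy word-overlap score standing in for embedding similarity; real systems would use a vector store’s built-in filter support. The point is the order of operations: shrink the haystack with metadata first, then rank only the survivors.

```python
# Minimal sketch: narrow a large knowledge base with a metadata filter
# before ranking. All names and the toy scoring are hypothetical;
# a real system would filter in the vector store and score by embedding
# similarity rather than shared words.

def search(chunks, query_terms, category=None, top_k=3):
    """Filter by metadata first, then rank the surviving candidates."""
    # Step 1: the metadata filter shrinks the haystack.
    candidates = [c for c in chunks if category is None or c["category"] == category]
    # Step 2: score only the candidates (stand-in for semantic similarity).
    scored = [(sum(t in c["text"].lower() for t in query_terms), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]

chunks = [
    {"category": "hr", "text": "Employees accrue 15 vacation days per year."},
    {"category": "billing", "text": "Refunds are issued within 30 days."},
    {"category": "hr", "text": "Remote work requires manager approval."},
]

results = search(chunks, ["vacation", "days"], category="hr")
```

With the `category="hr"` filter in place, the billing chunk never competes for a top-k slot, which is exactly how categories and metadata keep large knowledge bases searchable.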
Hallucinations Aren’t Gone
RAG reduces hallucinations but doesn’t eliminate them. Why hallucinations can still occur:
- The AI misinterprets the context: The retrieved chunk says one thing; the AI understands it differently.
- The AI fills gaps: If the context is incomplete, the AI might fill in plausible-sounding but inaccurate details.
- The AI over-generalizes: The context mentions one example; the AI treats it as a universal rule.
- The retrieval itself fails: If retrieval returns irrelevant chunks, the AI has bad material to work with.
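Because hallucinations can survive retrieval, many teams add an automated grounding check. Here is a deliberately simple sketch that flags response sentences with little word overlap against the retrieved context; the threshold, tokenization, and function name are all illustrative, and production systems typically use an LLM judge or entailment model instead of raw overlap.

```python
# Minimal sketch of a grounding check: flag response sentences whose
# words are mostly untraceable to the retrieved context. The 0.5
# threshold and word-overlap scoring are illustrative assumptions.
import re

def unsupported_sentences(response, context, threshold=0.5):
    context_words = set(re.findall(r"[a-z']+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < threshold:  # too few words traceable to the context
            flagged.append(sentence)
    return flagged

context = "Full-time employees receive 15 vacation days per year."
response = "Employees receive 15 vacation days. Unused days roll over indefinitely."
flagged = unsupported_sentences(response, context)
```

Here the second sentence is flagged: “roll over indefinitely” appears nowhere in the context, so it is exactly the kind of gap-filling hallucination described above.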
Evaluating RAG Quality
How do you know if your RAG system is actually working well? You need to evaluate it systematically.
Retrieval evaluation:
- Are the retrieved chunks actually relevant to the query?
- Is the most relevant information making it into the top results?
- Are important chunks being missed?
Generation evaluation:
- Are responses faithful to the retrieved context?
- Does the AI admit uncertainty when appropriate?
- Are responses well-formatted and useful?
End-to-end evaluation:
- Are users getting their questions answered?
- How often do users find the responses helpful?
- What percentage of queries lead to good outcomes?
Useful dimensions to score:
- Relevance: Do retrieved chunks relate to the question?
- Faithfulness: Does the response accurately reflect the retrieved information?
- Helpfulness: Did the response actually help the user?
- Groundedness: Is every claim in the response supported by retrieved content?
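Retrieval evaluation in particular is easy to automate once you have labeled test queries. The sketch below measures hit rate at k: the fraction of queries whose expected chunk appears in the top-k results. The corpus, retriever, and chunk ids are hypothetical, and the word-overlap retriever is a stand-in for your real one.

```python
# Minimal sketch of a retrieval evaluation harness. Given labeled test
# cases (query -> id of the chunk that should answer it), measure how
# often the expected chunk lands in the top-k results. Corpus, ids, and
# the toy retriever are illustrative assumptions.

def hit_rate_at_k(test_cases, retrieve, k=3):
    """Fraction of queries whose expected chunk id appears in the top-k."""
    hits = 0
    for query, expected_id in test_cases:
        retrieved_ids = [chunk_id for chunk_id, _ in retrieve(query)[:k]]
        hits += expected_id in retrieved_ids
    return hits / len(test_cases)

# Toy retriever: rank chunks by words shared with the query.
corpus = {
    "pol-1": "Our refund policy allows returns within 30 days.",
    "pol-2": "Employees get 15 vacation days per year.",
    "pol-3": "Support is available on weekdays from 9 to 5.",
}

def retrieve(query):
    q = set(query.lower().split())
    scored = [(cid, len(q & set(text.lower().split()))) for cid, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

tests = [
    ("what is the refund policy", "pol-1"),
    ("how many vacation days do employees get", "pol-2"),
]
score = hit_rate_at_k(tests, retrieve, k=1)
```

A harness like this gives you a number to track as you change chunking, embeddings, or filters; the generation and end-to-end dimensions still need human or model-based review.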
When to Consider Alternatives
Sometimes RAG isn’t the right tool. Consider alternatives when:
- You need pure reasoning: Math problems, logic puzzles, or analysis that doesn’t depend on specific documents might be better served by the AI’s base capabilities.
- You need real-time data: For live information (current prices, weather, sports scores), APIs that fetch real-time data are more appropriate than RAG.
- The task is creative: Writing fiction, brainstorming ideas, or generating creative content typically doesn’t benefit from retrieval.
- Simple lookups suffice: If users just need to look up a single fact from a structured database, a simple search or database query might be simpler than RAG.
- The domain is too specialized: If your domain is so specialized that general embedding models don’t understand the terminology, you might need custom models or different approaches.
RAG is one tool in the AI toolkit. Knowing when to use it, and when not to, is a sign of expertise.
Try It Yourself
Exercise 1: Diagnose the Problem
For each scenario, identify whether the problem is with retrieval, generation, or the underlying knowledge base:
- The system gives an answer, but it’s from a policy that was updated last month. The old policy is still in the system.
- The system retrieves a relevant document about vacation policy, but the response says employees get 20 days when the document says 15.
- A user asks about a product feature, but the feature isn’t documented anywhere. The system confidently describes a feature that doesn’t exist.
- The system retrieves chunks about three different products when the user asked about one specific product.
Exercise 2: Design an Evaluation
You’re deploying a RAG system for customer support. Design an evaluation approach:
- What sample queries would you use for testing?
- How would you measure retrieval quality?
- How would you measure response quality?
- What would “good enough” look like for launch?
- How would you continue monitoring after launch?
Exercise 3: Right Tool for the Job
For each task, decide whether RAG is the right approach or suggest an alternative:
- Answering questions about your company’s HR policies
- Generating creative marketing slogans
- Calculating compound interest
- Finding relevant research papers on a topic
- Getting the current temperature in a city
- Explaining how your software product works
Common Pitfalls
Pitfall 1: Blaming RAG for Content Problems
When the system gives outdated or wrong answers, the knee-jerk reaction is to blame the technology. But often the real problem is that the knowledge base has outdated or wrong content. The fix: When problems occur, check the source content first. Is the right information in the knowledge base?
Pitfall 2: Not Testing with Real Queries
Developers test with questions they think users will ask. Users ask questions developers never imagined. Real-world performance often differs from test performance. The fix: Collect real user queries (even before launch, through research) and evaluate against them. Let real usage guide improvements.
Pitfall 3: Ignoring Failure Cases
It’s tempting to focus on the times RAG works great and ignore the times it fails. But failure cases reveal the most about how to improve. The fix: Systematically review failures. Look for patterns. Address root causes rather than just celebrating successes.
Pitfall 4: Over-trusting Metrics
Metrics can be gamed or misleading. High retrieval relevance scores don’t guarantee users are happy. Good-looking evaluations don’t mean the system works in practice. The fix: Combine quantitative metrics with qualitative review. Read actual responses. Talk to actual users.
Level Up
Here’s a diagnostic challenge.
Scenario: You’ve deployed a RAG system for technical support. Users report that answers are “usually okay but sometimes completely wrong.” Leadership wants to know if RAG is the right approach or if you should abandon it.
Your task: Design a systematic investigation.
- How would you gather data on where the system succeeds and fails?
- What categories of failure would you look for?
- How would you determine if failures are fixable (better retrieval, better prompts, better content) or fundamental limitations?
- What would you recommend to leadership?

