Lesson 5: The Retrieval Step
Finding the Needle in the Haystack
You have a knowledge base with thousands of chunks, each converted to an embedding. A user asks a question. Now what? The retrieval step is where the magic happens: out of all that information, the system needs to find the specific chunks that are most relevant to the user’s question. Get this right, and the AI has great material to work with. Get it wrong, and even the smartest AI will give unhelpful answers. Let’s understand how retrieval actually works.

Core Concepts
From Question to Embedding
The retrieval process starts by converting the user’s question into an embedding using the same model that embedded the knowledge base. This is crucial: the question and the documents need to be in the same “embedding space” to be comparable. It’s like having a conversation where everyone speaks the same language. If the question is embedded with a different model than the documents, the similarity comparisons become meaningless. So when a user asks “What’s the refund policy for digital products?”, that question becomes a vector of numbers, just like all the chunks in your knowledge base.

Similarity Search: Finding the Closest Matches
With both the question and all the chunks represented as embeddings, we can now compare them mathematically. The most common approach is cosine similarity, which measures the angle between two vectors:

- Vectors pointing in the same direction: high similarity (close to 1)
- Vectors pointing in unrelated (perpendicular) directions: low similarity (close to 0)
- Vectors pointing in opposite directions: negative similarity (close to -1)
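In code, cosine similarity and a simple ranked retrieval over it can be sketched in a few lines. The 3-dimensional vectors below are toy values chosen for illustration; real embedding models produce vectors with hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_embedding, chunk_embeddings, k=3):
    """Score every chunk against the query and keep the k best matches."""
    scored = [(cosine_similarity(query_embedding, emb), chunk_id)
              for chunk_id, emb in chunk_embeddings.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Toy embeddings standing in for real model output.
chunks = {
    "refund-policy": [0.9, 0.1, 0.2],
    "shipping-info": [0.2, 0.8, 0.1],
    "company-history": [0.1, 0.2, 0.9],
}
query = [0.8, 0.2, 0.3]  # pretend embedding of "What's the refund policy?"

for score, chunk_id in top_k(query, chunks, k=2):
    print(f"{chunk_id}: {score:.3f}")
```

With these toy vectors, the refund chunk scores far above the others because its vector points in nearly the same direction as the query.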
Top-K Retrieval: Why We Don’t Just Take the Best Match
You might think: just retrieve the single most similar chunk and use that. But in practice, we usually retrieve multiple chunks (the “top K,” where K is typically 3-10). Why retrieve multiple chunks?

1. Relevant information might be spread across chunks. The answer to “How do I apply for a refund and how long does it take?” might require information from two different sections.
2. The top match isn’t always the best. Similarity scores are imperfect. The second or third match might actually be more useful for a particular question.
3. More context is often better. Within reason, giving the AI more relevant context leads to more comprehensive answers.
4. Redundancy provides confirmation. If multiple chunks say the same thing, the AI can be more confident in that information.

The value of K is a tuning parameter. Too few, and you might miss relevant information. Too many, and you dilute focus or hit context window limits.

Beyond Simple Similarity: Ranking Strategies
Basic similarity search is a great start, but sophisticated RAG systems often layer additional ranking strategies:

- Recency weighting: More recent documents might be more relevant, especially for policies or news. You can boost the scores of newer content.
- Authority weighting: Some sources are more authoritative than others. Official documentation might be weighted higher than blog posts.
- Diversity sampling: If the top 5 results all say the same thing, you might want to include a 6th result that offers a different perspective, even if its similarity score is lower.
- Query expansion: Sometimes the user’s question doesn’t perfectly match how information is phrased in documents. You can expand the query with synonyms or related terms to improve retrieval.
- Hybrid search: Combine semantic similarity with keyword matching. Sometimes exact keyword matches are important (like product names or technical terms that embeddings might not handle perfectly).

The Relevance Threshold
Not every query will have good matches in your knowledge base. If someone asks about a topic that isn’t covered in your documents, even the “best” matches might be irrelevant. Many systems use a relevance threshold: if no chunks score above a certain similarity level, the system might:

- Return no results
- Acknowledge that it doesn’t have information on this topic
- Fall back to the AI’s general knowledge (with appropriate caveats)
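A threshold check, combined with the recency weighting mentioned earlier, can be sketched as follows. The threshold value and half-life here are illustrative assumptions to tune against real queries, not standard values:

```python
from datetime import date

SIMILARITY_THRESHOLD = 0.75  # assumed cutoff; tune it against real queries

def recency_boost(similarity, doc_date, today, half_life_days=365):
    """Halve a chunk's score for every half_life_days of age (one possible scheme)."""
    age_days = (today - doc_date).days
    return similarity * 0.5 ** (age_days / half_life_days)

def retrieve_or_fall_back(scored_chunks, threshold=SIMILARITY_THRESHOLD):
    """scored_chunks: (score, chunk_text) pairs. Returns None when nothing passes,
    so the caller can acknowledge the gap or fall back to general knowledge."""
    passing = [(s, c) for s, c in scored_chunks if s >= threshold]
    return sorted(passing, reverse=True) or None

scored = [(0.82, "Refunds are issued within 30 days..."),
          (0.41, "Our company history began in 1995...")]
print(retrieve_or_fall_back(scored))  # only the refund chunk passes
print(retrieve_or_fall_back([(0.3, "off-topic chunk")]))  # None
```

Returning None explicitly, rather than the least-bad chunk, is what lets the system say “I don’t have information on this” instead of answering from irrelevant context.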
Speed vs. Accuracy Trade-offs
Searching through millions of embeddings to find the most similar ones could be slow if done naively. In practice, RAG systems use specialized techniques:

- Vector databases: Purpose-built databases optimized for similarity search, such as Pinecone, Weaviate, or Chroma.
- Approximate nearest neighbor (ANN) algorithms: Instead of comparing against every single embedding, these algorithms use clever indexing to quickly find approximately the best matches. They trade a tiny bit of accuracy for massive speed improvements.

For most applications, these approximate methods are good enough. The chunk that’s technically ranked 5th might show up as 4th, but the overall retrieval quality remains high.

Try It Yourself
Exercise 1: Think Like a Retrieval System
Imagine a knowledge base about a software product with these chunks:

- “Installation guide for Windows users…”
- “Pricing plans start at $10/month…”
- “To reset your password, click ‘Forgot Password’…”
- “System requirements: Windows 10+, 4GB RAM…”
- “Our refund policy allows returns within 30 days…”
For each of these questions, which chunk would you expect the retrieval system to return?

- “How do I install the software?”
- “What are the hardware requirements?”
- “I forgot my password”
- “How much does it cost?”
- “Can I get my money back?”
Exercise 2: Design Your Top-K Strategy
For a customer support RAG system, consider:

- What value of K would you choose? Why?
- Would you use any ranking adjustments beyond similarity? (Recency? Authority?)
- What would you do when no results pass the relevance threshold?
Exercise 3: Edge Cases
Think of questions that would be challenging for retrieval:

- Questions that use very different words than the source documents
- Questions that require combining information from multiple unrelated sections
- Questions about topics not covered in the knowledge base
- Questions with ambiguous terms that could match multiple topics
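One mitigation for the first edge case is the query expansion strategy mentioned earlier: enrich the query with synonyms before retrieval so it has more chances to match the documents’ wording. A minimal sketch with a hand-built synonym table (production systems might instead use a thesaurus, a domain glossary, or an LLM to generate expansions):

```python
# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "cost": ["price", "pricing", "fee"],
    "install": ["setup", "installation"],
    "refund": ["money back", "return"],
}

def expand_query(query):
    """Append known synonyms for any term that appears in the query."""
    terms = query.lower().split()
    expansions = []
    for term in terms:
        expansions.extend(SYNONYMS.get(term, []))
    return query if not expansions else query + " " + " ".join(expansions)

print(expand_query("How much does it cost"))
```

The expanded query is then embedded and searched as usual; the extra terms pull its embedding closer to chunks phrased differently than the user’s question.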
Common Pitfalls
Pitfall 1: Retrieving Based on Topic, Not Relevance
A chunk might be about the same general topic as the question but not actually answer it. “Our company history began in 1995…” is about the company but doesn’t help someone asking “What products do you sell?”

The fix: Evaluate retrieval quality with real questions. Look at what actually gets retrieved and ask: does this help answer the question?

Pitfall 2: Ignoring Retrieval Failures
If retrieval returns poor results, the AI will generate poor responses. But it’s easy to blame the AI when the real problem is retrieval.

The fix: When RAG responses are bad, check what was retrieved first. The generation step can only work with what retrieval provides.

Pitfall 3: One-Size-Fits-All K
Using the same number of retrieved chunks for every query isn’t optimal. A simple factual question might need just one chunk. A complex analysis might need ten.

The fix: Consider dynamic retrieval strategies that adjust K based on query complexity or result quality.

Pitfall 4: Not Testing with Real Queries
Retrieval that works great for your test queries might fail for the queries real users actually ask. Users phrase things differently than you expect.

The fix: Collect real user queries and regularly evaluate retrieval performance against them. Let real usage inform your tuning.

Level Up
Here’s a challenge to deepen your understanding.

Scenario: You’re building a RAG system for a legal research application. Lawyers need to find relevant case law and statutes. Consider these complications:

- Legal documents use very specific terminology. How would you ensure retrieval handles this?
- A query like “cases involving breach of contract in California from 2020-2023” requires matching on concepts (breach of contract), location (California), and time range (2020-2023). How would pure semantic search handle this?
- Some legal precedents are more authoritative than others. How might you incorporate this?
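As a starting point for the location and time-range complication: pure semantic search handles concepts well but hard constraints poorly, so a common pattern is to pre-filter on structured metadata and run semantic ranking only over the survivors. A minimal sketch (the corpus and field names here are hypothetical):

```python
def filter_then_rank(chunks, jurisdiction=None, year_range=None):
    """Pre-filter on structured metadata; semantic ranking would then run
    over the surviving candidates. Field names are hypothetical."""
    candidates = []
    for chunk in chunks:
        if jurisdiction and chunk["jurisdiction"] != jurisdiction:
            continue
        if year_range and not (year_range[0] <= chunk["year"] <= year_range[1]):
            continue
        candidates.append(chunk)
    # This sketch stops at filtering; a real system would now embed the
    # conceptual part of the query ("breach of contract") and rank candidates.
    return candidates

corpus = [
    {"id": "case-1", "jurisdiction": "California", "year": 2021},
    {"id": "case-2", "jurisdiction": "New York", "year": 2022},
    {"id": "case-3", "jurisdiction": "California", "year": 2018},
]
print(filter_then_rank(corpus, jurisdiction="California", year_range=(2020, 2023)))
```

Splitting the query this way keeps exact constraints exact, while the fuzzy conceptual matching stays in the semantic layer where it belongs.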

