Lesson 5: The Retrieval Step
Finding the Needle in the Haystack
You have a knowledge base with thousands of chunks, each converted to an embedding. A user asks a question. Now what? The retrieval step is where the magic happens: out of all that information, the system needs to find the specific chunks that are most relevant to the user’s question. Get this right, and the AI has great material to work with. Get it wrong, and even the smartest AI will give unhelpful answers. Let’s understand how retrieval actually works.

Core Concepts
From Question to Embedding
The retrieval process starts by converting the user’s question into an embedding using the same model that embedded the knowledge base. This is crucial: the question and the documents need to be in the same “embedding space” to be comparable. It’s like having a conversation where everyone speaks the same language. If the question is embedded with a different model than the documents, the similarity comparisons become meaningless. So when a user asks “What’s the refund policy for digital products?”, that question becomes a vector of numbers, just like all the chunks in your knowledge base.

Similarity Search: Finding the Closest Matches
With both the question and all the chunks represented as embeddings, we can now compare them mathematically. The most common approach is cosine similarity, which measures the angle between two vectors:

- Vectors pointing in the same direction: high similarity (close to 1)
- Vectors pointing in unrelated (perpendicular) directions: low similarity (close to 0)
- Vectors pointing in opposite directions: negative similarity (close to -1)
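In code, cosine similarity and a simple ranked retrieval over it can be sketched in a few lines. The 3-dimensional vectors below are toy values chosen for illustration; real embedding models produce vectors with hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_embedding, chunk_embeddings, k=3):
    """Score every chunk against the query and keep the k best matches."""
    scored = [(cosine_similarity(query_embedding, emb), chunk_id)
              for chunk_id, emb in chunk_embeddings.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Toy embeddings standing in for real model output.
chunks = {
    "refund-policy": [0.9, 0.1, 0.2],
    "shipping-info": [0.2, 0.8, 0.1],
    "company-history": [0.1, 0.2, 0.9],
}
query = [0.8, 0.2, 0.3]  # pretend embedding of "What's the refund policy?"

for score, chunk_id in top_k(query, chunks, k=2):
    print(f"{chunk_id}: {score:.3f}")
```

With these toy vectors, the refund chunk scores far above the others because its vector points in nearly the same direction as the query.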
Top-K Retrieval: Why We Don’t Just Take the Best Match
You might think: just retrieve the single most similar chunk and use that. But in practice, we usually retrieve multiple chunks (the “top K,” where K is typically 3-10). Why retrieve multiple chunks?

1. Relevant information might be spread across chunks. The answer to “How do I apply for a refund and how long does it take?” might require information from two different sections.
2. The top match isn’t always the best. Similarity scores are imperfect. The second or third match might actually be more useful for a particular question.
3. More context is often better. Within reason, giving the AI more relevant context leads to more comprehensive answers.
4. Redundancy provides confirmation. If multiple chunks say the same thing, the AI can be more confident in that information.

The value of K is a tuning parameter. Too few, and you might miss relevant information. Too many, and you dilute focus or hit context window limits.

Beyond Simple Similarity: Ranking Strategies
Basic similarity search is a great start, but sophisticated RAG systems often layer additional ranking strategies:

- Recency weighting: More recent documents might be more relevant, especially for policies or news. You can boost the scores of newer content.
- Authority weighting: Some sources are more authoritative than others. Official documentation might be weighted higher than blog posts.
- Diversity sampling: If the top 5 results all say the same thing, you might want to include a 6th result that offers a different perspective, even if its similarity score is lower.
- Query expansion: Sometimes the user’s question doesn’t perfectly match how information is phrased in documents. You can expand the query with synonyms or related terms to improve retrieval.
- Hybrid search: Combine semantic similarity with keyword matching. Sometimes exact keyword matches are important (like product names or technical terms that embeddings might not handle perfectly).

The Relevance Threshold
Not every query will have good matches in your knowledge base. If someone asks about a topic that isn’t covered in your documents, even the “best” matches might be irrelevant. Many systems use a relevance threshold: if no chunks score above a certain similarity level, the system might:

- Return no results
- Acknowledge that it doesn’t have information on this topic
- Fall back to the AI’s general knowledge (with appropriate caveats)
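A threshold check, combined with the recency weighting mentioned earlier, can be sketched as follows. The threshold value and half-life here are illustrative assumptions to tune against real queries, not standard values:

```python
from datetime import date

SIMILARITY_THRESHOLD = 0.75  # assumed cutoff; tune it against real queries

def recency_boost(similarity, doc_date, today, half_life_days=365):
    """Halve a chunk's score for every half_life_days of age (one possible scheme)."""
    age_days = (today - doc_date).days
    return similarity * 0.5 ** (age_days / half_life_days)

def retrieve_or_fall_back(scored_chunks, threshold=SIMILARITY_THRESHOLD):
    """scored_chunks: (score, chunk_text) pairs. Returns None when nothing passes,
    so the caller can acknowledge the gap or fall back to general knowledge."""
    passing = [(s, c) for s, c in scored_chunks if s >= threshold]
    return sorted(passing, reverse=True) or None

scored = [(0.82, "Refunds are issued within 30 days..."),
          (0.41, "Our company history began in 1995...")]
print(retrieve_or_fall_back(scored))  # only the refund chunk passes
print(retrieve_or_fall_back([(0.3, "off-topic chunk")]))  # None
```

Returning None explicitly, rather than the least-bad chunk, is what lets the system say “I don’t have information on this” instead of answering from irrelevant context.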
Speed vs. Accuracy Trade-offs
Searching through millions of embeddings to find the most similar ones could be slow if done naively. In practice, RAG systems use specialized techniques:

- Vector databases: Purpose-built databases optimized for similarity search, such as Pinecone, Weaviate, or Chroma.
- Approximate nearest neighbor (ANN) algorithms: Instead of comparing against every single embedding, these algorithms use clever indexing to quickly find approximately the best matches. They trade a tiny bit of accuracy for massive speed improvements.

For most applications, these approximate methods are good enough. The chunk that’s technically ranked 5th might show up as 4th, but the overall retrieval quality remains high.

Try It Yourself
Exercise 1: Think Like a Retrieval System
Imagine a knowledge base about a software product with these chunks:

- “Installation guide for Windows users…”
- “Pricing plans start at $10/month…”
- “To reset your password, click ‘Forgot Password’…”
- “System requirements: Windows 10+, 4GB RAM…”
- “Our refund policy allows returns within 30 days…”
For each of these questions, which chunk would you expect the retrieval system to return?

- “How do I install the software?”
- “What are the hardware requirements?”
- “I forgot my password”
- “How much does it cost?”
- “Can I get my money back?”
Exercise 2: Design Your Top-K Strategy
For a customer support RAG system, consider:

- What value of K would you choose? Why?
- Would you use any ranking adjustments beyond similarity? (Recency? Authority?)
- What would you do when no results pass the relevance threshold?
Exercise 3: Edge Cases
Think of questions that would be challenging for retrieval:

- Questions that use very different words than the source documents
- Questions that require combining information from multiple unrelated sections
- Questions about topics not covered in the knowledge base
- Questions with ambiguous terms that could match multiple topics
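One mitigation for the first edge case is the query expansion strategy mentioned earlier: enrich the query with synonyms before retrieval so it has more chances to match the documents’ wording. A minimal sketch with a hand-built synonym table (production systems might instead use a thesaurus, a domain glossary, or an LLM to generate expansions):

```python
# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "cost": ["price", "pricing", "fee"],
    "install": ["setup", "installation"],
    "refund": ["money back", "return"],
}

def expand_query(query):
    """Append known synonyms for any term that appears in the query."""
    terms = query.lower().split()
    expansions = []
    for term in terms:
        expansions.extend(SYNONYMS.get(term, []))
    return query if not expansions else query + " " + " ".join(expansions)

print(expand_query("How much does it cost"))
```

The expanded query is then embedded and searched as usual; the extra terms pull its embedding closer to chunks phrased differently than the user’s question.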
Common Pitfalls
Pitfall 1: Retrieving Based on Topic, Not Relevance
A chunk might be about the same general topic as the question but not actually answer it. “Our company history began in 1995…” is about the company but doesn’t help someone asking “What products do you sell?”

The fix: Evaluate retrieval quality with real questions. Look at what actually gets retrieved and ask: does this help answer the question?

Pitfall 2: Ignoring Retrieval Failures
If retrieval returns poor results, the AI will generate poor responses. But it’s easy to blame the AI when the real problem is retrieval.

The fix: When RAG responses are bad, check what was retrieved first. The generation step can only work with what retrieval provides.

Pitfall 3: One-Size-Fits-All K
Using the same number of retrieved chunks for every query isn’t optimal. A simple factual question might need just one chunk. A complex analysis might need ten.

The fix: Consider dynamic retrieval strategies that adjust K based on query complexity or result quality.

Pitfall 4: Not Testing with Real Queries
Retrieval that works great for your test queries might fail for the queries real users actually ask. Users phrase things differently than you expect.

The fix: Collect real user queries and regularly evaluate retrieval performance against them. Let real usage inform your tuning.

Level Up
Here’s a challenge to deepen your understanding.

Scenario: You’re building a RAG system for a legal research application. Lawyers need to find relevant case law and statutes. Consider these complications:

- Legal documents use very specific terminology. How would you ensure retrieval handles this?
- A query like “cases involving breach of contract in California from 2020-2023” requires matching on concepts (breach of contract), location (California), and time range (2020-2023). How would pure semantic search handle this?
- Some legal precedents are more authoritative than others. How might you incorporate this?
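As a starting point for the location and time-range complication: pure semantic search handles concepts well but hard constraints poorly, so a common pattern is to pre-filter on structured metadata and run semantic ranking only over the survivors. A minimal sketch (the corpus and field names here are hypothetical):

```python
def filter_then_rank(chunks, jurisdiction=None, year_range=None):
    """Pre-filter on structured metadata; semantic ranking would then run
    over the surviving candidates. Field names are hypothetical."""
    candidates = []
    for chunk in chunks:
        if jurisdiction and chunk["jurisdiction"] != jurisdiction:
            continue
        if year_range and not (year_range[0] <= chunk["year"] <= year_range[1]):
            continue
        candidates.append(chunk)
    # This sketch stops at filtering; a real system would now embed the
    # conceptual part of the query ("breach of contract") and rank candidates.
    return candidates

corpus = [
    {"id": "case-1", "jurisdiction": "California", "year": 2021},
    {"id": "case-2", "jurisdiction": "New York", "year": 2022},
    {"id": "case-3", "jurisdiction": "California", "year": 2018},
]
print(filter_then_rank(corpus, jurisdiction="California", year_range=(2020, 2023)))
```

Splitting the query this way keeps exact constraints exact, while the fuzzy conceptual matching stays in the semantic layer where it belongs.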

