Lesson 4: Building a Knowledge Base
From Messy Documents to Searchable Knowledge
You’ve learned how RAG works and how embeddings capture meaning. But here’s a practical question: what do you actually search through? The answer is your knowledge base: the collection of information that RAG retrieves from. And here’s the thing: how you prepare that knowledge base dramatically affects how well RAG works.

Think of it like organizing a library. You could throw all the books in a giant pile (technically a library, but good luck finding anything). Or you could organize them thoughtfully, with clear sections and a system that helps you find what you need. Building a good knowledge base is that organizational work for RAG.

Core Concepts
What Can Be a Knowledge Source?
Almost any text-based information can become part of a RAG knowledge base:

Documents and files:
- PDFs, Word documents, text files
- Help articles and documentation
- Policy manuals and handbooks
- Research papers and reports

Web content:
- Website pages
- Blog posts
- FAQ sections
- Knowledge base articles

Structured data:
- Database records (converted to text)
- Spreadsheets (with context added)
- Product catalogs
- Customer information

Communications:
- Email archives
- Chat transcripts
- Meeting notes
- Support tickets
Why We Can’t Just Embed Entire Documents
Here’s an important concept: you typically don’t embed whole documents. Instead, you break them into smaller pieces called chunks. Why? Several reasons:

1. Embedding quality degrades with length. An embedding tries to capture the “meaning” of a piece of text. A 50-page document covers many topics, so its embedding becomes a vague average that doesn’t strongly match any specific query.
2. Context window limits. When retrieved text gets added to the AI prompt, it counts against the context window. If you retrieve entire long documents, you’ll quickly run out of space.
3. Precision matters. If someone asks about vacation policy, you want to retrieve the specific section about vacation, not an entire 100-page employee handbook.

Chunking solves these problems by creating focused, manageable pieces of information.

The Goldilocks Problem: Chunk Size
Choosing the right chunk size is one of the most important decisions in building a knowledge base. And like Goldilocks, you’re looking for something that’s “just right.”

Chunks that are too small:
- Lose important context
- A sentence about “15 days” makes no sense without knowing it’s about vacation policy
- Lead to fragmented, confusing retrieval

Chunks that are too large:
- Dilute the embedding (meaning gets averaged out)
- Include irrelevant information alongside relevant content
- Use up context window space inefficiently

Chunks that are “just right”:
- Contain complete, self-contained ideas
- Are large enough to be meaningful but small enough to be specific
- Typically range from 200-1000 tokens (roughly 150-750 words)
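One simple way to land in that “just right” range is to split on word count, with a small overlap between neighboring chunks so no sentence is stranded at a boundary. Here is a minimal sketch; the 0.75 words-per-token ratio is only a rough rule of thumb, and the default sizes are illustrative:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~0.75 words per token, so tokens ≈ words / 0.75."""
    return round(len(text.split()) / 0.75)

def chunk_text(text: str, chunk_words: int = 300, overlap_words: int = 30) -> list[str]:
    """Split text into fixed-size word chunks, overlapping each chunk with the next."""
    words = text.split()
    chunks = []
    step = chunk_words - overlap_words  # advance less than a full chunk to create overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # reached the end; avoid emitting a tiny trailing duplicate
    return chunks
```

A 300-word chunk works out to roughly 400 tokens, comfortably inside the 200-1000 range quoted above; real systems would count tokens with the actual tokenizer rather than this word-count heuristic.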
Chunking Strategies
There are several approaches to breaking documents into chunks:

Fixed-size chunking: Split text every N tokens or characters. Simple and predictable, but might cut sentences or paragraphs in awkward places.

Semantic chunking: Split at natural boundaries like paragraphs, sections, or topic changes. Preserves meaning better but requires more sophisticated processing.

Overlapping chunks: Include some overlap between adjacent chunks so context isn’t lost at boundaries. If chunk 1 ends mid-thought, chunk 2 might include the last few sentences of chunk 1.

Hierarchical chunking: Create chunks at multiple levels (document, section, paragraph) and retrieve at the appropriate level based on the query.

Most RAG implementations use some combination of these approaches, often splitting at paragraph boundaries with slight overlap.

Adding Metadata
Raw text chunks are useful, but chunks with metadata are even better. Metadata is additional information attached to each chunk:

Source information:
- Document title
- Section heading
- Page number
- URL

Temporal information:
- Creation date
- Last updated
- Version number

Categorization:
- Document type (policy, tutorial, FAQ)
- Department or team
- Product or feature area
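In practice, a chunk and its metadata travel together as a single record. A minimal sketch, with field names and values that are purely illustrative rather than any standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """A piece of text plus the metadata attached to it."""
    text: str
    metadata: dict = field(default_factory=dict)

vacation_chunk = Chunk(
    text="Full-time employees accrue 15 vacation days per year.",
    metadata={
        "document_title": "Employee Handbook",
        "section_heading": "Vacation Policy",
        "page_number": 12,
        "last_updated": "2024-06-01",
        "document_type": "policy",
        "department": "HR",
    },
)

# Metadata enables filtered retrieval, e.g. restricting search to policy documents:
is_policy = vacation_chunk.metadata["document_type"] == "policy"
```

The payoff comes at retrieval time: the system can filter by `document_type` or `department` before searching, and can show the source title and section alongside the answer.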
The Indexing Process
Once you have your chunks and metadata, you need to create the searchable index. The process typically looks like:

- Collect: Gather all source documents
- Process: Extract text, handle formatting, clean up errors
- Chunk: Break into appropriately sized pieces
- Enrich: Add metadata to each chunk
- Embed: Convert each chunk to an embedding vector
- Store: Save the chunks, embeddings, and metadata in a vector database
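The six steps above can be sketched end to end in miniature. Here `embed()` is a toy stand-in for a real embedding model, and the “vector database” is just an in-memory list; both are assumptions for illustration only:

```python
def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashes words into a tiny vector."""
    vec = [0.0] * 8
    for word in text.lower().split():
        vec[hash(word) % 8] += 1.0
    return vec

def build_index(documents: list[dict]) -> list[dict]:
    """Collect -> Process -> Chunk -> Enrich -> Embed -> Store, in miniature."""
    index = []
    for doc in documents:                                  # Collect
        text = doc["text"].strip()                         # Process: clean up
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]  # Chunk
        for i, para in enumerate(paragraphs):
            index.append({                                 # Store
                "text": para,
                "metadata": {"title": doc["title"], "chunk": i},  # Enrich
                "embedding": embed(para),                  # Embed
            })
    return index
```

A real pipeline would swap in proper text extraction, a smarter chunker, a production embedding model, and a vector database, but the shape of the loop stays the same.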
Quality In, Quality Out
Here’s a crucial principle: your RAG system is only as good as your knowledge base. If your source documents are:

- Outdated, your answers will be wrong
- Poorly written, your answers will be confusing
- Contradictory, your answers will be inconsistent
- Missing important information, your answers will have gaps
Try It Yourself
Exercise 1: Identify Your Knowledge Sources
Think about a real use case (your job, a project, a personal interest area). List:

- What documents or information would you want a RAG system to search?
- Where does that information currently live? (Files, websites, databases, etc.)
- What format is it in? What processing might be needed?
Exercise 2: Chunking Practice
Take a long document you have access to (a report, an article, a manual). Read through and identify:

- Natural break points (section headers, topic changes, paragraph boundaries)
- How would you chunk this document?
- Are there places where context would be lost if you split there?
Exercise 3: Metadata Design
For a RAG system serving a customer support team, what metadata would be valuable for each chunk? Consider:

- What filtering might users want? (product, topic, date)
- What source information should be displayed with answers?
- What metadata would help identify outdated content?
Common Pitfalls
Pitfall 1: Inconsistent Chunk Sizes
Mixing very small and very large chunks in the same knowledge base leads to unpredictable retrieval. Large chunks dominate some queries; small chunks get lost in others.

The fix: Aim for consistent chunk sizes throughout your knowledge base. If some documents naturally have different structures, consider separate indices or normalization.

Pitfall 2: Losing Context at Boundaries
When you chunk at arbitrary points, important context can be lost. “The policy is 15 days” means nothing without knowing what policy and what the 15 days refer to.

The fix: Use overlap, include section headers in each chunk, or add contextual prefixes that remind the chunk what document and section it came from.

Pitfall 3: Neglecting Maintenance
Knowledge bases aren’t “set and forget.” Information changes, documents get updated, new content gets created. A knowledge base that isn’t maintained becomes unreliable.

The fix: Build processes for regular updates. Track when content was last indexed. Flag stale content for review.

Pitfall 4: Including Everything
The temptation is to include every document you can find. But more isn’t always better. Irrelevant content creates noise that can confuse retrieval.

The fix: Be intentional about what goes in. Ask: “Would a user asking questions in this domain need this information?” If not, leave it out.

Level Up
Here’s a practical challenge.

Scenario: You’re building a knowledge base for a cooking website. You have:

- 500 recipes
- 50 technique guides (“How to julienne vegetables”)
- 30 equipment reviews
- 20 ingredient deep-dives
Your task:
- How would you chunk recipes differently than technique guides?
- What metadata would you attach to each content type?
- How would you handle a recipe that references a technique guide? (Should they be linked somehow?)
- What queries would this knowledge base need to handle well?

