Lesson 4: Building a Knowledge Base

From Messy Documents to Searchable Knowledge

You’ve learned how RAG works and how embeddings capture meaning. But here’s a practical question: what do you actually search through?

The answer is your knowledge base: the collection of information that RAG retrieves from. And here’s the thing: how you prepare that knowledge base dramatically affects how well RAG works.

Think of it like organizing a library. You could throw all the books in a giant pile (technically a library, but good luck finding anything). Or you could organize them thoughtfully, with clear sections and a system that helps you find what you need. Building a good knowledge base is that organizational work for RAG.

Core Concepts

What Can Be a Knowledge Source?

Almost any text-based information can become part of a RAG knowledge base:

Documents and files:
  • PDFs, Word documents, text files
  • Help articles and documentation
  • Policy manuals and handbooks
  • Research papers and reports
Web content:
  • Website pages
  • Blog posts
  • FAQ sections
  • Knowledge base articles
Structured data:
  • Database records (converted to text)
  • Spreadsheets (with context added)
  • Product catalogs
  • Customer information
Conversations and communications:
  • Email archives
  • Chat transcripts
  • Meeting notes
  • Support tickets
The key is that the information needs to be converted to text that can be embedded and searched. Images, videos, and audio generally need to be transcribed or described in text form.

Why We Can’t Just Embed Entire Documents

Here’s an important concept: you typically don’t embed whole documents. Instead, you break them into smaller pieces called chunks. Why? Several reasons:
  1. Embedding quality degrades with length. An embedding tries to capture the “meaning” of a piece of text. A 50-page document covers many topics, so its embedding becomes a vague average that doesn’t strongly match any specific query.
  2. Context window limits. When retrieved text gets added to the AI prompt, it counts against the context window. If you retrieve entire long documents, you’ll quickly run out of space.
  3. Precision matters. If someone asks about vacation policy, you want to retrieve the specific section about vacation, not an entire 100-page employee handbook.
Chunking solves these problems by creating focused, manageable pieces of information.

The Goldilocks Problem: Chunk Size

Choosing the right chunk size is one of the most important decisions in building a knowledge base. And like Goldilocks, you’re looking for something that’s “just right.”

Chunks that are too small:
  • Lose important context
  • A sentence about “15 days” makes no sense without knowing it’s about vacation policy
  • Lead to fragmented, confusing retrieval
Chunks that are too big:
  • Dilute the embedding (meaning gets averaged out)
  • Include irrelevant information alongside relevant content
  • Use up context window space inefficiently
Chunks that are just right:
  • Contain complete, self-contained ideas
  • Are large enough to be meaningful but small enough to be specific
  • Typically range from 200-1000 tokens (roughly 150-750 words)
The ideal size depends on your content. Dense technical documentation might need smaller chunks. Conversational content might work better with larger ones.
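The word counts above come from a rough rule of thumb: English text averages about 0.75 words per token, though the exact ratio varies by tokenizer and language. A quick, hedged estimate (a real system would use the embedding model’s own tokenizer):

```python
def rough_token_count(text):
    """Rough heuristic: English averages ~0.75 words per token.
    This is an estimate only; use your model's tokenizer for real counts."""
    return round(len(text.split()) / 0.75)
```

By this estimate, 150 words is about 200 tokens and 750 words is about 1000 tokens, matching the range above.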

Chunking Strategies

There are several approaches to breaking documents into chunks:

Fixed-size chunking: Split text every N tokens or characters. Simple and predictable, but might cut sentences or paragraphs in awkward places.

Semantic chunking: Split at natural boundaries like paragraphs, sections, or topic changes. Preserves meaning better but requires more sophisticated processing.

Overlapping chunks: Include some overlap between adjacent chunks so context isn’t lost at boundaries. If chunk 1 ends mid-thought, chunk 2 might include the last few sentences of chunk 1.

Hierarchical chunking: Create chunks at multiple levels (document, section, paragraph) and retrieve at the appropriate level based on the query.

Most RAG implementations use some combination of these approaches, often splitting at paragraph boundaries with slight overlap.
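To make the first and third strategies concrete, here’s a minimal sketch of fixed-size chunking with overlap. It measures size in characters for simplicity; a production system would typically count tokens instead:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Fixed-size chunking with overlap (sizes in characters for simplicity)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final chunk reached the end of the text
        start += chunk_size - overlap  # step forward, repeating `overlap` chars
    return chunks
```

With `overlap=50`, the last 50 characters of each chunk reappear at the start of the next one, so a sentence cut at a boundary still appears whole in one of the two chunks.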

Adding Metadata

Raw text chunks are useful, but chunks with metadata are even better. Metadata is additional information attached to each chunk:

Source information:
  • Document title
  • Section heading
  • Page number
  • URL
Temporal information:
  • Creation date
  • Last updated
  • Version number
Categorical information:
  • Document type (policy, tutorial, FAQ)
  • Department or team
  • Product or feature area
Metadata enables filtering (only search HR documents), provides context (this chunk is from page 15 of the Employee Handbook), and helps users verify sources.
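As an illustration, a chunk with metadata might be stored as a plain dictionary, and filtering becomes a simple match on metadata fields. The field names here are made up for the example, not a standard schema:

```python
# Illustrative chunk; field names are hypothetical, not a standard schema.
chunk = {
    "text": "Full-time employees accrue 15 vacation days per year.",
    "metadata": {
        "title": "Employee Handbook",
        "section": "Vacation Policy",
        "page": 15,
        "doc_type": "policy",
        "department": "HR",
        "last_updated": "2024-03-01",
    },
}

def filter_chunks(chunks, **criteria):
    """Keep only chunks whose metadata matches every given criterion."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in criteria.items())]
```

For example, `filter_chunks(all_chunks, department="HR")` restricts a search to HR documents before any similarity comparison happens.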

The Indexing Process

Once you have your chunks and metadata, you need to create the searchable index. The process typically looks like:
  1. Collect: Gather all source documents
  2. Process: Extract text, handle formatting, clean up errors
  3. Chunk: Break into appropriately sized pieces
  4. Enrich: Add metadata to each chunk
  5. Embed: Convert each chunk to an embedding vector
  6. Store: Save the chunks, embeddings, and metadata in a vector database
This process needs to happen initially when you set up the system, and then again whenever your source content changes. Platforms like Gloo AI Studio handle much of this complexity for you, allowing you to upload content and have it automatically processed into a searchable knowledge base with appropriate safety considerations.
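The chunk → enrich → embed → store steps can be sketched end to end. This is a toy illustration: the `embed` function below is a stand-in (a character-frequency vector, not a semantic embedding), and the “store” is just an in-memory list rather than a vector database:

```python
def embed(text):
    """Placeholder: a real system would call an embedding model here.
    This toy stand-in builds a character-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def index_documents(docs, chunker, metadata_for):
    """Chunk each document, enrich with metadata, embed, and store."""
    index = []  # stand-in for a vector database
    for doc in docs:
        for i, piece in enumerate(chunker(doc["text"])):
            index.append({
                "text": piece,
                "embedding": embed(piece),
                "metadata": {**metadata_for(doc), "chunk_index": i},
            })
    return index
```

Rerunning `index_documents` whenever source content changes is the simplest update strategy; real systems often re-index only the changed documents.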

Quality In, Quality Out

Here’s a crucial principle: your RAG system is only as good as your knowledge base. If your source documents are:
  • Outdated, your answers will be wrong
  • Poorly written, your answers will be confusing
  • Contradictory, your answers will be inconsistent
  • Missing important information, your answers will have gaps
RAG doesn’t magically improve bad content. It just makes that content retrievable and uses it to generate responses. Garbage in, garbage out still applies.

Try It Yourself

Exercise 1: Identify Your Knowledge Sources

Think about a real use case (your job, a project, a personal interest area). List:
  1. What documents or information would you want a RAG system to search?
  2. Where does that information currently live? (Files, websites, databases, etc.)
  3. What format is it in? What processing might be needed?

Exercise 2: Chunking Practice

Take a long document you have access to (a report, an article, a manual). Read through and identify:
  1. Natural break points (section headers, topic changes, paragraph boundaries)
  2. How would you chunk this document?
  3. Are there places where context would be lost if you split there?
This exercise builds intuition for how chunking affects information retrieval.

Exercise 3: Metadata Design

For a RAG system serving a customer support team, what metadata would be valuable for each chunk? Consider:
  1. What filtering might users want? (product, topic, date)
  2. What source information should be displayed with answers?
  3. What metadata would help identify outdated content?
Design a metadata schema with 5-7 fields that would make the system more useful.

Common Pitfalls

Pitfall 1: Inconsistent Chunk Sizes

Mixing very small and very large chunks in the same knowledge base leads to unpredictable retrieval. Large chunks dominate some queries; small chunks get lost in others. The fix: Aim for consistent chunk sizes throughout your knowledge base. If some documents naturally have different structures, consider separate indices or normalization.

Pitfall 2: Losing Context at Boundaries

When you chunk at arbitrary points, important context can be lost. “The policy is 15 days” means nothing without knowing what policy and what the 15 days refers to. The fix: Use overlap, include section headers in each chunk, or add contextual prefixes that remind the chunk what document and section it came from.
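A contextual prefix can be as simple as prepending the document title and section heading to each chunk before embedding it. A minimal sketch:

```python
def add_context_prefix(chunk_text, doc_title, section):
    """Prepend document and section context so a chunk stands on its own."""
    return f"[{doc_title} > {section}] {chunk_text}"
```

With this, the fragment “The policy is 15 days.” becomes “[Employee Handbook > Vacation Policy] The policy is 15 days.”, which both embeds more distinctively and reads sensibly when shown to the user.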

Pitfall 3: Neglecting Maintenance

Knowledge bases aren’t “set and forget.” Information changes, documents get updated, new content gets created. A knowledge base that isn’t maintained becomes unreliable. The fix: Build processes for regular updates. Track when content was last indexed. Flag stale content for review.

Pitfall 4: Including Everything

The temptation is to include every document you can find. But more isn’t always better. Irrelevant content creates noise that can confuse retrieval. The fix: Be intentional about what goes in. Ask: “Would a user asking questions in this domain need this information?” If not, leave it out.

Level Up

Here’s a practical challenge.

Scenario: You’re building a knowledge base for a cooking website. You have:
  • 500 recipes
  • 50 technique guides (“How to julienne vegetables”)
  • 30 equipment reviews
  • 20 ingredient deep-dives
Your task: Design the knowledge base structure.
  1. How would you chunk recipes differently than technique guides?
  2. What metadata would you attach to each content type?
  3. How would you handle a recipe that references a technique guide? (Should they be linked somehow?)
  4. What queries would this knowledge base need to handle well?
This exercise helps you think about knowledge base design for heterogeneous content.

Key Takeaway

A RAG system is only as good as its knowledge base. Preparing information for RAG means chunking documents into appropriately sized pieces, enriching them with useful metadata, and creating embeddings that enable semantic search. The goal is to create a well-organized collection of information where each chunk is self-contained, properly sized, and easy to find when relevant.

What’s Next

You’ve learned how to build a knowledge base. Now let’s explore what happens when a user actually asks a question. In Lesson 5: The Retrieval Step, we’ll dive deep into how queries get matched to relevant chunks. You’ll learn about similarity search, top-K retrieval, and strategies for finding the best information to feed into the generation step.