Lesson 3: Embeddings: How AI Understands Meaning

Beyond Keyword Matching

Quick question: If you’re searching for information about “automobiles,” should a document about “cars” show up in your results? Obviously, yes. But traditional keyword search wouldn’t find it. The word “automobiles” doesn’t appear in a document that only uses “cars.” You’d miss relevant information simply because the exact words don’t match.

This is the problem embeddings solve. They let computers understand that “car” and “automobile” mean the same thing, that “happy” is similar to “joyful,” and that a question about “vacation time” is related to a document about “paid time off.”

Embeddings are the secret sauce that makes RAG’s retrieval step actually useful. Let’s understand how they work.

Core Concepts

Traditional search engines (and simple database queries) work by matching keywords. They look for documents containing the exact words you typed. This approach has serious limitations:
  • Synonyms: “Car” vs. “automobile” vs. “vehicle”
  • Phrasing variations: “How do I reset my password?” vs. “Password reset instructions”
  • Conceptual relationships: A document about “employee burnout” is relevant to a search about “workplace stress,” but shares few keywords
Keyword search is like a very literal-minded assistant who only recognizes exact matches. Useful, but frustrating when you need flexibility.

What Are Embeddings?

An embedding is a way of representing text as a list of numbers (called a vector) that captures its meaning. Here’s the key insight: similar meanings end up as similar numbers.

Imagine you could place every word, sentence, or document somewhere on a map. Words with similar meanings would be close together on the map. “Happy” would be near “joyful” and “cheerful.” “Sad” would be in a different area, near “unhappy” and “melancholy.” Embeddings create this map, but instead of two dimensions (like a paper map), they use hundreds or thousands of dimensions. Each dimension captures some aspect of meaning.

Numbers That Capture Meaning

Let’s make this concrete. When you convert a sentence to an embedding, you get something like:
"The cat sat on the mat" → [0.23, -0.45, 0.12, 0.87, -0.33, ... ] (hundreds more numbers)
These numbers aren’t random. They’re carefully calculated so that:
  • “The cat sat on the mat” and “A feline rested on the rug” produce similar number sequences
  • “The stock market crashed yesterday” produces a very different sequence
The magic is that we can mathematically compare these number sequences to measure how similar two pieces of text are. Sequences that are close together represent similar meanings.
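To see how that comparison works, here’s a tiny sketch in Python. The five-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions and come from a trained model. Cosine similarity is one common way to compare them:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: near 1.0 = very similar direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 5-dimensional "embeddings" (illustrative values, not from a real model)
cat_on_mat  = [0.23, -0.45, 0.12, 0.87, -0.33]
feline_rug  = [0.25, -0.40, 0.15, 0.82, -0.30]   # similar meaning → similar numbers
stock_crash = [-0.70, 0.55, -0.20, -0.10, 0.65]  # different meaning

print(cosine_similarity(cat_on_mat, feline_rug))   # close to 1.0
print(cosine_similarity(cat_on_mat, stock_crash))  # much lower
```

A score near 1.0 means the vectors point in nearly the same direction (similar meaning); scores near zero or below indicate unrelated meanings.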

Similarity in Embedding Space

When we say embeddings are “close together” or “similar,” we’re talking about mathematical distance in embedding space. Think of it like coordinates on a map: if you’re at coordinates (3, 4) and your friend is at (3.1, 4.2), you’re very close together. If another person is at (100, 200), they’re far away.

Embeddings work the same way, just with many more dimensions. We can calculate the “distance” between any two embeddings and use that as a measure of semantic similarity.
  • Small distance = similar meaning
  • Large distance = different meaning
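The map analogy translates directly into code. This sketch computes the straight-line (Euclidean) distance for the example coordinates above; the same formula works unchanged with hundreds of dimensions:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two points, in any number of dimensions
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

you    = (3, 4)
friend = (3.1, 4.2)
other  = (100, 200)

print(euclidean_distance(you, friend))  # ~0.22 → close together
print(euclidean_distance(you, other))   # ~218.7 → far away
```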

How Embeddings Power RAG

Now you can see why embeddings are essential for RAG:
  1. Index time: Every chunk of your knowledge base gets converted to an embedding and stored
  2. Query time: When a user asks a question, that question gets converted to an embedding
  3. Search: The system finds stored embeddings that are closest to the question embedding
  4. Retrieval: The text chunks corresponding to those close embeddings get retrieved
This is semantic search: finding information based on meaning, not just keywords. So when someone asks “What’s the policy on working from home?” the system can find a document titled “Remote Work Guidelines” even though the words don’t match, because the embeddings are semantically similar.
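The four steps above can be sketched end to end. This is a toy illustration: the `embed` function just looks up hard-coded vectors standing in for a real embedding model, and the document titles and numbers are invented for the example:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Stand-in for a real embedding model: hard-coded toy vectors keyed by text.
# In practice you would call an embedding model or API here instead.
TOY_VECTORS = {
    "Remote Work Guidelines":                  [0.90, 0.10, 0.40],
    "Quarterly Financial Report":              [0.10, 0.90, 0.20],
    "What's the policy on working from home?": [0.85, 0.15, 0.35],
}

def embed(text):
    return TOY_VECTORS[text]

# 1. Index time: embed every chunk and store it
index = {doc: embed(doc)
         for doc in ["Remote Work Guidelines", "Quarterly Financial Report"]}

# 2-4. Query time: embed the question, find the closest stored embedding, retrieve
query_vec = embed("What's the policy on working from home?")
best = max(index, key=lambda doc: cosine_similarity(index[doc], query_vec))
print(best)  # Remote Work Guidelines — retrieved despite sharing no keywords
```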

The Embedding Model

Embeddings don’t appear magically. They’re created by specialized AI models trained specifically to understand meaning and produce useful representations. These embedding models have learned from massive amounts of text to understand:
  • That synonyms should have similar embeddings
  • That questions and their answers should be related
  • That context matters (the word “bank” means different things in “river bank” vs. “bank account”)
Different embedding models exist, each with its own strengths. Some are better at short text, others at long documents. Some are optimized for specific domains like legal or medical text. Platforms like Gloo provide access to embedding models as part of their AI offerings, with a focus on producing reliable, values-aligned results.

Dimensions and Trade-offs

Embedding vectors typically have hundreds to thousands of dimensions. Common sizes are 384, 768, 1024, or 1536 dimensions. More dimensions can capture more nuanced meanings but require more storage and computation. It’s a trade-off:
  • More dimensions: More nuanced understanding, but larger storage and slower search
  • Fewer dimensions: Faster and smaller, but might miss subtle differences in meaning
For most RAG applications, modern embedding models strike a good balance. You don’t need to obsess over this unless you’re optimizing for very specific requirements.
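To make the storage side of the trade-off concrete, here’s some back-of-envelope arithmetic, assuming embeddings are stored as 32-bit floats (4 bytes per dimension), a common default:

```python
# Rough storage cost for a set of embeddings, assuming float32 (4 bytes/value)
def embedding_storage_mb(num_chunks, dimensions, bytes_per_value=4):
    return num_chunks * dimensions * bytes_per_value / (1024 ** 2)

# 100,000 chunks at two common embedding sizes:
print(embedding_storage_mb(100_000, 384))   # ~146 MB
print(embedding_storage_mb(100_000, 1536))  # ~586 MB
```

Quadrupling the dimensions quadruples the storage (and roughly the distance-computation cost), which is why smaller models remain attractive when speed and cost matter.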

Try It Yourself

Exercise 1: Think in Similarities

Without any technology, just using your intuition, rate how similar these pairs are on a scale of 1-10 (1 = completely different, 10 = almost identical):
  1. “The dog chased the ball” vs. “The puppy ran after the toy”
  2. “The dog chased the ball” vs. “Stock prices rose sharply”
  3. “How do I return an item?” vs. “What’s your refund policy?”
  4. “Python programming tutorial” vs. “Learn to code in Python”
Embedding models are essentially trained to match human intuitions like yours. The pairs you rated as similar would have close embeddings; dissimilar pairs would be far apart.

Exercise 2: Synonym Exploration

Think of a topic and list five different ways to phrase a question about it.

Topic: Getting time off from work
  1. “How do I request vacation days?”
  2. “What’s the PTO policy?”
  3. “Can I take time off next month?”
  4. “How do I schedule annual leave?”
  5. “What’s the process for booking holiday?”
A good embedding model would recognize all of these as semantically related, even though they share few keywords. This is the power that enables flexible retrieval.

Exercise 3: When Keywords Fail

Think of a scenario where keyword search would completely fail.

Example: You search your company wiki for “getting reimbursed for client dinners,” but the relevant document is titled “Expense Policy for Business Entertainment.”

Can you think of three similar scenarios in your own work or life where the words you’d naturally use don’t match the words in the source documents?

Common Pitfalls

Pitfall 1: Assuming Embeddings Are Perfect

Embeddings are powerful but not infallible. They can miss nuances, misunderstand context, or produce unexpected similarity scores. Edge cases exist. The fix: Test your retrieval with real queries. Don’t assume it will always find the right documents; verify it does.

Pitfall 2: Ignoring Domain Mismatch

An embedding model trained on general text might not perform well on highly specialized content. Legal jargon, medical terminology, or industry-specific language can confuse general-purpose models. The fix: For specialized domains, consider embedding models tuned for that domain, or test thoroughly with domain-specific queries.

Pitfall 3: Embedding the Wrong Units

What you embed matters. Embedding entire long documents produces less useful results than embedding appropriately sized chunks. The embedding represents the “average meaning” of the text; too much text and that meaning becomes diluted. The fix: Chunk your documents thoughtfully (we’ll cover this in Lesson 4) so each embedding represents a coherent, focused piece of information.

Pitfall 4: Forgetting That Embeddings Need Updates

If your knowledge base changes, the embeddings need to be updated too. New documents need to be embedded; deleted documents need their embeddings removed. The fix: Build embedding updates into your content management process. Stale embeddings mean stale search results.
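One simple way to catch stale embeddings is to store a hash of the text each embedding was computed from, then compare hashes whenever content changes. This is a minimal sketch of that idea; the function names and data shapes are invented for illustration:

```python
import hashlib

def content_hash(text):
    # Fingerprint of the exact text an embedding was computed from
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_needing_update(current_chunks, stored_hashes):
    """current_chunks: {chunk_id: text}; stored_hashes: {chunk_id: hash}."""
    stale = [cid for cid, text in current_chunks.items()
             if stored_hashes.get(cid) != content_hash(text)]
    deleted = [cid for cid in stored_hashes if cid not in current_chunks]
    return stale, deleted

docs = {"policy-1": "Employees may work remotely two days a week."}
stored = {"policy-1": content_hash("Employees may work remotely one day a week."),
          "old-memo": "abc123"}

stale, deleted = chunks_needing_update(docs, stored)
print(stale)    # ['policy-1'] — text changed, so re-embed it
print(deleted)  # ['old-memo'] — remove its embedding from the store
```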

Level Up

Here’s a thought experiment to deepen your understanding.

Scenario: You’re building a RAG system for a multilingual customer support team. Customers write in English, Spanish, and French, but your documentation is only in English.

Questions to consider:
  1. How might embeddings help here? (Hint: Some embedding models are multilingual, meaning “car” in English and “coche” in Spanish produce similar embeddings.)
  2. What challenges might still exist even with multilingual embeddings?
  3. How would you test whether the system works well across languages?
This exercise helps you think about embeddings not just as a technical feature but as a tool with real capabilities and limitations.

Key Takeaway

Embeddings convert text into numbers that capture meaning, enabling semantic search that goes far beyond keyword matching. Similar meanings produce similar embeddings, allowing RAG systems to find relevant information even when the exact words don’t match. This is what makes RAG retrieval truly intelligent: it understands what you mean, not just what you typed.

What’s Next

You now understand how embeddings enable smart retrieval. But before retrieval can happen, you need something to search through: a knowledge base. In Lesson 4: Building a Knowledge Base, we’ll explore how to prepare your information for RAG. You’ll learn about chunking (breaking documents into searchable pieces), why chunk size matters, and how to turn raw content into a well-organized knowledge base.