Lesson 3: Embeddings: How AI Understands Meaning
Beyond Keyword Matching
Quick question: If you’re searching for information about “automobiles,” should a document about “cars” show up in your results? Obviously, yes. But traditional keyword search wouldn’t find it. The word “automobiles” doesn’t appear in a document that only uses “cars.” You’d miss relevant information simply because the exact words don’t match.

This is the problem embeddings solve. They let computers understand that “car” and “automobile” mean the same thing, that “happy” is similar to “joyful,” and that a question about “vacation time” is related to a document about “paid time off.” Embeddings are the secret sauce that makes RAG’s retrieval step actually useful. Let’s understand how they work.

Core Concepts
The Limits of Keyword Search
Traditional search engines (and simple database queries) work by matching keywords. They look for documents containing the exact words you typed. This approach has serious limitations:
- Synonyms: “Car” vs. “automobile” vs. “vehicle”
- Phrasing variations: “How do I reset my password?” vs. “Password reset instructions”
- Conceptual relationships: A document about “employee burnout” is relevant to a search about “workplace stress,” but shares few keywords
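To see how brittle exact matching is, here is a minimal sketch of keyword scoring. The `keyword_score` function and the sample document are invented for illustration; real search engines add stemming and ranking, but the core failure mode is the same.

```python
def keyword_score(query: str, document: str) -> int:
    """Count how many distinct query words appear verbatim in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words)

doc = "our cars come with a five year warranty"

# "warranty" matches, but "automobiles" never will -- the doc says "cars".
print(keyword_score("automobiles warranty", doc))  # 1
print(keyword_score("automobiles", doc))           # 0
```

A score of zero means the document is invisible to this search, even though it is exactly what the user wanted.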
What Are Embeddings?
An embedding is a way of representing text as a list of numbers (called a vector) that captures its meaning. Here’s the key insight: similar meanings end up as similar numbers.

Imagine you could place every word, sentence, or document somewhere on a map. Words with similar meanings would be close together on the map. “Happy” would be near “joyful” and “cheerful.” “Sad” would be in a different area, near “unhappy” and “melancholy.” Embeddings create this map, but instead of two dimensions (like a paper map), they use hundreds or thousands of dimensions. Each dimension captures some aspect of meaning.

Numbers That Capture Meaning
Let’s make this concrete. When you convert a sentence to an embedding, you get something like:
- “The cat sat on the mat” and “A feline rested on the rug” produce similar number sequences
- “The stock market crashed yesterday” produces a very different sequence
Similarity in Embedding Space
When we say embeddings are “close together” or “similar,” we’re talking about mathematical distance in embedding space. Think of it like coordinates on a map: if you’re at coordinates (3, 4) and your friend is at (3.1, 4.2), you’re very close together. If another person is at (100, 200), they’re far away. Embeddings work the same way, just with many more dimensions. We can calculate the “distance” between any two embeddings and use that as a measure of semantic similarity.
- Small distance = similar meaning
- Large distance = different meaning
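One common way to measure this closeness is cosine similarity, which compares the direction of two vectors (1.0 means pointing the same way). The tiny 3-dimensional vectors below are made up for illustration; a real embedding model would produce hundreds of dimensions.

```python
import math

# Toy "embeddings" -- invented numbers, not output from a real model.
happy  = [0.90, 0.80, 0.10]
joyful = [0.85, 0.75, 0.15]
crash  = [-0.20, 0.10, 0.90]  # "The stock market crashed"

def cosine_similarity(a, b):
    """1.0 = same direction (similar meaning), near 0 or negative = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(happy, joyful))  # close to 1.0
print(cosine_similarity(happy, crash))   # much lower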
How Embeddings Power RAG
Now you can see why embeddings are essential for RAG:
- Index time: Every chunk of your knowledge base gets converted to an embedding and stored
- Query time: When a user asks a question, that question gets converted to an embedding
- Search: The system finds stored embeddings that are closest to the question embedding
- Retrieval: The text chunks corresponding to those close embeddings get retrieved
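The four steps above can be sketched as a brute-force search over a toy index. The chunks and their vectors are invented for illustration (a real system would call an embedding model at index time and query time, and use a vector database instead of a Python list).

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Index time: each chunk is stored alongside its (made-up) embedding.
index = [
    ("Our PTO policy grants 20 vacation days.",  [0.9, 0.1, 0.2]),
    ("Reset your password from the login page.", [0.1, 0.9, 0.1]),
    ("Expense reports are due monthly.",         [0.2, 0.2, 0.9]),
]

def retrieve(query_embedding, top_k=1):
    # Search: rank every stored embedding by closeness to the query embedding.
    ranked = sorted(index,
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    # Retrieval: return the text of the closest chunks.
    return [text for text, _ in ranked[:top_k]]

# Query time: "How much vacation do I get?" -- pretend its embedding
# landed near the PTO chunk's vector.
print(retrieve([0.85, 0.15, 0.25]))
```

Note that the winning chunk shares no keywords with a query like “How much vacation do I get?”; the match happens entirely in embedding space.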
The Embedding Model
Embeddings don’t appear magically. They’re created by specialized AI models trained specifically to understand meaning and produce useful representations. These embedding models have learned from massive amounts of text to understand:
- That synonyms should have similar embeddings
- That questions and their answers should be related
- That context matters (the word “bank” means different things in “river bank” vs. “bank account”)
Dimensions and Trade-offs
Embedding vectors typically have hundreds to thousands of dimensions. Common sizes are 384, 768, 1024, or 1536 dimensions. More dimensions can capture more nuanced meanings but require more storage and computation. It’s a trade-off:
- More dimensions: More nuanced understanding, but larger storage and slower search
- Fewer dimensions: Faster and smaller, but might miss subtle differences in meaning
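A quick back-of-envelope calculation shows why this trade-off matters at scale. The chunk count below and the assumption of 4-byte floats per dimension are illustrative; actual storage depends on your vector store and precision.

```python
# Rough storage estimate: one embedding per chunk, 4 bytes per dimension.
def storage_mb(num_chunks: int, dimensions: int, bytes_per_float: int = 4) -> float:
    return num_chunks * dimensions * bytes_per_float / 1_000_000

# A hypothetical knowledge base of 100,000 chunks:
for dims in (384, 768, 1536):
    print(f"{dims} dims: {storage_mb(100_000, dims):.1f} MB")  # 153.6 / 307.2 / 614.4
```

Doubling the dimensions doubles the storage, and similarity search cost grows with it, which is why smaller models are often “good enough.”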
Try It Yourself
Exercise 1: Think in Similarities
Without any technology, just using your intuition, rate how similar these pairs are on a scale of 1–10 (1 = completely different, 10 = almost identical):
- “The dog chased the ball” vs. “The puppy ran after the toy”
- “The dog chased the ball” vs. “Stock prices rose sharply”
- “How do I return an item?” vs. “What’s your refund policy?”
- “Python programming tutorial” vs. “Learn to code in Python”
Exercise 2: Synonym Exploration
Think of a topic and list five different ways to phrase a question about it. Topic: Getting time off from work
- “How do I request vacation days?”
- “What’s the PTO policy?”
- “Can I take time off next month?”
- “How do I schedule annual leave?”
- “What’s the process for booking holiday?”
Exercise 3: When Keywords Fail
Think of a scenario where keyword search would completely fail. Example: You search your company wiki for “getting reimbursed for client dinners,” but the relevant document is titled “Expense Policy for Business Entertainment.” Can you think of three similar scenarios in your own work or life where the words you’d naturally use don’t match the words in the source documents?

Common Pitfalls
Pitfall 1: Assuming Embeddings Are Perfect
Embeddings are powerful but not infallible. They can miss nuances, misunderstand context, or produce unexpected similarity scores. Edge cases exist. The fix: Test your retrieval with real queries. Don’t assume it will always find the right documents; verify that it does.

Pitfall 2: Ignoring Domain Mismatch
An embedding model trained on general text might not perform well on highly specialized content. Legal jargon, medical terminology, or industry-specific language can confuse general-purpose models. The fix: For specialized domains, consider embedding models tuned for that domain, or test thoroughly with domain-specific queries.

Pitfall 3: Embedding the Wrong Units
What you embed matters. Embedding entire long documents produces less useful results than embedding appropriately sized chunks. The embedding represents the “average meaning” of the text; too much text and that meaning becomes diluted. The fix: Chunk your documents thoughtfully (we’ll cover this in Lesson 4) so each embedding represents a coherent, focused piece of information.

Pitfall 4: Forgetting That Embeddings Need Updates
If your knowledge base changes, the embeddings need to be updated too. New documents need to be embedded; deleted documents need their embeddings removed. The fix: Build embedding updates into your content management process. Stale embeddings mean stale search results.

Level Up
Here’s a thought experiment to deepen your understanding. Scenario: You’re building a RAG system for a multilingual customer support team. Customers write in English, Spanish, and French, but your documentation is only in English. Questions to consider:
- How might embeddings help here? (Hint: Some embedding models are multilingual, meaning “car” in English and “coche” in Spanish produce similar embeddings.)
- What challenges might still exist even with multilingual embeddings?
- How would you test whether the system works well across languages?

