What LLMs actually are (and aren't)
A large language model is, at its core, a very sophisticated next-token predictor. Given a sequence of tokens — subword units that roughly correspond to syllables or short words — an LLM predicts the probability distribution of what comes next. Stack billions of these predictions together, sample from those distributions, and you get fluent, coherent text.
But here's the thing that trips up a lot of engineers building on top of LLMs for the first time: the model doesn't "know" things the way a database knows things. It has compressed statistical knowledge from its training data into billions of floating-point weights. That compression is lossy. The model can't cite where it learned something. It can't tell you if it's confidently right or confidently wrong. And critically — it has a hard training cutoff.
The hallucination problem
Hallucination is the term the field uses when an LLM generates text that is fluent and confident but factually wrong. It's not a bug in the traditional sense — it's an emergent property of how these models work. Because the model is always predicting the most plausible next token, it will sometimes predict tokens that form factually incorrect but statistically plausible sentences.
From an engineering standpoint, hallucinations are a production problem. They're unpredictable, hard to detect at runtime, and deeply erode user trust when they surface. A system that's right 95% of the time but confidently wrong 5% of the time is a liability in most domains.
What is RAG and why it matters
Retrieval-Augmented Generation (RAG) is an architectural pattern where you retrieve relevant documents from an external knowledge base at query time and inject them into the model's context window before generating a response. The model reasons over the retrieved content rather than relying solely on its parametric (baked-in) memory.
The insight is simple but powerful: instead of hoping the model "knows" the answer, you tell it the answer and ask it to reason about it. You're separating the knowledge store (your vector database) from the reasoning engine (the LLM) — two components that can be maintained, updated, and scaled independently.
Building a RAG pipeline from scratch
A production RAG system has two distinct phases: an offline indexing phase that runs when you add or update documents, and an online retrieval phase that runs on every user query. Let's walk through both.
1
Document ingestion
Load raw documents — PDFs, markdown files, web pages, database records. Extract clean text, removing boilerplate and formatting artifacts.
2
Chunking
Split documents into overlapping chunks (typically 256–1024 tokens). Chunk size is one of the most impactful parameters in your entire system.
3
Embedding
Pass each chunk through an embedding model to produce a dense vector (typically 768–1536 dimensions) that encodes the chunk's semantic meaning.
4
Vector storage
Store the vectors in a vector database (Pinecone, Weaviate, pgvector, Chroma). Store the original text alongside for retrieval.
5
Query embedding
At query time, embed the user's question using the same embedding model used during indexing. Critically — the same model.
6
Similarity search
Find the top-k chunks whose vectors are closest to the query vector. Cosine similarity or dot product are the standard distance metrics.
7
Context injection
Build the prompt by injecting retrieved chunks into the system or user message, along with instructions about how to use them.
8
Generation
Call the LLM with the augmented prompt. The model reads the retrieved context and generates a grounded response.
Chunking, embeddings, and vector search
Chunking is where most RAG systems either win or lose. Chunk too small and you lose context — the retrieved passage doesn't have enough surrounding information to be useful. Chunk too large and you dilute relevance — the retrieved passage contains too much irrelevant content that confuses the model.
Choosing an embedding model
The embedding model encodes semantic similarity. Two chunks that mean similar things should have vectors that are close in the vector space. Your retrieval quality is bounded by how well your embedding model captures meaning in your domain.
Advanced RAG patterns
Naive RAG — embed, retrieve, generate — works well as a starting point. But in production, you'll quickly hit its limits. Here are the patterns that actually move the needle.
Hybrid search
Pure semantic search misses keyword matches. A user asking for "ISO 27001 clause 8.2" wants exact keyword matching, not semantic similarity. Hybrid search combines dense vector retrieval with sparse BM25/keyword retrieval, then uses Reciprocal Rank Fusion (RRF) to merge the ranked lists. This reliably outperforms either approach alone.
Re-ranking
Vector similarity is a fast but approximate measure of relevance. After retrieving your top-20 candidates, pass them through a cross-encoder re-ranker (Cohere Rerank, BGE-Reranker) which scores each candidate against the query with more depth. Then pass only the top-5 re-ranked results to the LLM. This two-stage retrieval pattern significantly improves the precision of what the model sees.
HyDE (Hypothetical Document Embedding)
User queries are often terse and conversational. The documents they're searching are often formal and dense. This mismatch hurts retrieval. HyDE asks the LLM to generate a hypothetical ideal answer to the query, then embeds that hypothetical answer rather than the raw query. The hypothetical answer is stylistically closer to the documents you're searching, so the vector similarity is more reliable.
# HyDE implementation sketch def hyde_query(user_question: str) -> str: # Generate a hypothetical ideal document hypothetical = llm.complete( f"Write a short passage that directly answers: {user_question}" ) # Embed the hypothetical, not the question query_vector = embedding_model.embed(hypothetical) chunks = vector_db.similarity_search(query_vector, k=5) # ... continue with normal RAG generation
Query decomposition
Complex questions are hard to answer with a single retrieval pass. "What are the key differences between our Q3 and Q4 revenue and what drove the change?" needs at least two separate retrievals — one for Q3 data, one for Q4 data — before the model can synthesise a comparison. Query decomposition breaks a complex question into sub-questions, retrieves for each, then synthesises a final answer. This is a core technique in agentic RAG systems.
Contextual compression
Retrieved chunks often contain relevant and irrelevant content mixed together. Before passing chunks to the LLM, you can run a fast extraction step that keeps only the sentences relevant to the query. This lets you retrieve larger chunks (better context)
When RAG isn't enough
RAG is powerful but not a universal answer. Knowing when not to use it — or when to layer it with other techniques — is a mark of engineering maturity.
Evaluating your RAG system
You can't improve what you can't measure. Evaluation is the most underinvested area in most RAG projects — teams ship and hope, rather than shipping and measuring. Here's the evaluation stack you should have.
Retrieval metrics
Evaluate retrieval independently from generation. Build a test set of (question, expected source document) pairs. Measure hit rate (is the right document in the top-k?), MRR (Mean Reciprocal Rank), and precision@k. A retrieval problem is different from a generation problem and needs to be debugged separately.
Generation metrics (RAG triad)
1
Faithfulness
Does the generated answer only make claims that are supported by the retrieved context? High faithfulness = low hallucination rate.
2
Answer relevance
Does the answer actually address what the user asked? A faithful answer that misses the point is still a bad answer.
3
Context relevance
Were the retrieved chunks actually relevant to the query? This tells you whether your retrieval step is doing its job.