Architecting a RAG System for Enterprise Documentation

We've been building RAG pipelines for a while now. Here's what we've learned, and what we wish someone had told us before we made the mistakes ourselves.

Why RAG, not fine-tuning

Fine-tuning sounds like the obvious answer when an LLM doesn't know your internal docs. It isn't. Fine-tuning is expensive, slow to update, and doesn't actually solve the problem. The model absorbs patterns from your data, not facts. When documents change (and they always do), you're stuck retraining.

RAG keeps the LLM frozen. You retrieve relevant context at query time and inject it into the prompt. Fresh documents, no retraining, and you keep full control over what the model sees.

Ingestion

Before any of the interesting stuff happens, you have to get your documents into a usable state. For most enterprise setups this means pulling from SharePoint, Confluence, S3, internal wikis, and a graveyard of PDFs no one has touched in three years.

The extraction step is unglamorous but it matters a lot. You need clean text, not HTML soup or garbled PDF artifacts. You also need to carry metadata through the pipeline: who wrote it, when it was last updated, what team it belongs to. That metadata drives filtering at retrieval time.
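As a rough sketch of what carrying metadata through the pipeline can look like, here is a minimal record type plus a crude HTML cleaner. The `SourceDocument` fields and the `clean_html` helper are illustrative names we made up, not part of any connector, and a real pipeline would use a proper HTML or PDF parser instead of regexes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import html
import re


@dataclass
class SourceDocument:
    """Clean text plus the metadata that later drives retrieval filters."""
    doc_id: str
    text: str
    author: str
    team: str
    updated_at: datetime
    source: str  # e.g. "confluence", "sharepoint", "s3"


def clean_html(raw: str) -> str:
    """Rough HTML-to-text: drop tags, unescape entities, squeeze whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


doc = SourceDocument(
    doc_id="runbook-001",
    text=clean_html("<h1>Backup &amp; restore</h1><p>Drain the node first.</p>"),
    author="jane",
    team="platform",
    updated_at=datetime(2024, 1, 15, tzinfo=timezone.utc),
    source="confluence",
)
```

Every chunk produced later would inherit these fields, which is what makes filtering by team or recency possible at query time.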

One thing teams consistently get wrong early on: access controls. If a user shouldn't see a document in the source system, they shouldn't be able to retrieve it through your RAG app either. Propagate your ACLs from day one. Retrofitting this later is painful.
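A minimal sketch of ACL propagation, assuming each chunk carries the set of groups allowed to read its source document. The `Chunk` type and `visible_chunks` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset  # groups copied from the source system's ACL


def visible_chunks(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    user_groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]


chunks = [
    Chunk("c1", "Public onboarding guide", frozenset({"all-staff"})),
    Chunk("c2", "Salary bands", frozenset({"hr"})),
]
```

In practice you would likely push this check into the vector store as a metadata filter so restricted chunks are never retrieved in the first place. Filtering in application code, as here, is the simplest version to reason about.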

Chunking

You can't embed a 40-page runbook as a single vector. You split it into chunks, and how you split it shapes everything downstream.

Fixed-size splitting by token count is the default everyone starts with. It works, but it cuts through sentences and paragraphs with no awareness of meaning. Fine for prototypes.
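For reference, fixed-size splitting is only a few lines. This sketch assumes the text is already tokenized into a list (whitespace-separated words stand in for real tokenizer output) and adds an optional overlap, which the text above does not require.

```python
def fixed_size_chunks(tokens, size, overlap=0):
    """Split a token list into windows of `size`, sharing `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks


words = "a b c d e f g".split()
windows = fixed_size_chunks(words, size=3, overlap=1)
```

Notice that nothing here looks at sentence or paragraph boundaries, which is exactly the weakness described above.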

What actually works well in production is parent-child chunking. Small chunks (256 to 512 tokens) get embedded and retrieved because they're precise enough to match specific queries. When you pass context to the LLM, you send the larger parent section so the model has enough surrounding context to generate a coherent answer.
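A minimal sketch of the parent-child idea, assuming documents are already split into sections and using word counts in place of real token counts. `build_parent_child` and `expand_to_parents` are illustrative names, not a library API.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    child_id: str
    parent_id: str
    text: str


def build_parent_child(sections, child_size=50):
    """sections maps parent_id -> section text. Returns (parents, children)."""
    parents = dict(sections)
    children = []
    for parent_id, text in sections.items():
        words = text.split()
        for n, start in enumerate(range(0, len(words), child_size)):
            piece = " ".join(words[start:start + child_size])
            children.append(ChildChunk(f"{parent_id}#{n}", parent_id, piece))
    return parents, children


def expand_to_parents(hit_children, parents):
    """Swap retrieved children for their parent sections, keeping rank order."""
    seen = []
    for child in hit_children:
        if child.parent_id not in seen:
            seen.append(child.parent_id)
    return [parents[pid] for pid in seen]


parents, children = build_parent_child({"s1": "a b c d", "s2": "e f"}, child_size=2)
```

Only the children get embedded. Deduplicating parents matters because several children from the same section often land in the same result set.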

If you're indexing highly technical content like API docs or engineering specs, pay extra attention to where your chunks land. A code snippet split across two chunks is useless in both.
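One way to avoid cutting code in half, assuming your extracted text uses Markdown-style fenced code blocks: split on blank lines, but never while inside a fence. This is a sketch of the idea, not a full Markdown parser, and `split_keeping_code` is a name we made up.

```python
FENCE = "`" * 3  # the Markdown code-fence marker


def split_keeping_code(text):
    """Split on blank lines, except inside fenced code blocks."""
    blocks, current, in_code = [], [], False
    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            in_code = not in_code  # entering or leaving a code block
        if line.strip() == "" and not in_code:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


sample = (
    "Intro paragraph.\n\n"
    + FENCE + "python\nx = 1\n\ny = 2\n" + FENCE
    + "\n\nOutro."
)
blocks = split_keeping_code(sample)
```

The blank line inside the code block survives, so the snippet stays in one piece and can be packed into a single chunk.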

Embeddings

Each chunk becomes a vector that encodes its meaning. Similar meaning maps to similar position in vector space. At query time, your question becomes a vector too, and you find the nearest chunks.
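To make "nearest chunks" concrete, here is a brute-force version using cosine similarity. The three-dimensional vectors are toy values we made up. Real embeddings have hundreds or thousands of dimensions and come from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query_vec, chunk_vecs, k=3):
    """chunk_vecs maps chunk id -> vector. Returns the k most similar ids."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]


chunk_vecs = {
    "vpn": [0.9, 0.1, 0.0],
    "payroll": [0.0, 0.2, 0.9],
    "wifi": [0.7, 0.3, 0.1],
}
top = nearest([1.0, 0.0, 0.0], chunk_vecs, k=2)
```

This scans every vector, which is fine for a demo and hopeless at scale. That gap is what a vector index exists to close.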

Model choice matters, though maybe less than people think at first. text-embedding-3-large from OpenAI is a solid baseline. If your corpus is heavily domain-specific like legal, medical, or financial content, it's worth evaluating models fine-tuned on that domain.

The mistake we see constantly is going dense-only. Pure embedding-based search is great at semantic similarity but terrible at exact matches on product names, version numbers, and internal acronyms. A dense model often fumbles these. Add BM25 sparse retrieval alongside your embeddings. Hybrid retrieval isn't complicated to implement and the quality improvement is immediate.
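The text above doesn't prescribe how to combine the two result lists. One common choice is reciprocal rank fusion (RRF), sketched here as an assumption on our part. It only needs the ranked ids from each retriever, so you never have to reconcile BM25 scores with cosine similarities. The constant `k=60` is a commonly used default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked id lists. Each hit adds 1 / (k + rank) to its id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]   # ids ranked by embedding similarity
sparse_hits = ["c", "a", "d"]  # ids ranked by BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that both retrievers agree on float to the top, while a document found by only one retriever still makes the list.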

Retrieval

Approximate nearest neighbor (ANN) search is how you query a vector index at scale. It's fast even over tens of millions of vectors. But "approximate" is doing real work in that name. You will miss relevant documents sometimes.

Two things help. Metadata filtering narrows the search space before running ANN. If the user is on the engineering team asking about infrastructure docs, filter to those first. Query expansion rephrases the user's question two or three ways, runs retrieval for each, then merges results. One query misses things another catches. Cheap to implement, meaningful gains in recall.
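Both ideas fit in a few lines. In this sketch the rephrased queries are hard-coded (in a real system an LLM would typically generate them), `fake_search` stands in for your vector index, and merging by best score is one simple option that assumes scores are comparable across queries.

```python
def filter_by_metadata(chunks, **required):
    """Keep chunks whose metadata matches every required key/value pair."""
    return [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in required.items())
    ]


def multi_query_retrieve(queries, search_fn, top_k=5):
    """Run search_fn for each query variant and keep each chunk's best score."""
    best = {}
    for query in queries:
        for chunk_id, score in search_fn(query):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    return sorted(best, key=lambda cid: (-best[cid], cid))[:top_k]


# Hypothetical search results keyed by query text.
FAKE_RESULTS = {
    "how do I rotate api keys": [("c1", 0.82), ("c2", 0.40)],
    "api key rotation procedure": [("c3", 0.77), ("c1", 0.90)],
}


def fake_search(query):
    return FAKE_RESULTS.get(query, [])


merged = multi_query_retrieve(list(FAKE_RESULTS), fake_search, top_k=3)
docs = [
    {"id": "c1", "meta": {"team": "engineering"}},
    {"id": "c2", "meta": {"team": "finance"}},
]
eng_only = filter_by_metadata(docs, team="engineering")
```

Here `c3` only shows up for the second phrasing, which is the whole point of expansion.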

Retrieve more candidates than you think you need. Top 20 to 50 is reasonable. You'll prune them in the next stage.

Reranking

ANN retrieval is optimized for speed. You're comparing pre-computed embeddings, which means the query and each document were encoded independently and never see each other. A cross-encoder does something more expensive but more accurate. It reads the query and document together and scores how well that specific document answers that specific query.

You run this over your 20 to 50 retrieved candidates, reorder them, and pass only the top 3 to 10 to the LLM. Latency hit is manageable. Quality improvement is significant.
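The shape of the rerank stage, sketched with a pluggable scoring function. `overlap_score` is a toy stand-in we wrote so the example runs without a model. In a real system you would pass a function that calls your cross-encoder or reranking API instead.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """candidates is a list of (chunk_id, text). Returns the top_n chunk ids."""
    scored = [(score_fn(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [chunk_id for _, chunk_id in scored[:top_n]]


def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


candidates = [
    ("c1", "how to rotate logs"),
    ("c2", "rotate api keys every 90 days"),
    ("c3", "api gateway overview"),
]
best_two = rerank("rotate api keys", candidates, overlap_score, top_n=2)
```

The important property is that the scorer sees the query and the candidate text together, one pair at a time, which is why you only run it over a few dozen candidates.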

Cohere Rerank is the path of least resistance if you want a hosted API. BGE Reranker is a solid option if you need to self-host. Either way, add it. Teams that skip reranking because they're moving fast almost always come back to add it after enough bad answers.

Generation

By the time you get here, the hard work is done. You have a small set of high-quality relevant chunks. You construct a prompt, pass it to the LLM, and get an answer.

A few things worth being deliberate about. Tell the model to cite its sources, not just because it builds user trust, but because it forces the model to stay grounded. If the answer isn't in the retrieved context, the model should say so. Explicitly prompt for this behavior.
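A sketch of a prompt that asks for both behaviors. The exact wording is ours and will need tuning for your model. The numbering scheme and the refusal sentence are assumptions, not a standard.

```python
def build_prompt(question, chunks):
    """chunks is a list of (source_id, text) pairs, best first."""
    context = "\n\n".join(
        f"[{i}] ({source_id}) {text}"
        for i, (source_id, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below.\n"
        "Cite the context entries you used, like [1] or [2].\n"
        "If the context does not contain the answer, reply exactly: "
        '"I could not find this in the provided documents."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How do I drain a node?",
    [
        ("runbook-7", "Drain the node before restarting."),
        ("wiki-12", "Nodes are restarted weekly."),
    ],
)
```

Keeping the source id next to each numbered entry lets you map a citation like [1] back to a document link when you render the answer.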

Streaming matters for feel. A three-second pause followed by a wall of text feels slow even when it isn't. Start rendering tokens as they arrive.

For enterprise deployments, run outputs through a guardrail check before the response reaches the user. Look for PII leakage, off-topic answers, and policy violations. Not every query needs this but some do, and you want the infrastructure in place.
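The simplest possible guardrail is a pattern check on the output. The two regexes below (email addresses and US-style Social Security numbers) are illustrative only. Real PII detection needs far more than this, and off-topic or policy checks usually need a classifier.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def check_output(text):
    """Return the names of the checks the text fails (empty list means clean)."""
    violations = []
    if EMAIL.search(text):
        violations.append("email")
    if US_SSN.search(text):
        violations.append("ssn")
    return violations


def guarded(text, fallback="Sorry, I can't share that response."):
    """Return the text if it passes every check, else a fallback message."""
    return fallback if check_output(text) else text
```

Even a crude check like this gives you the hook in the response path, so adding stricter checks later is a matter of extending `check_output`.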

What actually breaks in production

The LLM is rarely the problem. The pipeline is.

Stale index. Documents get updated at the source but nobody triggers a re-index. The user asks a question and gets last quarter's answer. Fix this with event-driven ingestion rather than nightly batch jobs.
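A sketch of the event-driven side, assuming your source system can call a handler (a webhook, a queue consumer) whenever a document changes. `IndexState` and `handle_update` are hypothetical names. The content hash is there so that duplicate or no-op events don't trigger needless re-embedding.

```python
import hashlib


class IndexState:
    """Tracks a content hash per document to detect real changes."""

    def __init__(self):
        self.hashes = {}
        self.reindexed = []  # stands in for the actual re-index work

    def handle_update(self, doc_id, text):
        """Called on each update event. Returns True if a re-index happened."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(doc_id) == digest:
            return False  # content unchanged, nothing to do
        self.hashes[doc_id] = digest
        # A real handler would re-chunk, re-embed, and upsert here.
        self.reindexed.append(doc_id)
        return True


state = IndexState()
first = state.handle_update("runbook-7", "Drain the node first.")
repeat = state.handle_update("runbook-7", "Drain the node first.")
changed = state.handle_update("runbook-7", "Cordon, then drain the node.")
```

Deletions need the same treatment: a removed document should drop out of the index on the delete event, not at the next full rebuild.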

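Token limit mismatches. Your embedding model was trained on 512-token contexts and you're sending it 1,500-token chunks. It still works, sort of, but you're outside the model's design envelope and results degrade quietly. Many embedding libraries silently truncate input past the model's maximum length, so the tail of each chunk may never be embedded at all.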
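A cheap check to run at ingestion time. The default counter here splits on whitespace, which is an assumption for the sake of a runnable example and will undercount for most tokenizers. Pass your embedding model's real tokenizer as `count_tokens`.

```python
def oversized_chunks(chunks, max_tokens, count_tokens=None):
    """chunks maps chunk id -> text. Returns ids that exceed max_tokens."""
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())  # crude word count
    return [
        chunk_id for chunk_id, text in chunks.items()
        if count_tokens(text) > max_tokens
    ]


sample_chunks = {
    "short": "one two three",
    "long": " ".join(["tok"] * 600),
}
too_big = oversized_chunks(sample_chunks, max_tokens=512)
```

Failing the ingestion job when this list is non-empty turns a quiet quality problem into a loud one.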

No eval loop. You launched, it mostly works, and now you have no idea whether it's getting better or worse as documents change and query patterns shift. Log everything. Build a labeled test set early. Measure retrieval quality, not just answer quality.
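Two retrieval metrics that are easy to compute once you have a labeled test set are recall@k and reciprocal rank. This sketch assumes each test query comes with a set of chunk ids a human marked as relevant.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
r3 = recall_at_k(retrieved, relevant, k=3)
rr = reciprocal_rank(retrieved, relevant)
```

Average these over the whole test set and track them per release. A drop in recall tells you the problem is upstream of the LLM before anyone reads a single generated answer.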