Why you need RAG

One question, two tracks: on the left the model answers from memory, on the right from a source. The comparison is the lesson.

Question: "What is the file-upload limit on the Pro plan?" The answer changed recently, so the model's memory is stale.

Without RAG

The model answers from memory - there is no retrieval

Query: the user's question, "What is the file-upload limit on the Pro plan?"
LLM (weights only): an answer from frozen knowledge. Training cutoff: trained up to [cutoff date]. Cutoff dates vary by model; see the Anthropic models overview. There is no fresh source, so the answer is assembled from the frozen weights with no retrieval.
Answer: "The limit is 100 files." No source is given, the data may be stale, and a hallucination is possible.

With RAG

The model answers from the source it found

Query: the user's question, "What is the file-upload limit on the Pro plan?"
Embedding (question -> vector): the text is encoded into a dense vector for meaning-based search; the dimension depends on the model, e.g. 1536 for text-embedding-3-small.
Vector index (top-k): a search for fresh fragments finds the chunk limits.md (updated today), cosine 0.94.
Context assembly (fragment -> LLM prompt): the found text is slotted into the request as support (grounding).
Answer with a citation: "On Pro - up to 500 files, 20 MB in total [1]." [1] limits.md, sec. 2 - source updated today. The answer is grounded on a source and carries a citation.
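
The With RAG stages above can be sketched end to end as a toy. This is a minimal sketch, not a production retriever: `embed` is a bag-of-words stand-in for a real embedding model, the three chunks and their wording are hypothetical, and the scores it produces are not comparable to the 0.94 shown in the diagram.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical index contents; in a real system these come from your documents.
chunks = [
    {"source": "limits.md, sec. 2",
     "text": "Pro plan file-upload limit: up to 500 files, 20 MB in total."},
    {"source": "billing.md, sec. 1",
     "text": "Pro plan invoices are issued monthly."},
    {"source": "sso.md, sec. 3",
     "text": "Single sign-on is configured by an administrator."},
]

def top_k(question, k=1):
    """Rank every chunk by similarity to the question and keep the best k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]

def build_prompt(question, hits):
    """Context assembly: number each chunk so the answer can cite it as [n]."""
    context = "\n".join(
        f"[{i}] ({h['source']}) {h['text']}" for i, h in enumerate(hits, 1)
    )
    return (
        "Answer only from the context and cite sources as [n].\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

question = "What is the file-upload limit on the Pro plan?"
hits = top_k(question, k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

The resulting `prompt` is what would be sent to the model; swapping `embed` for a real embedding model and `chunks` for a vector index changes the quality of the ranking, not the shape of the pipeline.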

Why the right-hand answer is better

Fresh and private data without retraining. RAG slots a source in at answer time.
Fewer fabricated facts. The answer rests on the found text (grounding).
Links to sources. The citation [1] leads to the document and section.
Cheaper than fine-tuning. Updating a document in the index takes minutes.

The problem: the model goes stale and does not know your data

An ordinary model is trained up to some date (the training cutoff) and frozen after that: it knows nothing about events and documents later than that date. For example, the Anthropic models overview states the training-data cutoff date for each model (platform.claude.com/docs/en/docs/about-claude/models). On top of that, the model has never seen your private documents. The result is two sources of error: stale facts and fabrications about your data.

RAG removes both in one move: the needed fact is fed at request time from your fresh index. Compare two paths of the same question - without RAG and with RAG:

One question, two paths: without RAG (from memory) and with RAG (grounded on a chunk).
# Same question, two paths. Track A: no context. Track B: with retrieval.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
question = "Under our policy, how many vacation days are there during probation?"

def ask(context=None):
    if context:
        prompt = (
            "Answer only from the context. If the answer is not in the context, "
            f"say 'this is not in the documents'.\nContext:\n{context}\n\nQuestion: {question}"
        )
    else:
        prompt = question
    r = client.messages.create(
        model="claude-sonnet-4-6",  # current model -- see the models overview
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.content[0].text

# Track A (no RAG): answer from memory - may be a fabrication about YOUR policy.
print("NO RAG:", ask())

# Track B (with RAG): retrieval supplies a real chunk, the answer is grounded.
retrieved = "Vacation during probation: 2 days for each month worked."
print("RAG:   ", ask(context=retrieved))
  1. Client and question -- One Anthropic client and one fixed question about internal policy - the point where an ordinary model does not know your data.
  2. With-context branch -- When context is passed, the instruction tells the model to answer only from it and to say 'this is not in the documents' if there is no answer - the basis of grounding.
  3. No-context branch -- Without context the prompt is the bare question; the Messages API call is identical for both paths, only the input prompt differs.
  4. Track A: no RAG -- Call with no context: the answer comes from parametric memory and may be a fabrication about your specific policy.
  5. Track B: with RAG -- Here a hardcoded string stands in for the chunk that retrieval would return; it goes into the context first, and the answer is then grounded on it.

Track B never writes the answer until the context is substituted in: retrieval first, then generation. The call shape is the real Anthropic Messages API (platform.claude.com/docs/en/api/messages).

Why an ordinary model cannot cope

Two reasons, both structural rather than "you asked badly":

  • Training cutoff. The model knows nothing that appeared after the cutoff date; updating its knowledge means retraining or fine-tuning the model. The cutoff dates are published in the models overview (platform.claude.com/docs/en/docs/about-claude/models).
  • No private data. Your internal documents were not and will not be in the public training corpus, so any answer about them without retrieval is a guess.

Fresh and private data without retraining

RAG keeps knowledge OUTSIDE the model - in an external index you manage yourself. Adding or updating a fact means re-indexing one document, not retraining the model. That is exactly why Lewis et al., 2020 separated parametric memory (the model's weights) from non-parametric memory (the external document index): the index can be changed without retraining (arxiv.org/abs/2005.11401). In practice: you update a line in a file, the file is re-indexed, and from the next request on the assistant answers from the new version.
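
The "update a document, not the model" idea can be shown with a minimal sketch. Everything here is an assumption made for illustration: the index is a plain dictionary, `upsert` and `retrieve` are hypothetical helper names, and the keyword lookup stands in for real vector search.

```python
# Hypothetical in-memory index: document id -> current text.
index = {}

def upsert(doc_id, text):
    """Add or replace one document. The model's weights are never touched."""
    index[doc_id] = text

def retrieve(keyword):
    """Toy lookup: return (doc_id, text) pairs whose text mentions the keyword."""
    return [(d, t) for d, t in index.items() if keyword.lower() in t.lower()]

# Old version of the document is indexed.
upsert("limits.md", "Pro plan: up to 100 files per upload.")
before = retrieve("pro plan")

# The limit changes: re-index the same document id with the new text.
upsert("limits.md", "Pro plan: up to 500 files, 20 MB in total.")
after = retrieve("pro plan")

print(before)
print(after)
```

The second `retrieve` call sees only the new text, so any answer grounded on it reflects the update without a training run.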

Fewer fabricated facts: grounding + citations

When the model answers from the supplied context, the answer is grounded on concrete fragments, and you can attach a link to the source chunk for every claim. The original work reports this effect: RAG generates more specific and factual answers than a purely parametric baseline (Lewis et al., 2020, arxiv.org/abs/2005.11401). One caveat: supply only a few of the most relevant pieces and put the important ones near the beginning or end of the prompt, because models make worse use of information buried in the middle of a long context (Liu et al., 2023, arxiv.org/abs/2307.03172).
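
One way to act on both points, numbered citations and keeping strong chunks away from the middle, is sketched below. It is an illustration under assumptions, not a method taken from either paper: `order_for_edges` and `build_context` are hypothetical helpers, the chunk list is invented, and it is assumed to arrive already sorted best-first.

```python
def order_for_edges(ranked):
    """Reorder a best-first list so the strongest items sit at the two ends.

    Items are dealt alternately to the front and the back, which leaves the
    weakest items in the middle of the result.
    """
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]

def build_context(ranked, k=3):
    """Keep the top k chunks, reorder them, and number them for citation."""
    kept = order_for_edges(ranked[:k])
    lines = [f"[{n}] {c['text']} (source: {c['source']})"
             for n, c in enumerate(kept, 1)]
    citations = {n: c["source"] for n, c in enumerate(kept, 1)}
    return "\n".join(lines), citations

# Hypothetical retrieval result, sorted from most to least relevant.
ranked = [
    {"source": "limits.md, sec. 2", "text": "Pro: up to 500 files, 20 MB in total."},
    {"source": "plans.md, sec. 1", "text": "Pro is the paid individual plan."},
    {"source": "faq.md, sec. 5", "text": "Uploads are scanned before storage."},
    {"source": "old-limits.md, sec. 2", "text": "Archived limits page."},
]

context, citations = build_context(ranked, k=3)
print(context)
print(citations)
```

The `citations` mapping lets the application turn a `[n]` marker in the model's answer back into a document and section.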

Cost: RAG versus fine-tuning

Fine-tuning requires collecting a dataset, running training, and repeating that at every data update - a separate cycle of work and expense (see the OpenAI model-optimization guide, which covers fine-tuning, developers.openai.com/api/docs/guides/model-optimization). RAG, by contrast, adds one retrieval step before an ordinary model call and indexes new documents incrementally. When the task is to give the model FACTS that change often, RAG is usually cheaper and faster to maintain; fine-tuning stays for tasks of FORM and style. (This is an engineering trade-off, not an absolute: do the math on your own volumes - see the production chapter.)
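
"Doing the math" can start from a back-of-the-envelope model like the one below. This is a sketch under loud assumptions: the two cost formulas are deliberate simplifications, and every number is a placeholder chosen for illustration, not a real price or a measured volume. Substitute your own figures.

```python
def rag_monthly_cost(doc_updates, cost_per_reindex, queries, extra_cost_per_query):
    """Simplified RAG cost: re-indexing changed documents plus the per-query
    overhead of retrieval and a longer prompt."""
    return doc_updates * cost_per_reindex + queries * extra_cost_per_query

def fine_tune_monthly_cost(data_updates, cost_per_training_run):
    """Simplified fine-tuning cost: one training run per data refresh
    (dataset preparation is folded into the per-run figure)."""
    return data_updates * cost_per_training_run

# PLACEHOLDER inputs, not real prices or volumes.
rag = rag_monthly_cost(
    doc_updates=200, cost_per_reindex=0.01,
    queries=10_000, extra_cost_per_query=0.002,
)
ft = fine_tune_monthly_cost(data_updates=4, cost_per_training_run=50.0)

print(f"RAG: {rag:.2f}  fine-tuning: {ft:.2f}")
```

The useful part is the shape of the formulas: RAG cost grows with query volume, fine-tuning cost grows with how often the data changes, so the comparison flips depending on which of the two dominates in your system.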

Below is an interactive two-track trace of one request. Track A (without RAG): the request goes straight to the model and yields a stale or fabricated answer. Track B (with RAG): the request first fills the context box with the retrieved chunks, and only AFTER that does the grounded answer appear. The main path is Track B; some nodes can be drilled into (semantic zoom) to show more detail. The grounded green answer is never drawn before retrieval is complete - that is the didactic meaning of the motion. With JS off, both tracks are shown as a static inline-SVG schematic with the full text.

Try it yourself

  • Run Track B and watch how the context node fills with chunks BEFORE the grounded answer appears.
  • Compare Track A and Track B on one question: note that without retrieval the answer appears immediately and without a link to a source.
  • Drill into the context node in Track B and see exactly which chunks made it into the prompt and in what order.

What is next

The next stop is chunking: how to cut large documents into retrieval-sized chunks, where exactly to cut, and how to choose the size and overlap.
