What Is RAG? Retrieval-Augmented Generation, Explained

Published 2026-07-04 · Updated 2026-07-04

RAG, short for retrieval-augmented generation, is a technique where an AI system first retrieves relevant documents or facts from an external store, then generates its answer with those retrieved passages placed in front of the model. Instead of relying only on what it absorbed during training, the model gets to read a small packet of just-in-time material and write its response against that. The term was introduced in a 2020 paper by Patrick Lewis and colleagues, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks."

The idea is easier to hold than the acronym suggests. Look something up, then answer using what you found. RAG is that habit, made into a system and wired around a language model.

Why models need retrieval at all

A language model carries two kinds of limitation that retrieval is designed to work around.

The first is that its training knowledge is frozen. Everything the model learned was fixed at a training cutoff date, before your conversation began. It will not know what happened last week, and it has never seen your private documents, your notes, or the specifics of your account. It cannot know these things on its own, because they were not part of the text it learned from.

The second is that its working memory is finite. Each turn, the model reads only what fits inside a fixed limit called the context window. You cannot pour an entire company wiki, a year of email, or a library of manuals into that window; most of it will not fit, and filling the window to the brim makes each reply slower and more expensive, and can make the model less reliable at using what it was given. So the practical question becomes: of everything that might be relevant, which few passages should occupy the limited space in front of the model right now?

Retrieval answers exactly that question. Rather than hoping the model already memorized the right fact, or trying to cram everything into the window, a RAG system fetches the handful of passages most likely to matter and hands only those to the model. This is the same underlying constraint that explains why an assistant forgets earlier turns in a long chat: the window is small, so what goes into it has to be chosen well.
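To make that selection problem concrete, here is a minimal sketch that packs already-ranked passages into a fixed budget. The function name fit_to_budget, the sample passages, and the use of word count as a stand-in for tokens are all assumptions made for illustration; a real system would count tokens with the model's own tokenizer.

```python
def fit_to_budget(ranked_passages: list[str], budget_words: int) -> list[str]:
    """Keep the highest-ranked passages that fit inside a word budget.

    Word count is a crude stand-in for tokens (an assumption of this sketch).
    """
    chosen: list[str] = []
    used = 0
    for passage in ranked_passages:  # assumed to be sorted best-first
        cost = len(passage.split())
        if used + cost > budget_words:
            continue  # too big for the space left; a shorter one may still fit
        chosen.append(passage)
        used += cost
    return chosen


# Made-up passages, ordered from most to least relevant.
ranked = [
    "Refunds are available within two weeks of purchase.",
    "Our company was founded by two friends who met at university and shared a love of hiking.",
    "Contact support to start a refund.",
]
selected = fit_to_budget(ranked, budget_words=15)
```

With a budget of 15 words, the long company-history passage is skipped and the two short refund passages are kept. The budget forces a choice, and ranking decides who wins it.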

The three steps: store, retrieve, generate

A RAG system runs in three movements. Once you see them separately, the whole technique stops being mysterious.

Store. First, the source material is broken into bite-sized pieces, a step usually called chunking. A long manual becomes a few hundred short passages rather than one wall of text, because retrieval works better when the units are small and self-contained. Each chunk is then converted into an embedding: a list of numbers that represents the passage's meaning as a position in space. The useful intuition is that meaning gets mapped to coordinates, so passages that mean similar things land near each other, even when they share no exact words. "How do I get my money back" and "refund policy" end up as neighbors. All these coordinates are kept in a store built for fast lookup.
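The store step can be sketched in a few lines. Everything named here (chunk_text, toy_embed, the sample manual text) is hypothetical. In particular, toy_embed is only a normalized word-count vector: it matches passages that share words, so unlike a real embedding model it would not place "how do I get my money back" near "refund policy". It stands in for the embedding call so the shape of the pipeline is visible.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Group whole sentences into chunks of roughly max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        words = sentence.split()
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


def toy_embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: a unit-length word-count vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


# Made-up source material for illustration.
manual = (
    "Refunds are issued to the original payment method. "
    "Contact support to start a refund. "
    "Orders ship within two business days."
)

# The "store": each chunk kept alongside its vector.
store = [(chunk, toy_embed(chunk)) for chunk in chunk_text(manual, max_words=14)]
```

Splitting on sentence boundaries rather than at a fixed character count is one design choice among several; it keeps each chunk self-contained, which is the property the paragraph above asks for.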

Retrieve. When a question comes in, it is embedded the same way, turning it into a point in the same space. The system then looks for the stored passages sitting closest to that point, a nearest-neighbor search. Closeness stands in for relevance: the passages nearest the question are the ones most likely to answer it. A few of the top matches are pulled out.
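A minimal sketch of that nearest-neighbor lookup, using cosine similarity as the measure of closeness. The three-number vectors below are written by hand to stand in for real embeddings, and the passages, function names, and query vector are all made up for illustration. Production systems typically use an approximate index rather than scoring every stored vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2):
    """Score every stored passage against the query and return the k closest."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-written vectors standing in for real embeddings (an assumption).
store = [
    ("Refunds are issued to the original payment method.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("To return an item, contact support.", [0.8, 0.3, 0.1]),
]

# Pretend this is the embedding of "How do I get my money back".
query_vec = [1.0, 0.2, 0.0]
results = top_k(query_vec, store, k=2)
```

Here the refund and returns passages score highest and the office-hours passage is left behind, even though the question shares no words with either match.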

Generate. Those retrieved passages are pasted into the prompt, directly in front of the model, usually with an instruction along the lines of "answer using the material below." The model then writes its reply while looking at that fresh, specific text. It is not reciting from memory; it is reading a short packet assembled for this exact question, and composing an answer against it.
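The generate step is mostly string assembly. The sketch below builds such a prompt; the function name build_prompt, the exact instruction wording, and the numbering scheme are assumptions, and the call that sends the prompt to a model is left out because it depends on whichever model API is in use.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Place retrieved passages in front of the question, with an instruction."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using the material below. "
        "If the material does not contain the answer, say so.\n\n"
        f"Material:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up retrieved passages for illustration.
passages = [
    "Refunds are issued to the original payment method.",
    "To return an item, contact support.",
]
prompt = build_prompt("How do I get my money back?", passages)
# The prompt would now be sent to a language model of your choice.
```

Numbering the passages is a small design choice that pays off later: it gives the model a way to point at which passage it relied on.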

RAG is how "memory" features work under the hood

This three-step shape is not confined to document search. It is, in broad strokes, how persistent memory features tend to work across the industry.

A memory feature that remembers facts about you is, mechanically, a retrieval system pointed at you. When something worth keeping surfaces in a conversation, it is saved as a small note. That note is embedded (the same meaning-to-coordinates step) and filed in a personal store. Later, in a new conversation, the current topic is embedded and used to retrieve the handful of saved notes closest to it, which are then inserted into the context before the model replies. Save facts, embed them, retrieve them into context at the right moment: that is a memory feature, and it is also RAG.
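The save, embed, and retrieve cycle can be sketched as a small class. This is an illustration under stated assumptions, not how any particular product works: MemoryStore, toy_embed, and the sample notes are hypothetical, and toy_embed is a plain word-count vector standing in for a real embedding model, so it only matches notes that share words with the topic.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: counts of lowercase words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryStore:
    """Saves short notes with their vectors and recalls the closest ones."""

    def __init__(self) -> None:
        self.notes: list[tuple[str, Counter]] = []

    def save(self, note: str) -> None:
        self.notes.append((note, toy_embed(note)))

    def recall(self, topic: str, k: int = 2) -> list[str]:
        query = toy_embed(topic)
        ranked = sorted(self.notes, key=lambda n: cosine(query, n[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = MemoryStore()
memory.save("User is vegetarian and avoids mushrooms.")
memory.save("User lives in Lisbon.")
memory.save("User prefers short answers.")

# In a later conversation, the current topic pulls back the relevant note.
recalled = memory.recall("Suggest a vegetarian dinner recipe", k=1)
```

The recalled note would then be placed in the context ahead of the model's reply, exactly as retrieved passages are in document search.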

This is why the two ideas are so easy to conflate, and why understanding one clarifies the other. A model does not remember you in any human sense between sessions. What feels like memory is a retrieval step quietly running just before the model speaks, pulling a few relevant notes back into view.

What RAG fixes, and what it does not

Retrieval is powerful, but it is not magic, and the line between the two is worth drawing plainly.

RAG fixes several real problems. It handles recency, since a store can be updated the moment facts change, without retraining the model. It brings in private data the model was never trained on, such as your own documents or account details. It supports source attribution, because the system knows which passages it retrieved and can cite or link them. And it uses the context window efficiently, occupying that limited space with a few passages chosen for this question rather than a mountain of loosely related text.
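Two of those benefits, recency and source attribution, fall out of keeping passages in a plain editable store keyed by where they came from. The sketch below assumes hypothetical file names and handbook text, and ranks by shared words as a toy stand-in for embedding similarity.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase word set, used here as a toy substitute for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_with_sources(question: str, store: dict[str, str], k: int = 2):
    """Return (source_id, passage) pairs ranked by words shared with the question."""
    q = words(question)
    ranked = sorted(
        store.items(),
        key=lambda item: len(q & words(item[1])),
        reverse=True,
    )
    return ranked[:k]


# Made-up store: each passage is keyed by the source it came from.
store = {
    "handbook/leave.txt": "Employees receive 20 days of paid leave per year.",
    "handbook/expenses.txt": "Submit expense reports by the fifth of each month.",
}

# Recency: when the fact changes, update the store. No retraining involved.
store["handbook/leave.txt"] = "Employees receive 25 days of paid leave per year."

hits = retrieve_with_sources("How many days of paid leave do employees get?", store, k=1)

# Attribution: the system knows which source each passage came from.
citations = [f"{passage} (source: {source_id})" for source_id, passage in hits]
```

The answer now reflects the updated figure and carries its source along with it.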

What RAG does not fix is just as important. Retrieval can miss: if the right passage is phrased in a way that lands far from the question in meaning-space, it may never be pulled, and the model answers without it, none the wiser. And retrieval is only as good as the store behind it. If a stored fact is wrong or outdated, RAG will retrieve it and the model will repeat it confidently, giving a stale error the full authority of a fresh answer. Garbage in, garbage out: retrieval faithfully surfaces whatever was saved, mistakes included. This is precisely why a store you can inspect and edit matters. When you can see what has been saved about you and correct or remove the wrong entries, you are fixing the retrieval at its source rather than arguing with its symptoms.
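Fixing retrieval at its source amounts to ordinary edits on the store. A minimal sketch, assuming saved notes live in a dictionary keyed by a numeric ID; the function names and the notes themselves are hypothetical.

```python
def inspect_notes(notes: dict[int, str]) -> list[str]:
    """List every saved note so a person can review it."""
    return [f"{note_id}: {text}" for note_id, text in sorted(notes.items())]


def correct_note(notes: dict[int, str], note_id: int, new_text: str) -> None:
    """Overwrite a wrong or outdated note in place."""
    if note_id not in notes:
        raise KeyError(f"no note with id {note_id}")
    notes[note_id] = new_text


def forget_note(notes: dict[int, str], note_id: int) -> None:
    """Remove a note entirely; missing IDs are ignored."""
    notes.pop(note_id, None)


# Made-up saved notes. Suppose note 2 is stale and note 3 was never true.
saved_notes = {
    1: "User's favorite editor is Vim.",
    2: "User lives in Toronto.",
    3: "User is allergic to peanuts.",
}

correct_note(saved_notes, 2, "User lives in Montreal.")
forget_note(saved_notes, 3)
remaining = inspect_notes(saved_notes)
```

After these edits, any later retrieval can only surface the corrected note, because the stale version no longer exists to be found.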

RAG vs fine-tuning vs long context

RAG is one of three common ways to get knowledge into a model's answer, and they suit different jobs.

Fine-tuning adjusts the model's own weights by training it further on new material. It is well suited to teaching a durable skill, style, or format, the kind of thing you want baked in permanently. It is a poor fit for facts that change often or for information private to a single user, because every update means retraining, and the knowledge becomes diffused into the model rather than kept as an editable, inspectable record.

Long context skips retrieval and simply pastes the relevant material, sometimes a great deal of it, directly into the window. It is simple and effective when the material is small enough to fit and you have it on hand. But large windows have limits: models attend less reliably to information buried in the middle of a long input, an effect documented in the 2023 paper "Lost in the Middle: How Language Models Use Long Contexts." Dumping everything in is also wasteful when only a few passages actually matter.

RAG sits between these. The knowledge lives outside the model in a store that can be updated and audited freely, and only the few passages relevant to each question are placed in the window. It fits changing facts, private data, and large bodies of material better than either alternative, at the cost of a retrieval step that has to find the right passages to earn its keep.

FAQ

Is RAG the same as memory? They are closely related but not identical. RAG is the general technique of retrieving external material into a prompt before generating. A memory feature is one application of that technique, aimed at you specifically: it saves and retrieves facts about you. All such memory features are retrieval systems, but RAG also powers things that have nothing to do with personal memory, like document search and question answering over a manual.

Does RAG stop hallucinations? It reduces them, but does not eliminate them. By putting real, relevant passages in front of the model, RAG gives it something concrete to work from instead of improvising, which curbs invented answers. But if retrieval misses the right passage, or the stored fact is itself wrong, the model can still produce a confident mistake, now with a false air of being grounded in a source.

Do I need to know about RAG as a user? Not to use a memory feature, no. But knowing that "memory" is usually retrieval under the hood explains its behavior: why it sometimes fails to recall a fact it should have (retrieval missed), why an editable, inspectable store matters (you are correcting the source, not the symptom), and why keeping what is saved accurate directly improves what the assistant tells you.