The AI Memory Glossary: 28 Essential Terms

Published 2026-07-04 · Updated 2026-07-04

This glossary defines the vocabulary behind how AI systems handle memory and context, from the raw units of text a model reads to the ways it stores, retrieves, and forgets what you tell it. Each term is defined precisely, grouped by theme, so you can look up a single concept or read straight through as a primer.

Basics

Token

A token is the small unit of text a language model actually reads and writes. It is often a word, part of a word, or a punctuation mark, roughly three-quarters of a word in English on average. Models measure their input and output limits, and usually their pricing, in tokens rather than characters or words.

Analogy: tokens are to an AI what individual bricks are to a wall; the model assembles every response one unit at a time.

Context Window

The context window is the maximum amount of text, measured in tokens, that a model can consider at once. It includes everything in the current exchange: your question, any instructions, retrieved documents, and the model's own reply-in-progress. Once a conversation exceeds this limit, older material must be dropped or compressed to make room.

Analogy: think of it as a fixed-size buffer. Only so much fits at one time; the rest has to be moved out or discarded.

Prompt

A prompt is the text you give a model to act on: a question, an instruction, or a block of content to work with. The quality and clarity of a prompt strongly shape the response. In practice, the full prompt sent to a model often bundles your message together with hidden instructions and background context.

System Prompt

A system prompt is a set of instructions placed at the start of a conversation to define how the model should behave, its tone, role, and rules, before the user says anything. Users usually do not see it. It persists across the exchange and steers every reply.

Analogy: it is the briefing a stage actor receives before the curtain rises, shaping how they play the scene without the audience ever hearing it.

Knowledge Cutoff

The knowledge cutoff is the date after which a model has no built-in information, because its training data stops there. A model with a 2024 cutoff will not natively know about events in 2025. This is separate from memory: a model can still be told newer facts in a prompt or fetch them through retrieval, but they were not incorporated into its original training.

How memory is stored

Persistent Memory

Persistent memory is information an AI system retains across separate conversations, so it does not start from a blank slate each time. Features branded as "ChatGPT Memory" or similar are examples of this category. Unlike the context window, which resets when a session ends, persistent memory is saved somewhere durable and pulled back in later.

Analogy: the context window is short-term working memory; persistent memory is the notebook you keep between meetings.

Episodic Memory

Episodic memory stores specific past events or interactions, tied to when and where they happened, such as "on Tuesday you asked me to draft a resignation letter." It captures particular episodes rather than general facts. Borrowed from human psychology, the term describes memory of experiences rather than knowledge.

Semantic Memory

Semantic memory stores general facts and stable knowledge, stripped of the specific moment they were learned, such as "the user is a vegetarian" or "the user works in finance." It is the counterpart to episodic memory: not what happened when, but what is simply true. Systems often distill many episodes down into a few durable semantic facts.

Profile / Preference Memory

Profile or preference memory is a curated set of stable facts about a user, such as their name, tone preferences, recurring goals, or standing instructions. It is a focused, often user-visible subset of semantic memory aimed at personalization. Because it is small and long-lived, it is usually the part of memory a user can most easily view and edit.

Memory Extraction

Memory extraction is the process of reading a conversation and pulling out the parts worth remembering, then saving them as structured facts. Rather than storing entire transcripts, a system decides which content is worth keeping and discards the rest. This step determines what actually enters long-term memory, and getting it wrong means either forgetting useful details or hoarding noise.

Analogy: it is comparable to taking notes after a meeting; you record the decisions, not every word spoken.

Embedding

An embedding is a list of numbers that represents the meaning of a piece of text as a point in mathematical space, so that texts with similar meaning sit close together. Models generate embeddings to compare ideas by meaning rather than by exact wording. They are the foundation of semantic search and most modern retrieval.

Analogy: an embedding is like a coordinate for meaning; "car" and "automobile" land at nearly the same spot, while "banana" is far away.

Vector Database

A vector database is a specialized store designed to hold embeddings and quickly find the ones most similar to a given query. It powers the "find related content by meaning" step in many memory and retrieval systems. Regular databases match exact values; a vector database ranks by closeness in meaning.

How recall works

Semantic Search

Semantic search finds content by meaning rather than by matching keywords. It works by turning both the query and the stored items into embeddings and measuring which are closest. This lets a search for "how to fix a flat tire" surface a document about "repairing a punctured wheel," even with no shared words.

Retrieval

Retrieval is the act of fetching relevant stored information, memories, documents, or facts, so it can be used to answer the current request. It is the bridge between a large store of knowledge and the small context window. Retrieval quality sets a ceiling on answer quality: the model can only use what is fetched.

RAG (Retrieval-Augmented Generation)

RAG is a technique where a model retrieves relevant external information first, then generates its answer using that fetched material as reference. It lets a model draw on knowledge outside its training data and cite fresher or more specific sources. The term comes from a 2020 paper by Lewis et al. that combined a neural retriever with a text generator.

Analogy: it is an open-book exam; instead of answering purely from memory, the model looks up the relevant page first.

Recall

Recall is bringing a stored memory back into the active conversation so the model can use it. In everyday use it describes the moment an AI appears to remember something you told it earlier. Technically, recall is the outcome of retrieval succeeding: the right memory surfaced at the right time.

Injection (Memory Injection)

Memory injection is the step of inserting retrieved memories or facts into the prompt before the model generates its reply. The model does not reach into a memory store on its own; relevant items are placed into its context window for that turn. Well-designed systems clearly mark injected memory as reference data rather than as new instructions, to reduce the risk of it being misread as a command.

Analogy: it is like a colleague placing a note on your desk before you answer a question, so the fact is in front of you.

Summarization

Summarization compresses a long conversation or document into a shorter version that keeps the key points. Memory systems use it to fit more history into a limited context window without keeping every word. The trade-off is detail: summaries save space but can smooth over specifics that later turn out to matter.

Truncation

Truncation is cutting off text to fit within a size limit, usually by dropping the oldest or least relevant parts of a conversation. When a chat outgrows the context window, truncation is the blunt method for making room. Unlike summarization, it does not preserve the meaning of what it removes; the cut material is simply gone.

Analogy: it is comparable to cropping a photo to fit a frame; whatever falls outside the edge is lost, not shrunk.

Controls & privacy

Hallucination

A hallucination is when a model states something false or fabricated with the same confidence it uses for true statements. It happens because models generate plausible-sounding text rather than looking up verified facts by default. Retrieval and memory can reduce hallucinations by grounding answers in real sources, but they do not eliminate them.

Stale Memory

Stale memory is a stored fact that was once true but is now outdated, such as an old job title, a former address, or a preference the user has since changed. Because memory persists, systems can keep applying facts long after they stop being accurate. Good memory design includes ways to update or remove facts as they age.

Memory Decay / Forgetting

Memory decay is the deliberate design choice to let stored memories weaken or expire over time, rather than keeping everything forever. It can be based on age, how often a memory is used, or relevance. Intentional forgetting keeps memory stores from filling with outdated or trivial details and mirrors how human memory fades.

Analogy: it is comparable to weeding a garden; without it, the useful plants get crowded out by everything that ever grew.

Incognito / Temporary Chat

An incognito or temporary chat is a session that is deliberately excluded from persistent memory, so nothing said in it is saved or used to personalize later conversations. It gives users a way to ask one-off or sensitive questions without shaping their long-term profile. When the session ends, its contents are meant to be discarded.

Training Data

Training data is the large body of text a model learns from during its initial development, which shapes its general knowledge and writing ability. It is fixed once training is complete and is distinct from anything you say during a conversation. A model's built-in knowledge, and its blind spots, trace back to what was and was not in this data.

Fine-Tuning

Fine-tuning is further training an already-trained model on a narrower set of examples to specialize its behavior, tone, or domain knowledge. It permanently changes the model's weights, unlike memory or retrieval, which add information at the moment of use. Because it alters the model itself, fine-tuning is a heavier, slower way to add knowledge than storing a fact in memory.

Analogy: memory is handing someone a reference sheet; fine-tuning is sending them back to school.

Training Opt-Out

A training opt-out is a setting or policy that keeps your conversations from being used to train or improve future models. It governs whether your data feeds back into the model's development, separately from whether it is retained for your own memory. Availability and defaults vary widely between providers, so the specifics are worth checking for any service you use.

Data Retention

Data retention is how long a provider keeps your conversations and stored data before deleting them. Retention policies determine what still exists to be exported, recalled, or, if you request it, erased. A short retention window limits exposure but also limits how far back memory and history can reach.

Memory Export

Memory export is a feature that lets you download the facts a system has stored about you, usually as a file. It supports transparency and data-portability rights, letting you see and take away what an AI remembers. Export is often paired with the ability to review, edit, or delete individual memories.

Definitions above are vendor-neutral and reflect common industry usage as of July 2026.