If you haven’t already, subscribe and join our community in receiving weekly AI insights, updates and interviews with industry experts straight to your feed.
We all like to be remembered. When you bump into someone you’ve only met once or twice before, and they remember your name or recall something interesting you said – it feels good. It shows you they paid attention, and in turn that makes you feel like you’re worthy of attention.
For most of its life, AI hasn’t really done that. It’s been impressive and fluent, but fundamentally forgetful. Users have known that each new interaction will start from scratch, with no true continuity.
But (as we’re sure you already know) that’s changing. Over the last five years, the development of AI memory has enabled us to reimagine what AI systems might become in the future.
Previously, the dominant form of memory in large language models was parametric memory: knowledge encoded in model weights during training. It’s described explicitly in this RAG paper, which contrasts parametric memory with non-parametric memory stored externally. Parametric memory is powerful, but it’s also hard to update, hard to audit, and not naturally tied to source attribution.
The first big turn in the modern LLM era was that leading systems stopped relying on weights alone.
In 2020, Retrieval-Augmented Generation formalised a hybrid approach: generation grounded partly in internal model knowledge and partly in retrieved external documents. The paper describes RAG as combining pre-trained parametric and non-parametric memory for language generation.
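The retrieve-then-generate pattern is simple to sketch. Below is a toy illustration of the idea, not the paper's actual setup: the corpus, the naive word-overlap scoring, and the `generate` function are all our own illustrative stand-ins (real RAG systems use dense vector retrieval and a neural generator).

```python
# Toy sketch of the RAG pattern: retrieve external (non-parametric)
# memory, then condition generation on it. Scoring here is naive
# word overlap, a stand-in for dense retrieval.

def retrieve(query, corpus, k=1):
    """Rank documents by word overlap with the query."""
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def generate(query, corpus):
    """Ground the answer in retrieved documents."""
    context = retrieve(query, corpus)
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    return prompt  # a real system would pass this prompt to an LLM

corpus = [
    "RETRO retrieves chunks from a 2 trillion token database.",
    "The Eiffel Tower is in Paris.",
]
print(generate("Where is the Eiffel Tower?", corpus))
```

The key design point is that the retrieved text is external state: it can be updated, audited and attributed without retraining the model.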
This marks the moment memory becomes a system design problem, not just a training outcome.
A year later, DeepMind’s RETRO pushed that argument much further. RETRO retrieves text chunks from a database of roughly 2 trillion tokens and reports performance on the Pile (EleutherAI’s 2020 text dataset of roughly 825GB and 300B tokens) comparable to GPT-3 and Jurassic-1, while using 25 times fewer parameters.

The point wasn’t just that retrieval helps factual QA; it was that access to external memory could be a viable scaling path in its own right.
That changed the tone of the field. Memory was no longer just a limitation to work around. It was becoming an architectural lever.
By 2023, the idea expanded again. In work on generative agents by Park et al., memory is not treated simply as factual lookup. The system stores a natural-language record of experiences, retrieves memories dynamically, and synthesises higher-level reflections that influence future planning and behaviour. The authors show that observation, planning and reflection each contribute materially to believable agent behaviour in their simulated environment.
The really interesting development here was that researchers began operationalising memory as accumulated experience plus retrieval plus synthesis. In other words, memory became a mechanism for stateful behaviour over time, not just recall.
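That accumulate-retrieve-synthesise loop can be sketched in a few lines. The class below is our own illustration of the pattern, not Park et al.’s implementation: the scoring combines recency, importance and relevance as the paper does, but the equal weighting and the `reflect` stub are simplifying assumptions (the real system asks the LLM to write reflections).

```python
import time

# Sketch of a generative-agent memory stream: store observations,
# retrieve by recency + importance + relevance, synthesise reflections.
# Weights and the reflect() heuristic are illustrative assumptions.

class MemoryStream:
    def __init__(self):
        self.records = []  # (timestamp, importance, text)

    def observe(self, text, importance):
        self.records.append((time.time(), importance, text))

    def retrieve(self, query, k=2):
        now = time.time()
        def score(rec):
            ts, imp, text = rec
            recency = 1.0 / (1.0 + (now - ts))
            relevance = len(set(query.split()) & set(text.split()))
            return recency + imp + relevance  # the paper weights these terms
        return [r[2] for r in sorted(self.records, key=score, reverse=True)[:k]]

    def reflect(self):
        # A real agent would prompt the LLM to summarise recurring themes;
        # here we only report what a reflection would cover.
        return f"Reflection over {len(self.records)} memories."
```

Retrieval feeding reflection, and reflection feeding back into planning, is what turns a log of events into stateful behaviour.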
That created a new engineering problem: how do you manage memory when the context window is finite?
This is where work like MemGPT became influential. MemGPT frames the challenge as virtual context management, drawing inspiration from hierarchical memory systems in operating systems. Its contribution is not literal RAM-and-disk emulation, but an OS-style approach to moving information between memory tiers so the model can appear to have access to more context than fits in the active window.
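The OS analogy is easiest to see in code. The sketch below is our own minimal rendering of the tiering idea, not MemGPT’s actual interface: token counting by word count and the keyword-based recall are simplifying assumptions.

```python
from collections import deque

# OS-style sketch of virtual context management: a bounded active
# context plus an archive, with eviction and recall moving messages
# between tiers (analogous to paging between RAM and disk).

class VirtualContext:
    def __init__(self, budget=20):
        self.budget = budget   # max "tokens" in the active context
        self.active = deque()  # what the model sees this turn
        self.archive = []      # external storage, unbounded

    def tokens(self):
        return sum(len(m.split()) for m in self.active)

    def add(self, message):
        self.active.append(message)
        while self.tokens() > self.budget:     # over budget: evict
            self.archive.append(self.active.popleft())

    def recall(self, keyword):
        # page archived messages mentioning the keyword back in
        hits = [m for m in self.archive if keyword in m]
        self.archive = [m for m in self.archive if keyword not in m]
        for m in hits:
            self.add(m)  # may evict something else, just like paging
        return hits
```

The model never gets more raw context; it gets a policy for deciding what occupies the window at any moment, which is exactly the operating-system framing.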
That helped to clarify a broader trend: retrieval, summarisation, prioritisation, and state management were becoming core parts of LLM system design.
Then came the leap in context length.
Google’s February 2024 Gemini 1.5 announcement described a breakthrough in long-context understanding, with Gemini 1.5 Pro entering testing with an experimental 1 million token context window; by May 2024 Google said 2 million tokens would be available in private preview, and by late June it opened 2 million token context to all developers.
Anthropic’s Claude 3 family launched with a 200K context window, and Anthropic said the models could accept inputs exceeding 1 million tokens for select customers.
This is where precision matters most. A larger context window is not the same thing as long-term memory. It is better understood as working memory at inference time: the system can hold and reason over much more information during the current interaction, but that information does not inherently persist beyond the session. For users, though, the experience is transformative. Entire codebases, research packets, or hours of multimodal input can sit inside one active reasoning frame.
The most visible recent shift is that memory has moved from research architecture into product design.
In February 2024, OpenAI said it was testing memory in ChatGPT so the system could remember things discussed across chats, and emphasised that users could ask what it remembers, tell it to forget, or turn the feature off.
OpenAI’s later help documentation distinguishes between at least two mechanisms: reference saved memories, which are explicit durable details, and reference chat history, which lets ChatGPT draw more broadly on past conversations.
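Those two mechanisms amount to distinct storage layers with different recall rules. The sketch below is purely our own illustration of that split (the class, method names and matching logic are invented for this newsletter, not OpenAI’s API): explicit saved memories are exact and user-deletable, while chat history is searched more loosely.

```python
# Illustrative two-layer product memory: explicit saved memories
# plus looser chat-history recall. All names here are our own
# assumptions, not OpenAI's implementation.

class ProductMemory:
    def __init__(self):
        self.saved = {}    # durable, user-visible facts
        self.history = []  # past conversation turns

    def save(self, key, value):   # "remember that..."
        self.saved[key] = value

    def forget(self, key):        # user-controlled deletion
        self.saved.pop(key, None)

    def log(self, turn):
        self.history.append(turn)

    def recall(self, query):
        # saved memories match on the exact key; history is searched
        # by loose word overlap
        explicit = {k: v for k, v in self.saved.items() if k in query}
        implicit = [t for t in self.history
                    if set(query.split()) & set(t.split())]
        return explicit, implicit
```

Separating the layers is what makes user control tractable: you can delete a saved fact outright, while history-based recall remains a softer, revocable signal.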
That shows product memory is no longer a single bucket. It is already being split into layers of persistence, recall and user control.
If we look at this all together, we can see that the real change over these last five years is that AI memory has become multi-layered.
So beyond model specs, we’re no longer just building systems that contain knowledge.
We are building systems that manage state across time, tools, documents and users. That changes usability, yes – but it also changes expectations around privacy, trust, provenance and control. The technical questions increasingly centre on what the system should remember, when, and under whose control.
Open this newsletter on LinkedIn and tell us what you think: what should AI remember, and what should it forget?
We’ll see you back here next week.
AI and artists collide again – so what’s next?