Workshop

Add Learning Memory to AI Agents with Hindsight

Build an agent that improves over time using state-of-the-art long-term memory
35 min · agent-memory · long-term-memory · LongMemEval · RAG-alternative · Python-SDK

What's happening

Most agent memory systems are glorified search engines over conversation logs. They store what was said and retrieve it when prompted — a pattern that amounts to sophisticated copy-paste. Hindsight, a new open-source project by Vectorize, takes a fundamentally different approach: it builds a system where agents learn from interactions rather than merely replaying them.

The distinction matters. Consider a human assistant who has worked with you for a year. They do not consult transcripts of your past conversations before responding; they have internalized your preferences, your communication style, and the context of your work. Hindsight attempts to give AI agents a comparable capacity.

The project has achieved state-of-the-art performance on the LongMemEval benchmark — a widely used evaluation framework for conversational AI memory systems — outperforming both retrieval-augmented generation (RAG) and knowledge-graph approaches. These results have been independently reproduced by researchers at the Virginia Tech Sanghani Center for AI and Data Analytics and The Washington Post, lending credibility that extends beyond vendor self-reporting.

Hindsight is already deployed in production at Fortune 500 enterprises. In this session, we will stand up a local Hindsight instance, integrate it into an LLM-powered agent, and explore the retain → recall → reflect loop that constitutes its core architecture.

1. Understand the Architecture: Learning vs. Remembering

Before we touch any tooling, we need to understand what makes Hindsight architecturally distinct from the memory approaches you may already know.

RAG-based memory stores raw conversation chunks in a vector database and retrieves them via semantic similarity search. This works well for factual lookups but fails when the agent needs to synthesize insights across many interactions — it retrieves fragments, not understanding.

Knowledge-graph memory extracts entities and relationships into a structured graph. This captures connections but struggles with nuance, context-dependent meaning, and the kind of soft preferences that characterize real human communication.

Hindsight introduces a three-phase loop:

  • Retain — The agent stores information from an interaction. This is not raw transcript storage; the system processes and structures the information for future learning.
  • Recall — The agent queries stored memories using natural language. Results are ranked by relevance to the current conversational context.
  • Reflect — The agent generates a disposition-aware response. This is the key differentiator: rather than returning raw search results, Hindsight synthesizes retrieved memories into a contextually appropriate answer that accounts for the agent's accumulated understanding.
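
Concretely, the loop maps onto three API operations against the local service. The sketch below is illustrative: the endpoint paths and payload field names are assumptions, not Hindsight's documented API, and your agent should produce the real equivalents.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8888"  # Hindsight API server used in this workshop

def build_request(op: str, bank_id: str, text: str) -> tuple[str, dict]:
    """Map a memory operation onto an (assumed) endpoint path and JSON payload.

    retain stores processed information; recall and reflect take a query.
    """
    field = "content" if op == "retain" else "query"
    return f"/banks/{bank_id}/{op}", {field: text}

def call(op: str, bank_id: str, text: str) -> dict:
    """POST the operation to the local Hindsight service."""
    path, payload = build_request(op, bank_id, text)
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# retain writes what was learned; recall searches; reflect synthesizes:
# call("retain", "demo", "User prefers concise answers")
# call("reflect", "demo", "How should I format my reply?")
```

The asymmetry in the payloads mirrors the asymmetry in the loop: retain is a write of processed knowledge, while recall and reflect are both reads driven by a natural-language query.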

The term "disposition-aware" deserves attention. In this context, disposition refers to the agent's learned orientation toward the user — their preferences, communication patterns, and contextual expectations. When reflecting, the system does not simply find relevant memories; it constructs a response shaped by everything the agent has learned about the user across all prior interactions.

Tip
A useful mental model: Recall is like searching your notes. Reflect is like thinking about what you know. The difference is synthesis — reflect produces understanding, not search results.
Why does RAG fall short for agent memory?

RAG retrieves document chunks based on embedding similarity, which works well for factual question-answering over a static corpus. However, agent memory requires several capabilities RAG does not naturally provide:

  1. Temporal reasoning — understanding that information from yesterday may supersede information from last month.
  2. Preference aggregation — synthesizing patterns across dozens of interactions (e.g., the user consistently prefers concise responses).
  3. Contextual disambiguation — recognizing that 'the project' means different things in different conversational contexts.
  4. Graceful contradiction handling — managing cases where the user's stated preferences have evolved over time.

Knowledge graphs address some of these issues but introduce their own limitations: they require explicit schema design, struggle with unstructured or ambiguous information, and typically demand significant engineering effort to maintain. Hindsight's approach attempts to combine the flexibility of unstructured storage with the reasoning capacity of structured systems.

2. Stand Up Hindsight Locally with Docker

Hindsight runs as a self-contained service via Docker, bundling its own PostgreSQL instance for memory storage. The architecture exposes two interfaces: an API server on port 8888 for programmatic access, and a web UI on port 9999 for visual inspection of stored memories.

The service requires an LLM provider API key because Hindsight uses a language model internally — not for generating end-user responses, but for processing and structuring memories during the retain and reflect phases. It supports multiple providers including OpenAI, Anthropic, Gemini, Groq, Ollama, and LM Studio.

We will ask our coding agent to produce the exact Docker launch command and verify the service is running.

Ask your agent
Get the agent to produce a Docker command that launches Hindsight locally, configured for your preferred LLM provider.
Think about it
  • What LLM provider will you use, and how should the API key be passed to the container?
  • Which ports need to be exposed, and what do they map to?
  • How should data persist between container restarts — what volume mount is needed?
  • What environment variable controls the LLM provider selection if you are not using OpenAI?
What the agent gives back

The agent should produce a single docker run command that pulls the latest Hindsight image, maps ports 8888 and 9999, passes your LLM API key as an environment variable, and mounts a local volume for PostgreSQL data persistence. If you are using a provider other than OpenAI, the command should also set HINDSIGHT_API_LLM_PROVIDER. The agent should also suggest a quick verification step — such as visiting http://localhost:9999 in a browser or issuing a curl request to the API health endpoint.
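
One plausible shape for that command is shown below. The image name, internal data path, and health endpoint are assumptions; check them against the Hindsight README before running. Ports 8888 and 9999 and the HINDSIGHT_API_LLM_PROVIDER variable come from this workshop's description of the service.

```shell
# Illustrative only: verify image name, volume target, and health path
# against the Hindsight documentation. 8888 = API server, 9999 = web UI.
docker run -d --name hindsight \
  -p 8888:8888 \
  -p 9999:9999 \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  -e HINDSIGHT_API_LLM_PROVIDER=openai \
  -v "$(pwd)/hindsight-data:/var/lib/hindsight" \
  vectorize/hindsight:latest

# Verification: the web UI should load, and the API should answer.
curl -s http://localhost:9999 >/dev/null && echo "UI up"
curl -s http://localhost:8888/health
```

The volume mount is what makes memories survive container restarts; without it, the bundled PostgreSQL data lives and dies with the container.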

API Key Note
You will need an API key from your chosen LLM provider (OpenAI, Anthropic, etc.). This key is used by Hindsight internally for memory processing — it is separate from any key your own agent application uses.
Warning
The Docker image bundles PostgreSQL internally. For production deployments, Hindsight also supports connecting to an external PostgreSQL instance via Docker Compose. The bundled configuration is appropriate for development and this workshop.
Why does a memory system need an LLM?

This is a reasonable question — if Hindsight is a memory layer, why does it need its own LLM access?

The answer lies in the retain and reflect operations. When information is retained, Hindsight does not simply store the raw text. It uses an LLM to extract structured knowledge, identify relationships, and determine how new information relates to existing memories. Similarly, during reflection, the LLM synthesizes retrieved memories into a coherent, disposition-aware response.

This is fundamentally different from a vector database, which stores and retrieves embeddings without understanding their content. Hindsight's LLM usage is an internal implementation detail — your application's LLM calls remain separate and under your control.

At this point, you should have Hindsight running locally. Verify by opening http://localhost:9999 in your browser — you should see the Hindsight web UI. The API at http://localhost:8888 should also respond to requests.

3. Wire Hindsight into an Agent: The Retain–Recall–Reflect Loop

With Hindsight running, we now integrate it into an agent application. Hindsight offers two integration paths:

  1. The LLM Wrapper — a drop-in replacement for your existing LLM client that automatically handles memory storage and retrieval. This requires changing approximately two lines of code in an existing application.
  2. The explicit API — direct calls to retain, recall, and reflect endpoints, giving you fine-grained control over when and how memories are managed.

We will use the explicit API approach in this workshop because it makes the memory lifecycle visible and comprehensible. The wrapper is more convenient for production use, but the explicit API reveals the mechanics we need to understand.

The central organizing concept is the memory bank — identified by a bank_id string. A memory bank is a namespace for memories, analogous to a database schema. You might create one bank per user, per project, or per conversational domain. All retain, recall, and reflect operations are scoped to a specific bank.

The workflow for a memory-augmented agent follows this pattern:

  1. On receiving user input: call recall or reflect with the user's message as the query, retrieving relevant context from prior interactions.
  2. Augment the LLM prompt: include the retrieved memories alongside the user's current message.
  3. After generating a response: call retain with the key information from the interaction — both what the user said and what the agent learned.
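
The three-step pattern above can be sketched as a single turn function. Here `hindsight_reflect`, `hindsight_retain`, and `llm_complete` are hypothetical stand-ins for the real Hindsight client and your LLM SDK; the orchestration is what matters.

```python
# Minimal memory-augmented chat turn. The three injected callables are
# hypothetical stand-ins for the Hindsight client and an LLM SDK.

def build_prompt(memories: str, user_message: str) -> list[dict]:
    """Step 2: fold retrieved memories into the LLM prompt."""
    system = "You are a helpful assistant."
    if memories:
        system += "\n\nWhat you know about this user:\n" + memories
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message},
    ]

def chat_turn(bank_id: str, user_message: str,
              hindsight_reflect, hindsight_retain, llm_complete) -> str:
    # Step 1: consult memory, using the user's message as the query.
    memories = hindsight_reflect(bank_id, user_message)
    # Step 2: generate a response with that context included.
    reply = llm_complete(build_prompt(memories, user_message))
    # Step 3: retain the salient information, not the raw transcript.
    hindsight_retain(bank_id, f"User said: {user_message}\nAssistant: {reply}")
    return reply
```

Passing the three callables as parameters keeps the sketch testable; in a real application they would wrap the Hindsight Python client and your LLM provider's SDK.
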
Ask your agent
Get the agent to build a simple Python assistant that uses the Hindsight client to retain information from conversations and reflect on stored memories when answering questions.
Think about it
  • What should the agent's conversational loop look like — how does it receive input, consult memory, and respond?
  • When should the agent call `retain` versus `reflect`? Think about which direction information flows in each case.
  • How should the memory bank be identified — one bank per user, per session, or per topic?
  • What information is worth retaining from each exchange? Raw transcripts, or extracted facts and preferences?
What the agent gives back

The agent should produce a command-line assistant that initializes a Hindsight client pointing at localhost:8888, creates a named memory bank, and implements a conversational loop. On each turn, it calls reflect with the user's query to retrieve disposition-aware context, passes that context plus the user's message to an LLM for response generation, then calls retain with the salient information from the exchange. The key architectural insight is that reflect returns synthesized knowledge (not raw chunks), and retain stores processed information (not raw transcripts). The application should be roughly 30–40 lines of Python, but the essential pattern is just the three API calls orchestrated within the conversation loop.

Tip
When prompting your agent, emphasize that you want the explicit API (retain/recall/reflect calls), not the LLM wrapper. The wrapper hides the mechanics, which defeats the purpose of learning how the system works.
Recall vs. Reflect: when to use which

Both recall and reflect retrieve information from memory, but they serve different purposes:

Recall performs a search and returns matching memory entries ranked by relevance. The results are raw — you receive the stored information as-is and must incorporate it into your prompt yourself. Use recall when you need fine-grained control over how memories are presented to the LLM, or when you want to inspect what the system has stored.

Reflect performs the same retrieval but adds a synthesis step: it uses an LLM to generate a coherent, contextually appropriate summary of the relevant memories. The output is a disposition-aware response — it accounts for the full context of what the agent has learned about the user. Use reflect when you want Hindsight to do the heavy lifting of memory integration.

For most agent applications, reflect is the more useful operation. Recall is valuable for debugging, auditing, and cases where you need to present specific memories to the user.

Memory bank design strategies

The choice of how to partition memory banks has significant implications:

  • Per-user banks — the most common pattern. Each user gets their own bank, and the agent learns about each user independently. This is appropriate for personal assistants and customer service agents.
  • Per-project banks — useful for agents that manage work across multiple projects. The agent can recall project-specific context without cross-contamination.
  • Shared banks — all users contribute to and draw from a single memory pool. This is appropriate for agents that need collective knowledge, such as a team knowledge base or a shared FAQ system.
  • Hierarchical banks — combining approaches (e.g., a per-user bank plus a shared organizational bank). The agent queries both and merges the results.

For this workshop, a single bank is sufficient. In production, the bank architecture should reflect your application's domain model.
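
The hierarchical pattern can be sketched as a small merge function, where `reflect_fn` is a hypothetical callable wrapping Hindsight's reflect operation and the bank naming scheme is invented for illustration.

```python
# Hierarchical bank sketch: query a per-user bank and a shared
# organizational bank, then merge the two syntheses into one context block.

def hierarchical_context(user_id: str, query: str, reflect_fn) -> str:
    personal = reflect_fn(f"user-{user_id}", query)  # preferences, history
    shared = reflect_fn("org-shared", query)         # collective knowledge
    parts = []
    if personal:
        parts.append("About this user:\n" + personal)
    if shared:
        parts.append("Team knowledge:\n" + shared)
    return "\n\n".join(parts)
```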

Quick Check

Your agent needs to answer a question about a user's dietary preferences, which they mentioned across three separate conversations last week. Which Hindsight operation is most appropriate?
✗ Retain: Not quite. Retain is for storing new information, not retrieving existing knowledge. The dietary preferences are already stored; we need to access them.
✗ Recall: Not quite. Recall would return individual memory entries, which would then need manual synthesis. It is useful for debugging but adds unnecessary complexity for this use case.
✓ Reflect: Correct! It retrieves relevant memories and synthesizes them into a coherent understanding — exactly what is needed when the answer spans multiple past interactions. The disposition-aware synthesis will account for any evolution in the user's preferences over time.

4. Demonstrate Cross-Session Learning

The real test of an agent memory system is not whether it can retrieve information from the current session — any context window can do that. The test is whether the agent demonstrably improves across sessions, exhibiting behavior that reflects accumulated understanding rather than pattern-matching against retrieved text.

We will now conduct a structured experiment: interact with our assistant across multiple simulated sessions, introducing information gradually, and then verify that the agent can synthesize knowledge it acquired across separate conversations.

This exercise illustrates the critical distinction between remembering and learning:

  • An agent that remembers can tell you what was said in a previous conversation.
  • An agent that learns can draw inferences from information spread across multiple conversations, even when those inferences were never explicitly stated.

For example, if in session one you mention you are vegetarian, and in session two you mention you are hosting a dinner party, a learning agent should proactively consider vegetarian menu options in session three — without being reminded of the dietary constraint.

Ask your agent
Get the agent to write a test script that simulates three separate sessions with the assistant, seeding specific information in each, then poses a question in a fourth session that requires synthesizing knowledge from all three.
Think about it
  • What information should each simulated session introduce? Think about facts that are independently unremarkable but collectively meaningful.
  • How do you simulate 'separate sessions' — what changes between sessions and what persists?
  • What question in the final session would be impossible to answer without cross-session synthesis?
  • How will you verify the agent's response demonstrates genuine learning rather than lucky retrieval?
What the agent gives back

The agent should produce a test script that makes a series of API calls simulating distinct conversational sessions. Each session retains different pieces of information into the same memory bank — for instance, session one establishes the user's role, session two introduces a current project, and session three mentions a deadline. The final session calls reflect with a question that requires combining all three facts (e.g., 'What should I prioritize this week?'). The script should print the reflect response and include a brief validation check — confirming that the response references information from all three prior sessions. The key insight is that the memory bank persists across sessions while conversational context does not.
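
A skeleton of such a script is sketched below, with `retain_fn` and `reflect_fn` as hypothetical wrappers around the Hindsight API and deliberately simple seed facts.

```python
# Cross-session experiment sketch. Each "session" is a fresh conversational
# context; only the memory bank persists between them.

SESSIONS = [
    "I work as a data engineer on the analytics team.",   # session 1: role
    "My current project is migrating our ETL to Spark.",  # session 2: project
    "The migration deadline is this Friday.",             # session 3: deadline
]

def seed_sessions(bank_id: str, retain_fn) -> None:
    for fact in SESSIONS:
        retain_fn(bank_id, fact)  # same bank, separate simulated sessions

def run_experiment(bank_id: str, retain_fn, reflect_fn) -> tuple[str, bool]:
    seed_sessions(bank_id, retain_fn)
    # Session 4: a question no single session can answer alone.
    answer = reflect_fn(bank_id, "What should I prioritize this week?")
    # Crude validation: the synthesis should touch the seeded facts.
    ok = all(term in answer.lower() for term in ("spark", "friday"))
    return answer, ok
```

The validation is intentionally rough; a stricter check might ask a second LLM to judge whether the answer combines all three facts rather than string-matching.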

Tip
When seeding your test sessions, choose facts that are complementary but not redundant. The synthesis question should require combining information, not merely retrieving a single fact. This is what separates memory systems that learn from those that merely search.
How LongMemEval tests exactly this capability

The LongMemEval benchmark is specifically designed to evaluate the kind of cross-session synthesis we are testing here. It presents memory systems with extended conversational histories and then poses questions that require:

  1. Single-hop retrieval — finding a specific fact mentioned once in a long history.
  2. Multi-hop reasoning — combining facts from different parts of the history to derive an answer.
  3. Temporal reasoning — understanding which information is current versus outdated.
  4. Preference tracking — recognizing patterns in user behavior and preferences over time.
  5. Knowledge update — correctly handling cases where later information contradicts earlier statements.

Hindsight's state-of-the-art performance on this benchmark indicates that its retain–recall–reflect architecture handles these diverse memory tasks more effectively than systems based on RAG or knowledge graphs. The independent reproduction of these results by Virginia Tech and The Washington Post lends additional confidence to the claims.

You should now have a working assistant that demonstrably retains information across sessions and synthesizes it when answering questions. Run your test script and verify that the final `reflect` response incorporates knowledge from all simulated sessions.

5. The LLM Wrapper: Production-Grade Integration

Having understood the explicit API, we can now appreciate the convenience of Hindsight's LLM Wrapper — a drop-in replacement for your existing LLM client that handles retain and reflect automatically behind the scenes.

The wrapper intercepts your standard LLM API calls and transparently:

  1. Calls reflect before each LLM request, enriching the prompt with relevant memories.
  2. Calls retain after each response, storing salient information from the exchange.

This means an existing agent application can gain long-term memory by changing its client initialization — no modification to the conversational logic, prompt templates, or response handling.

The trade-off is control. The wrapper makes decisions about what to retain and when to reflect that may not match every application's needs. For many use cases, these defaults are appropriate. For applications with specific memory management requirements — selective retention, multiple memory banks, or custom reflection triggers — the explicit API remains the better choice.
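
A toy re-implementation makes the wrapper's behavior concrete. This is not Hindsight's actual wrapper; the class and parameter names are invented, but it shows the reflect-before/retain-after interception described above.

```python
# Toy version of the wrapper pattern: every chat turn reflects first
# and retains afterward, with no memory code in the calling loop.

class MemoryWrappedLLM:
    """Wrap an LLM callable so each turn is enriched and then stored."""

    def __init__(self, llm_complete, reflect_fn, retain_fn, bank_id: str):
        self.llm = llm_complete    # your existing LLM call
        self.reflect = reflect_fn  # hypothetical Hindsight reflect wrapper
        self.retain = retain_fn    # hypothetical Hindsight retain wrapper
        self.bank = bank_id

    def chat(self, user_message: str) -> str:
        # 1. Enrich the prompt with disposition-aware context.
        context = self.reflect(self.bank, user_message)
        reply = self.llm(f"{context}\n\nUser: {user_message}")
        # 2. Store the salient information from the exchange.
        self.retain(self.bank, f"{user_message} -> {reply}")
        return reply
```

Your conversational loop then shrinks to `client.chat(message)`; the trade-off, as noted above, is that retention and reflection timing are no longer yours to control.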

Ask your agent
Get the agent to refactor your explicit-API assistant to use the LLM Wrapper instead, reducing the memory management code to approximately two lines.
Think about it
  • What changes in the client initialization when switching to the wrapper?
  • Which parts of your existing code become unnecessary — what does the wrapper handle for you?
  • How does the wrapper know which memory bank to use?
  • What do you lose by using the wrapper instead of explicit API calls?
What the agent gives back

The agent should produce a simplified version of the assistant where the Hindsight-wrapped LLM client replaces both the standard LLM client and the explicit retain/recall/reflect calls. The conversational loop should shrink significantly — the agent simply sends messages through the wrapped client, and memory management happens automatically. The agent should note that the wrapper approach is ideal for rapid prototyping and standard use cases, while the explicit API is preferable when you need fine-grained control over the memory lifecycle.

Tip
The wrapper pattern is not unique to Hindsight — it reflects a broader trend in AI tooling where infrastructure concerns (memory, caching, observability) are handled via transparent client wrappers rather than explicit application code. Understanding this pattern prepares you for similar integrations with other tools.

Your Turn

Build a domain-specific learning agent: a technical support assistant that remembers each user's system configuration, past issues, and resolution history — and uses that knowledge to provide increasingly personalized troubleshooting guidance.
A meaningful test of agent memory is whether it can improve at a specific professional task. Technical support is an excellent domain because it involves recurring users with persistent configurations, evolving issues that build on prior context, and resolution patterns that benefit from historical knowledge. Your agent should demonstrate that its third interaction with a user is measurably more efficient than its first.
Think about it
  • How should memory banks be structured — per user, per product, or per issue category? What are the trade-offs?
  • What information from each support interaction is worth retaining? Consider the difference between symptoms, diagnoses, and resolutions.
  • How should the agent use `reflect` at the start of each interaction to establish context before the user has described their current issue?
  • How would you verify that the agent's performance genuinely improves over successive interactions — what metrics or observable behaviors would demonstrate learning?
One way you could prompt it
Build a Python technical support assistant using the Hindsight client library (hindsight-client, connecting to localhost:8888). Structure it as follows:

1. Each user gets their own memory bank (bank_id = user's name or ID).
2. At the start of each interaction, call reflect with a general context query like 'What do I know about this user's system and past issues?' and include the synthesis in the system prompt.
3. During the conversation, when the user describes an issue, call recall to check for similar past issues and their resolutions.
4. After resolving an issue, call retain with structured information: the symptom, the diagnosis, the resolution, and the user's system configuration details mentioned during the exchange.
5. Include a simple test sequence: simulate three interactions with the same user — first establishing their system config, second resolving a basic issue, third presenting a related issue where the agent should proactively reference the prior resolution and known configuration.

The key behavior to demonstrate: by the third interaction, the agent should reference the user's known configuration and prior issue history without being reminded.

Quick Check

You are building an agent that serves a team of 20 people working on the same project. How should you design the memory bank architecture?
✗ A single shared bank: Not quite. It would blend individual preferences with shared project knowledge, making it difficult for the agent to personalize responses. It also risks surfacing one team member's private context to another.
✗ One bank per person: Not quite. Purely individual banks would mean the agent cannot leverage collective project knowledge. If one team member resolves a technical issue, that knowledge would not be available when another team member encounters the same problem.
✓ A per-person bank plus a shared project bank: Correct! This hierarchical approach preserves individual personalization while enabling collective learning. The agent queries both banks and merges the context — personal preferences come from the individual bank, project knowledge comes from the shared bank. This mirrors how human teams operate: shared documentation plus individual expertise.

Recap

In this session, we moved from understanding the theoretical limitations of existing agent memory approaches — RAG and knowledge graphs — to deploying and integrating a system that addresses those limitations.

We stood up Hindsight locally via Docker, explored its three core operations (retain, recall, reflect), built an assistant that uses the explicit API to manage memories, verified cross-session learning through a structured test, and examined the LLM Wrapper pattern for production-grade integration.

The central insight is the distinction between remembering and learning. An agent that remembers can retrieve what was said. An agent that learns can synthesize understanding from information distributed across many interactions, draw inferences that were never explicitly stated, and adapt its behavior based on accumulated knowledge. Hindsight's reflect operation — with its disposition-aware synthesis — is the mechanism that bridges this gap.

Where to go next

  • Explore Hindsight's web UI at localhost:9999 to inspect stored memories and understand what the system retains from your interactions.
  • Read the Hindsight research paper to understand the formal architecture behind disposition-aware memory retrieval.
  • Experiment with the LongMemEval benchmark yourself to develop intuition for the types of memory tasks that distinguish systems.
  • Investigate multi-bank architectures for a real application — the design of memory bank boundaries is often the most consequential architectural decision in memory-augmented agents.
