Most agent memory systems are glorified search engines over conversation logs. They store what was said and retrieve it when prompted — a pattern that amounts to sophisticated copy-paste. Hindsight, a new open-source project by Vectorize, takes a fundamentally different approach: it builds a system where agents learn from interactions rather than merely replaying them.
The distinction matters. Consider a human assistant who has worked with you for a year. They do not consult transcripts of your past conversations before responding; they have internalized your preferences, your communication style, and the context of your work. Hindsight attempts to give AI agents a comparable capacity.
The project has achieved state-of-the-art performance on the LongMemEval benchmark — a widely used evaluation framework for conversational AI memory systems — outperforming both retrieval-augmented generation (RAG) and knowledge-graph approaches. These results have been independently reproduced by researchers at the Virginia Tech Sanghani Center for AI and Data Analytics and The Washington Post, lending credibility that extends beyond vendor self-reporting.
Hindsight is already deployed in production at Fortune 500 enterprises. In this session, we will stand up a local Hindsight instance, integrate it into an LLM-powered agent, and explore the retain → recall → reflect loop that constitutes its core architecture.
Before we touch any tooling, we need to understand what makes Hindsight architecturally distinct from the memory approaches you may already know.
RAG-based memory stores raw conversation chunks in a vector database and retrieves them via semantic similarity search. This works well for factual lookups but fails when the agent needs to synthesize insights across many interactions — it retrieves fragments, not understanding.
Knowledge-graph memory extracts entities and relationships into a structured graph. This captures connections but struggles with nuance, context-dependent meaning, and the kind of soft preferences that characterize real human communication.
Hindsight introduces a three-phase loop:

1. Retain: process and store new information, extracting structured knowledge rather than saving raw transcripts.
2. Recall: search stored memories and return the relevant entries.
3. Reflect: synthesize retrieved memories into a coherent, disposition-aware response.
The term "disposition-aware" deserves attention. In this context, disposition refers to the agent's learned orientation toward the user — their preferences, communication patterns, and contextual expectations. When reflecting, the system does not simply find relevant memories; it constructs a response shaped by everything the agent has learned about the user across all prior interactions.
RAG retrieves document chunks based on embedding similarity, which works well for factual question-answering over a static corpus. However, agent memory requires several capabilities RAG does not naturally provide: synthesizing insights scattered across many sessions, tracking facts that change over time, and accumulating soft preferences that are never stated as discrete, retrievable facts.
Knowledge graphs address some of these issues but introduce their own limitations: they require explicit schema design, struggle with unstructured or ambiguous information, and typically demand significant engineering effort to maintain. Hindsight's approach attempts to combine the flexibility of unstructured storage with the reasoning capacity of structured systems.
Hindsight runs as a self-contained service via Docker, bundling its own PostgreSQL instance for memory storage. The architecture exposes two interfaces: an API server on port 8888 for programmatic access, and a web UI on port 9999 for visual inspection of stored memories.
The service requires an LLM provider API key because Hindsight uses a language model internally — not for generating end-user responses, but for processing and structuring memories during the retain and reflect phases. It supports multiple providers including OpenAI, Anthropic, Gemini, Groq, Ollama, and LM Studio.
We will ask our coding agent to produce the exact Docker launch command and verify the service is running.
The agent should produce a single docker run command that pulls the latest Hindsight image, maps ports 8888 and 9999, passes your LLM API key as an environment variable, and mounts a local volume for PostgreSQL data persistence. If you are using a provider other than OpenAI, the command should also set HINDSIGHT_API_LLM_PROVIDER. The agent should also suggest a quick verification step — such as visiting http://localhost:9999 in a browser or issuing a curl request to the API health endpoint.
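A plausible shape for that launch command follows. The image name, the API-key variable, and the container data path are assumptions to be checked against the Hindsight documentation; the port numbers and the HINDSIGHT_API_LLM_PROVIDER variable come from the text above.

```shell
# Launch sketch: image name, key variable, and data path are assumptions;
# HINDSIGHT_API_LLM_PROVIDER is only needed for non-OpenAI providers.
docker run -d \
  --name hindsight \
  -p 8888:8888 \
  -p 9999:9999 \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  -e HINDSIGHT_API_LLM_PROVIDER=openai \
  -v "$PWD/hindsight-data:/var/lib/postgresql/data" \
  vectorize/hindsight:latest

# Verify: open http://localhost:9999 in a browser, or probe the API
# (the health endpoint path is an assumption):
curl -fsS http://localhost:8888/health
```

Treat this as a template to adjust, not a copy-paste command: confirm each name against the project's README before running it.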
A reasonable question at this point: if Hindsight is a memory layer, why does it need its own LLM access?
The answer lies in the retain and reflect operations. When information is retained, Hindsight does not simply store the raw text. It uses an LLM to extract structured knowledge, identify relationships, and determine how new information relates to existing memories. Similarly, during reflection, the LLM synthesizes retrieved memories into a coherent, disposition-aware response.
This is fundamentally different from a vector database, which stores and retrieves embeddings without understanding their content. Hindsight's LLM usage is an internal implementation detail — your application's LLM calls remain separate and under your control.
With Hindsight running, we now integrate it into an agent application. Hindsight offers two integration paths: the explicit API, where your code calls retain, recall, and reflect directly, and the LLM Wrapper, a drop-in replacement for your LLM client that manages memory automatically.
We will use the explicit API approach in this workshop because it makes the memory lifecycle visible and comprehensible. The wrapper is more convenient for production use, but the explicit API reveals the mechanics we need to understand.
The central organizing concept is the memory bank — identified by a bank_id string. A memory bank is a namespace for memories, analogous to a database schema. You might create one bank per user, per project, or per conversational domain. All retain, recall, and reflect operations are scoped to a specific bank.
The workflow for a memory-augmented agent follows this pattern:
1. Call recall or reflect with the user's message as the query, retrieving relevant context from prior interactions.
2. Generate a response with your application's LLM, combining the retrieved context with the user's message.
3. Call retain with the key information from the interaction — both what the user said and what the agent learned.

The agent should produce a command-line assistant that initializes a Hindsight client pointing at localhost:8888, creates a named memory bank, and implements a conversational loop. On each turn, it calls reflect with the user's query to retrieve disposition-aware context, passes that context plus the user's message to an LLM for response generation, then calls retain with the salient information from the exchange. The key architectural insight is that reflect returns synthesized knowledge (not raw chunks), and retain stores processed information (not raw transcripts). The application should be roughly 30–40 lines of Python, but the essential pattern is just the three API calls orchestrated within the conversation loop.
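A minimal sketch of that loop is below. The endpoint paths, payload fields, and response shape used by hindsight_call are assumptions about Hindsight's HTTP API, not its documented interface; the point is the three-call orchestration, with response generation kept under your control.

```python
# Sketch of the explicit-API loop. The endpoint paths, payload fields,
# and response shape are assumptions about Hindsight's HTTP API.
import json
import urllib.request

HINDSIGHT_URL = "http://localhost:8888"
BANK_ID = "workshop-demo"

def hindsight_call(operation: str, payload: dict) -> dict:
    """POST to an assumed /banks/{bank}/{operation} endpoint."""
    req = urllib.request.Request(
        f"{HINDSIGHT_URL}/banks/{BANK_ID}/{operation}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def build_prompt(context: str, user_message: str) -> str:
    """Combine reflect output with the new message for your response LLM."""
    return f"Relevant context from memory:\n{context}\n\nUser: {user_message}"

def chat_turn(user_message: str, generate) -> str:
    """One turn: reflect -> generate (your own LLM call) -> retain."""
    context = hindsight_call("reflect", {"query": user_message}).get("text", "")
    answer = generate(build_prompt(context, user_message))
    hindsight_call("retain", {
        "content": f"User said: {user_message}\nAgent replied: {answer}"
    })
    return answer
```

Note that the generate callable is your application's LLM client; Hindsight's own LLM usage stays internal to the reflect and retain operations.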
Both recall and reflect retrieve information from memory, but they serve different purposes:
Recall performs a search and returns matching memory entries ranked by relevance. The results are raw — you receive the stored information as-is and must incorporate it into your prompt yourself. Use recall when you need fine-grained control over how memories are presented to the LLM, or when you want to inspect what the system has stored.
Reflect performs the same retrieval but adds a synthesis step: it uses an LLM to generate a coherent, contextually appropriate summary of the relevant memories. The output is a disposition-aware response — it accounts for the full context of what the agent has learned about the user. Use reflect when you want Hindsight to do the heavy lifting of memory integration.
For most agent applications, reflect is the more useful operation. Recall is valuable for debugging, auditing, and cases where you need to present specific memories to the user.
The choice of how to partition memory banks has significant implications: a bank per user keeps individual preferences isolated, a bank per project lets multiple users build on shared context, and a single overly broad bank risks blending unrelated memories into the agent's learned disposition.
For this workshop, a single bank is sufficient. In production, the bank architecture should reflect your application's domain model.
The real test of an agent memory system is not whether it can retrieve information from the current session — any context window can do that. The test is whether the agent demonstrably improves across sessions, exhibiting behavior that reflects accumulated understanding rather than pattern-matching against retrieved text.
We will now conduct a structured experiment: interact with our assistant across multiple simulated sessions, introducing information gradually, and then verify that the agent can synthesize knowledge it acquired across separate conversations.
This exercise illustrates the critical distinction between remembering and learning: a remembering agent retrieves what was said when asked; a learning agent combines facts acquired in separate conversations to reach conclusions that were never explicitly stated.
For example, if in session one you mention you are vegetarian, and in session two you mention you are hosting a dinner party, a learning agent should proactively consider vegetarian menu options in session three — without being reminded of the dietary constraint.
The agent should produce a test script that makes a series of API calls simulating distinct conversational sessions. Each session retains different pieces of information into the same memory bank — for instance, session one establishes the user's role, session two introduces a current project, and session three mentions a deadline. The final session calls reflect with a question that requires combining all three facts (e.g., 'What should I prioritize this week?'). The script should print the reflect response and include a brief validation check — confirming that the response references information from all three prior sessions. The key insight is that the memory bank persists across sessions while conversational context does not.
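One way to script that experiment is sketched below, with the same caveat that hindsight_call's endpoints are assumed rather than documented; the mentions_all check is ordinary Python and can be reused as-is.

```python
# Cross-session test sketch. The hindsight_call endpoints are assumptions
# about Hindsight's HTTP API; mentions_all is a plain validation helper.
import json
import urllib.request

BANK = "cross-session-test"

def hindsight_call(operation: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"http://localhost:8888/banks/{BANK}/{operation}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def mentions_all(response: str, facts: list) -> bool:
    """Confirm the reflect output references every seeded fact."""
    return all(fact.lower() in response.lower() for fact in facts)

def run_cross_session_test():
    # Three "sessions": each retains one fact; only the bank is shared.
    for fact in [
        "The user is a data engineer",                 # session 1: role
        "The user is migrating a pipeline to Spark",   # session 2: project
        "The migration deadline is Friday",            # session 3: deadline
    ]:
        hindsight_call("retain", {"content": fact})

    # Final "session": answerable only by combining all three facts.
    result = hindsight_call("reflect",
                            {"query": "What should I prioritize this week?"})
    answer = result.get("text", "")
    print(answer)
    print("PASS" if mentions_all(answer, ["Spark", "Friday"]) else "CHECK MANUALLY")
```

The string-matching validation is deliberately crude; a reflect response can pass the test in spirit while paraphrasing, so treat a failure as a prompt for manual inspection rather than a hard verdict.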
The LongMemEval benchmark is specifically designed to evaluate the kind of cross-session synthesis we are testing here. It presents memory systems with extended conversational histories and then poses questions that require extracting specific facts, reasoning across multiple sessions, handling temporal references, tracking knowledge that changes over time, and abstaining when the answer is not in memory.
Hindsight's state-of-the-art performance on this benchmark indicates that its retain–reflect architecture handles these diverse memory tasks more effectively than systems based on RAG or knowledge graphs. The independent reproduction of these results by Virginia Tech and The Washington Post lends additional confidence to the claims.
Having understood the explicit API, we can now appreciate the convenience of Hindsight's LLM Wrapper — a drop-in replacement for your existing LLM client that handles retain and reflect automatically behind the scenes.
The wrapper intercepts your standard LLM API calls and transparently:
- Calls reflect before each LLM request, enriching the prompt with relevant memories.
- Calls retain after each response, storing salient information from the exchange.

This means an existing agent application can gain long-term memory by changing its client initialization — no modification to the conversational logic, prompt templates, or response handling.
The trade-off is control. The wrapper makes decisions about what to retain and when to reflect that may not match every application's needs. For many use cases, these defaults are appropriate. For applications with specific memory management requirements — selective retention, multiple memory banks, or custom reflection triggers — the explicit API remains the better choice.
The agent should produce a simplified version of the assistant where the Hindsight-wrapped LLM client replaces both the standard LLM client and the explicit retain/recall/reflect calls. The conversational loop should shrink significantly — the agent simply sends messages through the wrapped client, and memory management happens automatically. The agent should note that the wrapper approach is ideal for rapid prototyping and standard use cases, while the explicit API is preferable when you need fine-grained control over the memory lifecycle.
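The interception pattern itself is easy to see in miniature. The class below is not Hindsight's actual wrapper; it is a generic sketch of the reflect-before / retain-after flow, with all three collaborators injected as callables.

```python
# Generic sketch of the wrapper's interception pattern -- not Hindsight's
# real client. llm, reflect, and retain are injected callables.
class MemoryWrappedLLM:
    def __init__(self, llm, reflect, retain):
        self.llm = llm          # prompt -> response
        self.reflect = reflect  # query -> synthesized context
        self.retain = retain    # text -> None

    def chat(self, message: str) -> str:
        context = self.reflect(message)                     # enrich before the call
        response = self.llm(f"{context}\n\n{message}")      # the normal LLM request
        self.retain(f"User: {message}\nAgent: {response}")  # store afterwards
        return response
```

With real clients plugged in, the conversational loop reduces to repeated chat calls: the memory lifecycle disappears from application code, which is exactly the convenience-for-control trade-off described above.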
Build a Python technical support assistant using the Hindsight client library (hindsight-client, connecting to localhost:8888). Structure it as follows:
1. Each user gets their own memory bank (bank_id = user's name or ID).
2. At the start of each interaction, call reflect with a general context query like 'What do I know about this user's system and past issues?' and include the synthesis in the system prompt.
3. During the conversation, when the user describes an issue, call recall to check for similar past issues and their resolutions.
4. After resolving an issue, call retain with structured information: the symptom, the diagnosis, the resolution, and the user's system configuration details mentioned during the exchange.
5. Include a simple test sequence: simulate three interactions with the same user — first establishing their system config, second resolving a basic issue, third presenting a related issue where the agent should proactively reference the prior resolution and known configuration.
The key behavior to demonstrate: by the third interaction, the agent should reference the user's known configuration and prior issue history without being reminded.
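A starting skeleton for the exercise is sketched below. The client.reflect call shows an assumed method shape for the hindsight-client library, not its documented signature; format_issue_record is plain Python implementing the structured record of step 4.

```python
# Skeleton for the support assistant. client.reflect is an assumed shape
# for the hindsight-client library; format_issue_record is ordinary Python.
def format_issue_record(symptom: str, diagnosis: str,
                        resolution: str, config: str) -> str:
    """Structure a resolved issue so retain stores it as labeled fields (step 4)."""
    return (
        f"Symptom: {symptom}\n"
        f"Diagnosis: {diagnosis}\n"
        f"Resolution: {resolution}\n"
        f"System config: {config}"
    )

def start_session(user_id: str, client) -> str:
    """Steps 1-2: per-user bank, then seed the system prompt from memory."""
    bank_id = user_id  # step 1: one memory bank per user
    known = client.reflect(  # step 2: assumed method shape
        bank_id,
        "What do I know about this user's system and past issues?",
    )
    return f"You are a technical support assistant. Known context:\n{known}"
```

The per-issue recall (step 3) and the three-interaction test sequence (step 5) hang off this skeleton in the same style: every memory operation is scoped to the user's bank_id.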
In this session, we moved from understanding the theoretical limitations of existing agent memory approaches — RAG and knowledge graphs — to deploying and integrating a system that addresses those limitations.
We stood up Hindsight locally via Docker, explored its three core operations (retain, recall, reflect), built an assistant that uses the explicit API to manage memories, verified cross-session learning through a structured test, and examined the LLM Wrapper pattern for production-grade integration.
The central insight is the distinction between remembering and learning. An agent that remembers can retrieve what was said. An agent that learns can synthesize understanding from information distributed across many interactions, draw inferences that were never explicitly stated, and adapt its behavior based on accumulated knowledge. Hindsight's reflect operation — with its disposition-aware synthesis — is the mechanism that bridges this gap.