Local Deep Research (LDR) has emerged as one of the most capable open-source research assistants available, recently achieving approximately 95% accuracy on the SimpleQA benchmark when paired with GPT-4.1-mini. What distinguishes LDR from conventional retrieval-augmented generation systems is its agentic architecture: rather than following a fixed pipeline — search, retrieve, summarize — LDR's new LangGraph agent strategy allows the underlying language model to autonomously decide which specialized search engines to query, when to switch between them, and when enough evidence has been gathered to synthesize a response.
This matters for a specific reason. Most AI-assisted research tools treat search as a single step: the user asks a question, the system queries one source, and returns a summary. LDR treats search as an iterative reasoning process. It might begin with a broad web search, recognize from the results that the question is biomedical in nature, pivot to PubMed, discover a relevant preprint reference, query arXiv for it, and only then synthesize — all without human intervention.
Equally significant is the personal knowledge base. Every research session produces sources — papers, articles, web pages. LDR allows you to download these directly into an encrypted, indexed library. Future queries then search your accumulated library alongside the live web. The knowledge compounds: each session makes the next one richer.
In this workshop, we will deploy LDR as a fully local stack (Ollama for the LLM, SearXNG for web search), explore its agent strategy architecture, and build a personal knowledge base that grows with use. Everything runs on your hardware, encrypted with AES-256. No data leaves your machine unless you choose a cloud LLM provider.
The LDR stack comprises three services: an LLM inference server (Ollama), a meta-search engine (SearXNG), and the LDR application itself. These services communicate over a shared Docker network, forming a self-contained research pipeline.
The architecture follows a pattern common in modern AI applications: separation of inference from orchestration. Ollama handles token generation, SearXNG handles search federation across dozens of upstream engines, and LDR orchestrates the research workflow — deciding what to search, interpreting results, and producing structured reports.
Rather than manually writing Docker Compose configurations, we will have our AI agent generate a deployment script tailored to our hardware. This is a useful exercise in specifying infrastructure requirements through natural language.
The agent should produce a Docker Compose file defining three services — ollama (port 11434), searxng (port 8080), and local-deep-research (port 5000) — connected via a shared bridge network. Ollama should have a named volume for model storage. LDR should have a named volume mounted at /data with the LDR_DATA_DIR environment variable set accordingly. The agent should also provide the one-line command to pull a suitable model into Ollama after startup. For GPU users, the agent should include the NVIDIA runtime configuration as a separate override file or a noted modification.
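A compose file along these lines might look like the following sketch. The image names, the Ollama model path, and the service wiring are assumptions to verify against each project's documentation; the port and volume layout follows the specification above.

```yaml
services:
  ollama:
    image: ollama/ollama          # assumed image name; verify on Docker Hub
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
    networks: [ldr-net]

  searxng:
    image: searxng/searxng        # assumed image name; verify on Docker Hub
    ports:
      - "8080:8080"
    networks: [ldr-net]

  local-deep-research:
    image: localdeepresearch/local-deep-research  # assumed image name
    ports:
      - "5000:5000"
    environment:
      - LDR_DATA_DIR=/data
    volumes:
      - ldr-data:/data
    depends_on: [ollama, searxng]
    networks: [ldr-net]

volumes:
  ollama-models:
  ldr-data:

networks:
  ldr-net:
    driver: bridge
```

After startup, a model can be pulled into the running Ollama container with `docker compose exec ollama ollama pull gpt-oss:20b`. GPU users would layer an override file that adds the NVIDIA device reservation to the ollama service.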
gpt-oss:20b needs approximately 12–16 GB of RAM (or VRAM for GPU inference). If your machine has limited resources, substitute a smaller model such as llama3.2:3b or phi3:mini. Research quality will decrease, but the workflow remains functional.
Ollama provides a standardized OpenAI-compatible API layer over local model inference. This means LDR can use the same client code regardless of whether the backend is a local 7B model or a cloud-hosted GPT-4. The abstraction also allows hot-swapping models without reconfiguring LDR — you simply change the model name in settings. For privacy-sensitive research, local inference ensures that your queries never leave your network. The trade-off is inference speed: a local 20B model on consumer hardware will be substantially slower than a cloud API call, but the latency is acceptable for research workflows where you are willing to wait 30–60 seconds for a thorough answer.
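The abstraction is visible in the request shape itself: an OpenAI-style chat-completions body sent to Ollama at `http://localhost:11434/v1/chat/completions` is identical to one sent to a cloud provider, apart from the base URL and model name. The helper below is illustrative, not LDR's own code.

```python
# Sketch: the OpenAI-compatible chat-completions payload that a client can
# send to Ollama's /v1/chat/completions endpoint. Swapping backends changes
# only the base URL and model name; the payload shape stays the same.

def build_chat_request(model: str, question: str) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a research assistant."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.2,  # low temperature for factual research
    }

# The same function serves a local model or a cloud one:
local_req = build_chat_request("gpt-oss:20b", "Summarize CRISPR delivery methods.")
cloud_req = build_chat_request("gpt-4.1-mini", "Summarize CRISPR delivery methods.")
```

This is exactly what makes hot-swapping models a one-line settings change rather than a code change.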
LDR offers over 20 research strategies, ranging from simple factual lookup to deep multi-source analysis. The most sophisticated is the LangGraph agent strategy, which represents a fundamental shift from pipeline-based to agent-based research.
In a pipeline strategy, the execution path is predetermined: receive query → search web → collect results → summarize. The system follows the same steps regardless of the query's nature. In the agent strategy, the LLM itself becomes the orchestrator. It receives the query, reasons about which information sources are most relevant, issues tool calls to specialized search engines, evaluates the returned results, and decides whether to search again, switch engines, or synthesize.
This is implemented using LangGraph, a framework for building stateful, multi-step agent workflows as directed graphs. Each node in the graph represents a capability — web search, arXiv query, PubMed lookup, document retrieval — and the LLM navigates between nodes based on its assessment of what information is still needed.
The practical consequence is that the agent strategy tends to collect significantly more sources and produce more comprehensive reports than pipeline strategies, because it adapts its search behavior to the specific question rather than following a fixed recipe.
The agent should describe the LangGraph agent strategy as a state machine where the LLM operates as the router. The state object accumulates search results, source metadata, and a running assessment of coverage. At each step, the LLM examines the current state and selects a tool: search_web, search_arxiv, search_pubmed, search_semantic_scholar, search_local_docs, or synthesize. The key architectural insight is that the graph has cycles — the agent can return to search nodes multiple times — unlike a pipeline which moves strictly forward. The termination condition is the LLM's own judgment that sufficient evidence exists, optionally bounded by a maximum iteration count to prevent runaway loops.
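The cyclic state machine described above can be sketched in a few lines. The router function below is a stand-in for the LLM's judgment, and the tool names mirror the node names in the text; the implementation is illustrative, not LDR's actual graph.

```python
# Minimal sketch of an agent loop with cycles: the router (standing in for
# the LLM) examines accumulated state and either picks another search node
# or terminates by choosing "synthesize". A hard iteration cap bounds the loop.

MAX_ITERATIONS = 8  # prevents runaway loops

def run_agent(query: str, router, tools: dict) -> dict:
    state = {"query": query, "results": [], "coverage": 0.0}
    for _ in range(MAX_ITERATIONS):
        action = router(state)          # LLM-as-router examines current state
        if action == "synthesize":      # termination: evidence judged sufficient
            break
        state["results"].extend(tools[action](state["query"]))
        state["coverage"] = min(1.0, len(state["results"]) / 10)
    return state

# Stub router: keep searching the web until coverage passes a threshold.
def stub_router(state):
    return "search_web" if state["coverage"] < 0.5 else "synthesize"

stub_tools = {"search_web": lambda q: [f"result for {q}"]}
final = run_agent("What is LDR?", stub_router, stub_tools)
```

The cycle back to search nodes, gated by the router's assessment rather than a fixed step count, is the structural difference from a pipeline.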
To use it, select langgraph-agent from the strategy dropdown. This is not the default — LDR ships with a simpler pipeline strategy enabled to ensure broad compatibility.
A common agentic pattern is ReAct (Reason + Act): the LLM alternates between reasoning about what to do and executing a tool call, in a flat loop. LangGraph extends this by introducing graph structure — explicit nodes with defined transitions, conditional edges, and shared state. This allows more complex workflows: parallel tool calls, sub-graphs for specialized tasks, and checkpointing for resumption. In LDR's case, the graph includes nodes for each search engine, a synthesis node, and conditional edges that route based on the LLM's assessment of result quality and topical coverage. The graph structure also makes the workflow inspectable and debuggable in ways that a flat ReAct loop is not.
SimpleQA is an evaluation benchmark developed by OpenAI consisting of short, factual questions with verifiable answers. A 95% score indicates that LDR, when paired with GPT-4.1-mini, can answer straightforward factual queries with high reliability. This is notable because the system is performing multi-step retrieval and synthesis, not simply recalling training data. The benchmark validates that LDR's search orchestration and synthesis pipeline faithfully preserves factual accuracy from source material. However, SimpleQA tests relatively simple factual recall — it does not measure the system's ability to handle nuanced, multi-faceted research questions where the agent strategy's adaptive search becomes most valuable.
LDR's power comes from its ability to federate searches across fundamentally different information ecosystems. A web search engine, an arXiv query, and a PubMed search are not interchangeable — they have different query syntaxes, different result structures, and different strengths. Web search excels at recency and breadth; arXiv provides preprints before peer review; PubMed offers structured biomedical literature with MeSH term indexing; Semantic Scholar provides citation graph analysis.
Configuring these sources correctly is the difference between a research assistant that returns superficial web summaries and one that produces literature-review-quality synthesis. Each source has parameters that affect result quality: the number of results to retrieve, whether to fetch full text or abstracts only, and how to handle rate limits.
We will use our AI agent to generate a configuration that balances thoroughness with performance, tailored to a specific research domain.
The agent should describe the configuration approach: for the LangGraph agent strategy, source selection is delegated to the LLM, so the configuration specifies available sources and their parameters rather than a fixed query order. A reasonable configuration retrieves 10–15 results per source, enables full-text fetching for arXiv (PDFs are freely available), uses abstract-only mode for PubMed (full text requires institutional access for many journals), and sets Semantic Scholar to return citation counts for relevance ranking. The agent should note that SearXNG configuration (at http://localhost:8080/preferences) controls which upstream search engines are active for web queries, and recommend enabling at least Google Scholar alongside general web engines.
An API key for Semantic Scholar, which raises its rate limits, can be requested at semanticscholar.org/product/api. SearXNG requires no API keys — it scrapes search engines directly.
LDR can be configured both through the web UI at http://localhost:5000 and through a TOML configuration file in the data volume. For reproducible setups, prefer the configuration file; for experimentation, use the web UI.
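A configuration file implementing the parameters described above might look like this sketch. The table and key names here are hypothetical, so check LDR's documentation for the actual schema before using it.

```toml
[search.arxiv]
max_results = 15
fetch_full_text = true     # PDFs are freely available

[search.pubmed]
max_results = 15
fetch_full_text = false    # abstract-only; full text is often paywalled

[search.semantic_scholar]
max_results = 10
include_citation_counts = true   # used for relevance ranking

[search.web]
searxng_url = "http://searxng:8080"   # service name on the Docker network
max_results = 10
```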
SearXNG is a meta-search engine that operates by issuing HTTP requests to search engine frontends (Google, Bing, DuckDuckGo, etc.) and parsing the HTML responses — essentially automating what a human would do in a browser. This means it requires no API keys and incurs no per-query costs, but it is subject to rate limiting and CAPTCHAs if query volume is high. Running SearXNG locally as part of the Docker stack gives LDR a privacy-preserving search capability: your queries go to SearXNG on localhost, which fans out to search engines from your server's IP, and results return without any search engine knowing that an AI system is the ultimate consumer of the results.
The most distinctive feature of LDR is its personal knowledge base — an encrypted, locally-stored library of documents that grows with each research session. This transforms LDR from a stateless question-answering tool into a compounding research assistant: each session deposits sources into the library, and future sessions search the library alongside live sources.
The knowledge base operates on a straightforward pipeline: documents (PDFs, web pages, articles) are downloaded, their text is extracted, the text is chunked into passages, each passage is embedded into a vector representation, and the vectors are stored in an indexed database. When you query the knowledge base, your question is similarly embedded and matched against stored passages by vector similarity.
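The pipeline can be made concrete with a toy version: chunk text, embed each chunk, store the vectors, and retrieve by similarity. A real deployment uses a trained embedding model; here a bag-of-words vector stands in so the flow is visible end to end.

```python
# Toy knowledge-base pipeline: chunk -> embed -> store -> query by similarity.
# The bag-of-words "embedding" is a deliberate simplification; real systems
# use a neural embedding model.

from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def chunk(text: str, size: int = 8) -> list:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Ingest: chunk a document and store (chunk, vector) pairs.
document = ("GLP-1 receptor agonists lower blood glucose. "
            "CRISPR enables targeted gene editing in mammalian cells. "
            "SearXNG federates queries across many search engines.")
index = [(c, embed(c)) for c in chunk(document)]

# Query: embed the question and rank stored chunks by similarity.
def search(question: str, k: int = 1) -> list:
    qv = embed(question)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

best = search("gene editing with CRISPR")[0]
```

Even this crude version retrieves the gene-editing passage rather than the diabetes or search-engine passages, because ranking operates on vector overlap rather than exact phrase matching.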
Critically, the entire database is encrypted with AES-256 via SQLCipher. Each user gets an isolated database. There is no password recovery mechanism — this is a deliberate design choice that ensures true zero-knowledge security. Even someone with physical access to the server cannot read the data without the user's passphrase.
We will now use our agent to design a workflow for systematically populating this knowledge base from a research session.
The agent should describe the workflow in three phases. First, curation: after a research session, review the cited sources in the report and select those with lasting reference value — peer-reviewed papers, authoritative reports, primary data sources — while skipping ephemeral news articles or redundant sources. Second, ingestion: use LDR's download-and-index feature (available in the web UI for each cited source) to fetch the document, extract text, chunk it into passages of approximately 500–1000 tokens, embed each chunk using the configured embedding model, and store the vectors in the SQLCipher database. Third, verification: run a targeted query that should match the newly added document — for example, if you added a paper on GLP-1 agonists, query 'GLP-1 receptor binding affinity' and confirm the paper appears in the local document results. The agent should note that retrieval stays fast as the library grows when the vector store uses an approximate nearest-neighbor index (query time is then sublinear in collection size), though very large libraries (thousands of documents) benefit from periodic re-indexing.
If you set LDR_BOOTSTRAP_ALLOW_UNENCRYPTED=true during initial setup (common for troubleshooting), your knowledge base is stored in plain SQLite without encryption. For any research involving sensitive or proprietary information, ensure SQLCipher is properly configured before adding documents.
When a document is added to the knowledge base, each text chunk is converted into a high-dimensional vector (typically 384 or 768 dimensions) by an embedding model. These vectors capture semantic meaning: passages about similar topics produce vectors that are close together in the embedding space, even if they use different words. At query time, your question is embedded using the same model, and the system finds stored vectors with the highest cosine similarity to the query vector. This is why the knowledge base can find relevant passages even when the exact keywords differ — the search operates on meaning, not string matching. The trade-off is that embedding quality depends entirely on the model used; a small, general-purpose embedding model may not capture domain-specific nuances as well as a specialized one.
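Cosine similarity itself is simple arithmetic: the dot product of two vectors divided by the product of their lengths. A tiny numeric example with hand-made 3-dimensional vectors (real embeddings have hundreds of dimensions, and the values below are invented for illustration):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(x * x for x in b)))

# Hand-made vectors standing in for real embeddings. Passages about related
# topics are imagined to land near each other in the space.
v_gene_editing = [0.9, 0.1, 0.2]
v_crispr       = [0.8, 0.2, 0.3]
v_stock_market = [0.1, 0.9, 0.1]

sim_related   = cosine(v_gene_editing, v_crispr)        # high: same topic
sim_unrelated = cosine(v_gene_editing, v_stock_market)  # low: different topic
```

The ranking depends only on relative similarity, which is why wording differences between the query and the stored passage do not matter as long as the embedding model maps them to nearby vectors.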
Consider two scenarios. In the first, a researcher asks LDR about CRISPR gene editing today, gets a report, and discards it. Six months later, they ask about CRISPR delivery mechanisms — LDR starts from scratch, re-searching the same foundational sources. In the second scenario, the researcher downloads the key papers from the first session into the knowledge base. Six months later, the delivery mechanism query automatically retrieves relevant passages from those stored papers alongside new web results. The synthesis is richer because it connects current findings to the researcher's prior reading. Over dozens of sessions, the knowledge base becomes a personalized research corpus — a curated, searchable subset of the literature that reflects the researcher's specific interests and prior investigations.
I need a system prompt for Local Deep Research that transforms its output from a general synthesis into a structured literature review. The review should have five sections: (1) Research Question — a precise restatement of the query, (2) Consensus Findings — claims supported by multiple independent sources with citation counts, (3) Contested or Preliminary Findings — claims supported by only one source or where sources disagree, with explicit notation of the disagreement, (4) Methodological Notes — any limitations in the cited studies that affect confidence (sample size, study design, conflict of interest), and (5) Open Questions — specific gaps in the literature that the search revealed. The prompt should instruct the LLM to use at least 8 sources before synthesizing, to prefer peer-reviewed sources over preprints and preprints over news articles, and to flag any claim that rests on a single source. Format the output in markdown with each section clearly headed.
Deploying a research assistant is straightforward; knowing whether to trust its output is the harder problem. LDR's ~95% SimpleQA score is encouraging for factual queries, but real research questions are rarely simple factual lookups. They involve synthesis, judgment about source quality, and recognition of uncertainty — none of which SimpleQA measures.
A disciplined evaluation approach requires three dimensions. Factual accuracy: are the specific claims in the report verifiable against the cited sources? Source coverage: did the agent find the most relevant and authoritative sources, or did it settle for whatever appeared first? Synthesis quality: does the report merely concatenate source summaries, or does it identify patterns, contradictions, and implications across sources?
We will use our agent to design a lightweight evaluation rubric that you can apply to any LDR output, allowing you to calibrate your trust in the system over time and identify systematic weaknesses.
The agent should produce a rubric with three dimensions, each scored on a 1–5 scale. Factual Accuracy (1–5): select three specific claims from the report, locate them in the cited sources, and verify. Score 5 if all three are accurately represented with appropriate nuance; score 1 if any claim is fabricated or materially misrepresented. Source Coverage (1–5): assess whether the sources span multiple databases (web, arXiv, PubMed), include both recent and foundational works, and represent diverse perspectives. Score 5 for comprehensive, multi-source coverage; score 1 for reliance on a single source type. Synthesis Quality (1–5): check whether the report identifies cross-source patterns, notes contradictions, and draws conclusions that no single source states explicitly. Score 5 for genuine analytical synthesis; score 1 for sequential source summaries with no integration. The rubric should include a note that scores below 3 on any dimension warrant re-running the query with a different strategy or more explicit instructions.
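The rubric above is easy to operationalize as a small record type, so scores can be logged per report and systematic weaknesses tracked over time. The dimension names follow the rubric; the class itself is a sketch, not part of LDR.

```python
# Sketch: the three-dimension rubric as a data structure, with the
# "score below 3 warrants a re-run" rule encoded as a flagging method.

from dataclasses import dataclass

@dataclass
class RubricScore:
    factual_accuracy: int   # 1-5: spot-checked claims vs. cited sources
    source_coverage: int    # 1-5: breadth across databases and perspectives
    synthesis_quality: int  # 1-5: integration vs. sequential summaries

    def flags(self) -> list:
        """Dimensions scoring below 3 warrant re-running the query."""
        return [name for name, score in [
            ("factual_accuracy", self.factual_accuracy),
            ("source_coverage", self.source_coverage),
            ("synthesis_quality", self.synthesis_quality),
        ] if score < 3]

score = RubricScore(factual_accuracy=4, source_coverage=2, synthesis_quality=3)
needs_rerun = score.flags()  # → ["source_coverage"]
```

Logging a few of these per week is enough to reveal patterns, for example a consistently low source-coverage score suggesting the SearXNG engine list needs broadening.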
The SimpleQA benchmark measures a system's ability to answer short, factual questions with verifiable answers — questions like 'What year was the transistor invented?' or 'What is the capital of Bhutan?' A 95% score on this benchmark tells you that LDR's retrieval and synthesis pipeline faithfully preserves factual information from sources. It does not tell you whether the system can identify the most authoritative sources, handle ambiguity, recognize when a question has no clear consensus answer, or produce analysis that goes beyond what any single source contains. These are the capabilities that matter for genuine research, and they require human evaluation — at least until we have benchmarks sophisticated enough to measure them.
In this workshop, we deployed a fully local, privacy-preserving AI research assistant using Local Deep Research, Ollama, and SearXNG. We examined the architectural distinction between pipeline-based research strategies — which follow a fixed sequence of search and synthesis steps — and the LangGraph agent strategy, where the language model autonomously navigates a graph of specialized search tools based on its evolving understanding of the question.
We configured multi-source search orchestration across web, arXiv, PubMed, and Semantic Scholar, understanding that source diversity matters more than source quantity for research reliability. We built a personal knowledge base using LDR's encrypted document indexing, establishing a workflow where each research session deposits curated sources into a vector-indexed library that enriches all future queries.
Finally, we developed an evaluation rubric for assessing research output quality — recognizing that benchmark scores like SimpleQA's 95% validate factual accuracy on simple queries but do not measure the synthesis depth and source coverage that distinguish useful research from superficial summarization.
The system we built is not merely a question-answering tool. It is a research infrastructure that compounds in value over time: each session adds to the knowledge base, each evaluation calibrates your trust in the output, and each prompt refinement improves the quality of future reports.