Workshop

Deploy Onyx: Your Own Open-Source AI Platform with Agentic RAG and MCP

Stand up a self-hosted AI workspace with RAG pipelines, custom agents, and 50+ connectors in one command
35 min · open-source · RAG · MCP · self-hosted · agents

What's happening

Onyx (formerly Danswer) has emerged as one of the most capable open-source AI platforms available, consolidating retrieval-augmented generation, deep research, code execution, and tool integration into a single deployable stack. Its trajectory reflects a broader shift in the AI tooling landscape: organizations increasingly want to own their AI infrastructure rather than rent it from a SaaS provider. The reasons are straightforward — data sovereignty, cost control, and the ability to customize every layer of the system.

What makes Onyx particularly notable is its architecture. Rather than being a thin wrapper around an LLM API, it is a full application layer: a hybrid search index (combining vector and keyword retrieval via Vespa), a background processing pipeline for ingesting documents from 50+ connectors, a model server for running local embedding and reranking models, and an agent framework that ties it all together. It supports every major LLM provider — from self-hosted options like Ollama to proprietary APIs from Anthropic, OpenAI, and Google — meaning the same platform can serve a privacy-focused team running everything locally and an enterprise using Claude or GPT-4o.

In this session, we will deploy Onyx locally, configure an LLM provider, connect a knowledge source, and build a custom agent that retrieves and reasons over that data. The goal is not merely to follow a procedure, but to understand the architectural decisions that make a platform like this work — and to develop the prompting skills needed to have an AI agent guide us through each stage.

1

Understand the Onyx Architecture Before Deploying

Before running any deployment command, it is worth understanding what we are actually standing up. Onyx is not a single application — it is an orchestrated set of services, each with a distinct role. A deployment command that appears simple on the surface (a single curl | bash) actually provisions an entire distributed system on your machine.

The core services are:

  • API Server — a FastAPI backend that handles chat requests, coordinates LLM calls, and retrieves documents from the search index.
  • Web Server — a Next.js frontend providing the user interface.
  • Background Worker — a process that runs connector sync jobs, fetches documents from external sources, generates embeddings, and indexes them into the search engine.
  • Model Server — runs local deep learning models (embedding and reranking) used during both indexing and retrieval.
  • Vespa — the search engine that powers Onyx's hybrid retrieval, combining dense vector search with traditional keyword matching.
  • PostgreSQL — stores metadata, user accounts, and configuration.
  • Redis — manages the task queue for background jobs.
  • Nginx — reverse proxy that routes traffic to the appropriate backend service.

This architecture reflects a fundamental design principle in RAG systems: retrieval quality depends on the indexing pipeline as much as the generation model. Onyx runs its own embedding models so that document chunks are encoded in a consistent vector space, and Vespa's hybrid index ensures that both semantic similarity and exact keyword matches contribute to retrieval.

Onyx also offers a Lite deployment mode — a stripped-down configuration that omits the model server, Vespa, and background workers. Lite mode is essentially a chat UI with agent capabilities but without local RAG. It requires under 1 GB of memory and is useful for quickly testing the interface or for teams that only need chat and tool-calling features without document retrieval.

Ask your agent
Ask your AI agent to explain the trade-offs between Onyx Standard and Onyx Lite deployments and help you determine which is appropriate for your machine.
Think about it
  • What are the key resources on your machine — how much RAM, how many CPU cores, and how much free disk space do you have?
  • Which capabilities do you actually need: just chat and agents, or full document retrieval with RAG?
  • What happens if you deploy Standard on a machine with insufficient resources — which service is likely to fail first?
What the agent gives back

The agent should produce a clear comparison of the two deployment modes, noting that Standard requires approximately 16 GB of RAM (Vespa alone consumes 4–6 GB, the model server another 2–4 GB) while Lite operates comfortably under 1 GB. It should identify Vespa as the most resource-intensive component and recommend Standard for this workshop since RAG is a central learning objective. If your machine has fewer than 16 GB of RAM, the agent should suggest either closing other applications or using Lite mode with the understanding that connector and RAG steps will not apply.

Tip
If you do not already have Docker and Docker Compose V2 installed, do so before proceeding. Onyx requires Docker Engine 20+ with Compose V2. On macOS or Windows, Docker Desktop satisfies both requirements.
Warning
Onyx Standard's memory footprint is substantial. On a machine with 16 GB of total RAM, expect the system to feel sluggish during initial indexing when the model server is generating embeddings. Close unnecessary applications before deploying.
Why does Onyx run its own embedding models instead of using the LLM provider's API?

There are two reasons. First, embedding models and chat models serve fundamentally different purposes. Embedding models encode text into fixed-dimensional vectors optimized for similarity search; they are small, fast, and deterministic. Using a large chat model for this task would be orders of magnitude more expensive and slower. Second, by running its own embedding model, Onyx ensures that the vector space is consistent across all indexed documents regardless of which chat LLM the user selects. You can switch from Claude to GPT-4o without re-indexing your entire document corpus — the embeddings remain valid because they were generated by Onyx's own model server.
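To see why a small, deterministic similarity model is sufficient for retrieval, here is a minimal cosine-similarity sketch. The vectors and document names are toy values, not real embeddings — real models produce hundreds of dimensions — but the principle is the same: chunks embedded once remain comparable to any future query embedded by the same model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two fixed-dimensional vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Chunks indexed once by the embedding model server; these vectors
# stay valid no matter which chat LLM is configured later.
indexed = {
    "refund policy": [0.9, 0.1, 0.2],
    "api rate limits": [0.1, 0.8, 0.4],
}
query_vec = [0.85, 0.15, 0.25]  # toy embedding of "how do refunds work?"

best = max(indexed, key=lambda doc: cosine_similarity(query_vec, indexed[doc]))
print(best)  # → refund policy
```

Swapping the chat LLM changes nothing here: the stored vectors and the query encoder are both owned by the model server, so the comparison stays valid.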

2

Deploy Onyx and Configure an LLM Provider

With the architecture understood, we can now deploy. Onyx provides a single-command installer that handles pulling Docker images, generating configuration files, and launching all services via Docker Compose.

The deployment itself is straightforward, but the critical configuration step that follows — connecting an LLM provider — is where architectural understanding pays off. Onyx separates the retrieval layer (embeddings, search, reranking) from the generation layer (the chat LLM). This means you must configure at least one LLM provider for the generation layer, but the retrieval layer works independently using Onyx's built-in model server.

Onyx supports three categories of LLM provider:

  1. Self-hosted — Ollama, LiteLLM, vLLM, or any OpenAI-compatible endpoint. These give you full data sovereignty but require local compute.
  2. Proprietary API — Anthropic (Claude), OpenAI (GPT), Google (Gemini). These offer the strongest models but require API keys and send data externally.
  3. Hybrid — You can configure multiple providers simultaneously and assign different models to different agents. A sensitive-data agent might use a local Ollama model while a general-purpose agent uses Claude.

Provider configuration happens in the Onyx admin panel after the first launch. The admin panel is accessible at the root URL of your deployment (typically http://localhost:80 for Standard or http://localhost:3000 for Lite). On first access, you will create an admin account and be guided through initial setup, which includes selecting an LLM provider.

Ask your agent
Ask your AI agent to walk you through deploying Onyx Standard locally and configuring your preferred LLM provider (Ollama for fully local, or Anthropic/OpenAI for cloud-hosted models).
Think about it
  • What prerequisites does the deployment script expect — is Docker running, is the required port (80) available?
  • If you choose Ollama, what additional setup is needed before Onyx can reach it? Consider how Docker containers communicate with host services.
  • What is the significance of the `host.docker.internal` address, and on which operating systems does it work natively?
  • After running the install command, how will you verify that all services started successfully?
What the agent gives back

The agent should provide the one-line install command (curl -fsSL https://onyx.app/install_onyx.sh | bash) and explain what it does: clones the repository, generates environment configuration, and runs docker compose up. For Ollama users, it should note that the Ollama endpoint must be set to http://host.docker.internal:11434 so the Onyx container can reach the host machine's Ollama server, and that a model (e.g., llama3) must already be pulled. For Anthropic or OpenAI, it should explain that you will enter your API key and select a model in the admin panel's setup wizard. The agent should suggest verifying deployment with docker compose ps to confirm all containers show a healthy status.

API Key Note
If using Anthropic or OpenAI, you will need an active API key. Onyx stores provider credentials in its PostgreSQL database, not in environment files — so they are configured through the admin UI, not by editing .env files directly (though environment variable overrides are supported for automated deployments).
Tip
You can configure multiple LLM providers and switch between them per-agent. This is useful for comparing model quality or for assigning cost-appropriate models to different tasks.
What does the hybrid search index actually do during retrieval?

When a user asks a question, Onyx's retrieval pipeline executes two parallel searches against Vespa: a dense vector search (using the query's embedding to find semantically similar document chunks) and a sparse keyword search (using BM25-style term matching). The results from both searches are merged and reranked by a cross-encoder model running on the model server. This hybrid approach addresses a well-known limitation of pure vector search: it struggles with exact terms, proper nouns, and acronyms that carry high information density but may not be semantically close to any training data. By combining both retrieval signals, Onyx achieves substantially higher recall than either method alone.
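The merge step can be sketched with reciprocal rank fusion, one common technique for combining ranked lists. Onyx's actual merging and reranking logic may differ, and the document IDs here are made up — this is a sketch of the idea, not the implementation:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one, rewarding documents
    that appear near the top of any list."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Dense vector search surfaces semantically similar chunks; sparse
# keyword search surfaces exact-term matches (acronyms, proper nouns).
dense_results = ["doc_a", "doc_b", "doc_c"]
sparse_results = ["doc_c", "doc_a", "doc_d"]

merged = reciprocal_rank_fusion([dense_results, sparse_results])
print(merged)  # doc_a and doc_c rank highest: both lists agree on them
```

In the real pipeline, a cross-encoder reranker then rescores this merged list before the top chunks reach the LLM's prompt.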

At this point, you should have Onyx running locally with all containers healthy (verify with `docker compose ps`), an admin account created, and at least one LLM provider configured. You should be able to open the Onyx web interface and send a basic chat message that receives a response from your configured model. If any container is in a restart loop, check its logs with `docker compose logs <service-name>` — the most common issues are port conflicts (another service on port 80) and insufficient memory (Vespa failing to start).
3

Connect a Knowledge Source via a Connector

A running Onyx instance without connected knowledge sources is functionally equivalent to a chat UI — useful, but not leveraging the platform's core capability. The power of Onyx lies in its ability to ingest, index, and retrieve information from your organization's actual data sources.

Onyx provides over 50 connectors, organized into categories:

  • Collaboration and documents — Google Drive, Confluence, Notion, SharePoint, Dropbox
  • Code and engineering — GitHub (repositories, issues, pull requests), GitLab, Jira, Linear
  • Communication — Slack, Microsoft Teams, Discord
  • Support and CRM — Zendesk, Salesforce, HubSpot, Intercom
  • Web and files — Website scraping (via sitemap or URL list), direct file upload (PDF, DOCX, TXT), S3 or GCS buckets

Each connector follows the same lifecycle: authenticate, specify scope (which repositories, channels, or folders to index), set a sync schedule, and let the background worker handle ingestion. The background worker fetches documents, splits them into chunks, generates embeddings via the model server, and indexes the chunks into Vespa. Depending on the volume of data, initial indexing can take minutes to hours.
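The lifecycle above can be sketched end to end in miniature. The chunker, embedder, and list below are toy stand-ins for the background worker, model server, and Vespa index, respectively:

```python
def chunk(text: str, max_chars: int = 200) -> list[str]:
    """Split a document into paragraph-based chunks (stand-in for
    Onyx's structure-aware chunker)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p[:max_chars] for p in paragraphs]

def embed(chunk_text: str) -> list[float]:
    """Toy embedding: a 26-dim letter-frequency vector (stand-in for
    the model server's real embedding model)."""
    vec = [0.0] * 26
    for ch in chunk_text.lower():
        if ch.isalpha() and ch.isascii():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

index: list[dict] = []  # stand-in for the Vespa index

def ingest(doc_id: str, text: str) -> int:
    """Fetch -> chunk -> embed -> index, as the background worker does."""
    for i, c in enumerate(chunk(text)):
        index.append({"doc_id": doc_id, "chunk_id": i,
                      "text": c, "embedding": embed(c)})
    return len(index)

total = ingest("handbook", "PTO policy: 20 days per year.\n\nRemote work is allowed.")
print(total)  # → 2
```

Each stage is independent, which is why a slow model server shows up as indexing latency rather than as a chat failure.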

A critical concept here is Document Sets — curated subsets of your indexed documents. After connecting multiple sources, you can create Document Sets that group related content (e.g., "Engineering Documentation" combining GitHub repos and Confluence spaces). These sets become the knowledge boundaries you assign to specific agents.

For this workshop, we will use the simplest connector: file upload. This requires no external credentials and provides immediate results.

Ask your agent
Ask your AI agent to guide you through uploading a set of documents (PDFs, text files, or markdown files) to Onyx and verifying that they are indexed and searchable.
Think about it
  • What types of files does Onyx's file upload connector accept, and are there size limitations?
  • After uploading, how long should you expect to wait before the documents appear in search results? What determines this latency?
  • How would you verify that the indexing pipeline worked correctly — what should you search for, and what does a successful retrieval look like?
  • What is the role of Document Sets, and why might you want to create one even if you have only a single connector?
What the agent gives back

The agent should explain the file upload process: navigate to Admin Panel > Connectors > File, upload one or more files (PDF, DOCX, TXT, MD are all supported), and wait for the background worker to process them. It should note that processing time depends on file size and the model server's throughput — a handful of small documents should complete within a minute or two. To verify, navigate to the main chat interface and ask a question that can only be answered from the uploaded documents. A successful response will include citation markers linking back to specific document chunks. The agent should also recommend creating a Document Set for the uploaded files, explaining that this allows you to scope future agents to only this knowledge.

Tip
For a more realistic test, upload a technical document you are already familiar with — internal documentation, a product spec, or a research paper. This makes it easy to evaluate whether the RAG pipeline retrieves relevant passages and whether the LLM synthesizes them accurately.
How does Onyx's chunking strategy affect retrieval quality?

Document chunking is one of the most consequential decisions in any RAG system. Chunks that are too small lose context — a paragraph fragment may not contain enough information to answer a question on its own. Chunks that are too large waste the LLM's context window and dilute the signal with irrelevant text. Onyx uses a hierarchical chunking approach that respects document structure (headings, sections, paragraph boundaries) rather than naively splitting on a fixed character count. During retrieval, the cross-encoder reranker evaluates each chunk's relevance to the query, and only the top-ranked chunks are included in the LLM's prompt. This two-stage process — broad retrieval followed by precision reranking — is what allows Onyx to maintain quality even with large document corpora.
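A minimal sketch of heading-aware chunking, assuming a markdown input. Onyx's real chunker is considerably more sophisticated, but the contrast with fixed-character splitting is the point: each chunk keeps its section heading as context.

```python
def chunk_by_headings(markdown: str) -> list[dict]:
    """Split a markdown document at headings so each chunk carries its
    section context, rather than cutting at a fixed character count."""
    chunks, current_heading, current_lines = [], "Introduction", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if current_lines:
                chunks.append({"heading": current_heading,
                               "text": "\n".join(current_lines).strip()})
            current_heading, current_lines = line.lstrip("# ").strip(), []
        else:
            current_lines.append(line)
    if current_lines:
        chunks.append({"heading": current_heading,
                       "text": "\n".join(current_lines).strip()})
    return chunks

doc = "# Setup\nInstall Docker first.\n# Usage\nRun the installer.\nThen open the UI."
for c in chunk_by_headings(doc):
    print(c["heading"], "->", c["text"])
```

A fixed 40-character splitter would have cut "Run the installer." away from its "Usage" heading; the heading-aware version keeps them together, which is exactly what the reranker needs to judge relevance.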

Quick Check

You have uploaded a 200-page technical manual and created a Document Set for it. When you ask Onyx a question that should be answerable from page 47, it returns an irrelevant response citing a different section. What is the most likely cause?

  • The LLM is hallucinating and ignoring the retrieved context.
  ✗ Not quite. While hallucination is always possible, Onyx's agentic RAG framework explicitly presents retrieved chunks in the prompt and instructs the model to cite them. If citations are present but pointing to the wrong sections, the problem is upstream of the LLM.
  • Retrieval surfaced the wrong chunks because the query's phrasing does not match the relevant passage semantically.
  ✓ Correct! This is the most common failure mode in RAG systems. The embedding model may not capture the semantic relationship between your query phrasing and the document's phrasing. Solutions include rephrasing the query to use terminology from the document, or checking whether the relevant passage was chunked in a way that preserves its meaning. Hybrid search (combining vector and keyword) mitigates this, but does not eliminate it entirely.
  • The document was only partially indexed, and the relevant pages never reached the search index.
  ✗ Not quite. Partial indexing is possible but unlikely with file upload. You can verify by checking the connector status in the admin panel, which shows the number of documents and chunks indexed. If the count seems low relative to the document size, re-trigger the sync.
4

Build a Custom Agent with Scoped Knowledge and Actions

With documents indexed and searchable, we can now build a custom agent — the component that transforms a generic LLM into a specialized assistant scoped to a particular domain, knowledge base, and set of capabilities.

In Onyx, an agent is defined by four dimensions:

  1. Instructions — a system prompt that governs the agent's behavior, tone, and response structure. This is where you encode domain expertise: "You are a technical support agent for Product X. Always cite the relevant documentation section. If the user's question cannot be answered from the provided context, say so explicitly."
  2. Knowledge — one or more Document Sets that define the agent's retrieval scope. An HR agent sees only HR documents; an engineering agent sees only engineering documentation. This scoping is critical for both relevance (reducing noise in retrieval) and security (preventing information leakage across departments).
  3. Tools — the actions the agent can perform beyond generating text. Built-in tools include document search (RAG), web search, code execution, and image generation. Custom tools can be added via OpenAPI specifications or MCP server connections.
  4. Model — which LLM powers this agent. Different agents can use different models, allowing you to optimize for cost, speed, or capability.
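The four dimensions map naturally onto a configuration object. This is a hypothetical sketch of what an agent definition captures — the field names are illustrative, not Onyx's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    """Illustrative agent definition: the four dimensions from the text."""
    name: str
    instructions: str                                       # system prompt
    document_sets: list[str] = field(default_factory=list)  # knowledge scope
    tools: list[str] = field(default_factory=list)          # enabled actions
    model: str = "claude"                                   # which LLM powers it

support_agent = AgentConfig(
    name="Product X Support",
    instructions=("You are a technical support agent for Product X. "
                  "Always cite the relevant documentation section. "
                  "If the answer is not in the provided context, say so."),
    document_sets=["Product X Docs"],
    tools=["search"],  # RAG only; no web search or code execution
)
print(support_agent.name)
```

Note how every design decision from the list above — domain instructions, narrow knowledge scope, a minimal tool set, a model choice — appears as one explicit field.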

Agent creation happens through the admin panel under the Agents section. The interface is straightforward, but the design decisions — what instructions to write, which knowledge to scope, which tools to enable — are where the real work lies.

Ask your agent
Ask your AI agent to help you design and create a custom Onyx agent that serves as a domain expert for the documents you uploaded in Step 3. The agent should retrieve information from your Document Set and respond with cited answers.
Think about it
  • What should the system instructions specify about how the agent handles questions it cannot answer from the available documents?
  • How specific should the instructions be about response format — should you prescribe structure (bullet points, citations) or leave it flexible?
  • Which tools should be enabled for this agent? Consider whether web search or code execution are appropriate for a documentation-focused assistant.
  • What starter messages would help users understand the agent's capabilities and scope?
What the agent gives back

The agent should help you draft a system prompt that establishes the domain, mandates citation of sources, and defines behavior for out-of-scope questions. It should recommend enabling the Search tool (for RAG retrieval) and disabling web search and code execution unless your use case requires them — fewer tools mean less ambiguity for the LLM about which tool to invoke. It should suggest 2–3 starter messages that demonstrate the agent's intended use, such as 'What does the documentation say about [specific topic]?' The agent should walk you through the admin panel flow: navigate to Agents > Create New Agent, enter the name and instructions, attach your Document Set, select the enabled tools, and save.

Tip
System instructions are the highest-leverage configuration in any agent. A well-crafted instruction set can compensate for mediocre retrieval by telling the model exactly how to handle ambiguity, when to ask clarifying questions, and how to format responses. Invest time here.
How does Onyx's agent framework decide when to use RAG versus responding directly?

Onyx's agent framework uses the LLM itself to determine whether a query requires retrieval. When a user sends a message, the agent evaluates whether the question is likely answerable from its own training data or whether it needs to search the connected knowledge base. This is an agentic decision — the LLM examines the query and the available tools, then decides which tool (if any) to invoke. For a documentation agent with search enabled, most queries will trigger retrieval. However, a greeting like 'Hello' or a meta-question like 'What can you help me with?' will typically be handled directly. This tool-selection behavior is governed by the LLM's reasoning capabilities, which is one reason why model choice matters at the agent level.
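This decision can be sketched as a dispatch loop. The `fake_llm` function below is a crude stand-in for the real model's tool-selection reasoning, which is driven by the tool descriptions rather than a keyword list:

```python
def fake_llm(message: str, tools: list[str]) -> dict:
    """Stand-in for the LLM: decide whether to call a tool or answer
    directly. A real model reasons over the tool descriptions instead
    of this hard-coded heuristic."""
    smalltalk = ("hello", "hi there", "what can you help")
    if "search" in tools and not message.lower().startswith(smalltalk):
        return {"tool_call": "search", "arguments": {"query": message}}
    return {"text": "Hi! Ask me about the connected documentation."}

def run_agent(message: str) -> str:
    response = fake_llm(message, tools=["search"])
    if "tool_call" in response:
        chunks = ["[chunk about PTO policy]"]  # stand-in for Vespa retrieval
        return f"Answer based on {len(chunks)} retrieved chunk(s)."
    return response["text"]

print(run_agent("Hello"))                    # handled directly, no retrieval
print(run_agent("What is the PTO policy?"))  # triggers the search tool
```

The framework's job is the plumbing around the decision — presenting the tools, executing the call, feeding results back — while the decision itself belongs to the model, which is why model choice matters per-agent.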

5

Extend the Agent with MCP Tools

The Model Context Protocol (MCP) is a standardization layer that allows LLM-based agents to invoke external tools through a uniform interface. Onyx's MCP integration means that any MCP-compatible server — whether it provides access to a database, an internal API, a CI/CD pipeline, or a third-party service — can be connected to Onyx and made available to specific agents.

The mental model is straightforward: an MCP server exposes a set of tools (each with a name, description, and input schema), and Onyx discovers those tools and presents them to the LLM as callable functions. When the LLM decides to invoke a tool, Onyx handles the protocol communication — sending the request to the MCP server and returning the result to the LLM for incorporation into its response.

MCP servers come in two transport flavors:

  • SSE (Server-Sent Events) — the MCP server runs as a web service with an HTTP endpoint. Onyx connects to it over the network. This is the typical choice for shared or remote tools.
  • Stdio — the MCP server runs as a local process, communicating over standard input/output. This is common for development and single-machine setups.

Configuring MCP in Onyx involves two steps: first, register the MCP server in the admin panel (under Tools); second, enable specific MCP tools on the agents that should have access to them.

This architecture is powerful because it decouples the intelligence layer (the LLM) from the capability layer (what the agent can do). You can add new capabilities to an existing agent simply by connecting a new MCP server — no code changes to Onyx itself are required.
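MCP tools are self-describing: each carries a name, a description, and a JSON Schema for its inputs. Here is a sketch of what a server's advertised tool list and its call dispatch might look like — the `query_database` tool is hypothetical, and a real MCP server would speak the protocol's JSON-RPC framing rather than plain function calls:

```python
import json

# What a server might advertise in response to a tools/list request:
# name, description, and a JSON Schema for the inputs. This metadata
# is what lets the LLM reason about when and how to call the tool.
TOOLS = [{
    "name": "query_database",
    "description": "Run a read-only SQL query against the reporting database.",
    "inputSchema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}]

def handle_tool_call(name: str, arguments: dict) -> str:
    """Server-side dispatch: validate the call against the advertised schema."""
    tool = next((t for t in TOOLS if t["name"] == name), None)
    if tool is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    missing = [k for k in tool["inputSchema"]["required"] if k not in arguments]
    if missing:
        return json.dumps({"error": f"missing arguments: {missing}"})
    # A real server would execute the (read-only!) query here.
    return json.dumps({"rows": [], "sql": arguments["sql"]})

print(handle_tool_call("query_database", {"sql": "SELECT 1"}))
```

Because the schema travels with the tool, Onyx can discover new capabilities at registration time without any code changes — exactly the decoupling described above.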

Ask your agent
Ask your AI agent to explain how you would connect an MCP server to Onyx and assign its tools to your custom agent. Use a concrete example — such as an MCP server that provides access to a SQLite database or a filesystem.
Think about it
  • Where in the Onyx admin panel is MCP server configuration located, and what information do you need to provide?
  • How does the LLM know what an MCP tool does — what metadata does the MCP protocol expose for each tool?
  • What security considerations arise when giving an LLM-powered agent access to external tools that can read or modify data?
  • How would you test that the MCP connection is working before making the tool available to end users?
What the agent gives back

The agent should describe the two-step process: first, add the MCP server in the admin panel's Tools section by providing the server URL (for SSE) or command (for stdio), then navigate to the agent configuration and enable the newly discovered tools. It should explain that MCP tools are self-describing — each tool provides a name, description, and JSON schema for its inputs — so the LLM can reason about when and how to use them. For security, it should note that Onyx supports authentication options for MCP connections and that agents with tool access should be carefully scoped (e.g., a read-only database tool is safer than one with write access). Testing should involve asking the agent a question that requires using the MCP tool and verifying that the tool invocation appears in the response metadata.

Warning
MCP tools that modify external state (writing to a database, sending messages, triggering deployments) require careful consideration. The LLM may invoke tools unexpectedly if its instructions are ambiguous. Always test with read-only tools first, and add confirmation steps for destructive actions.
How does MCP compare to OpenAPI-based tool integration?

Onyx supports both MCP and OpenAPI specification-based tool integration. The key difference is in the protocol layer. OpenAPI tools are defined by a static specification document (a JSON or YAML file describing REST endpoints); Onyx translates each endpoint into a callable tool. MCP tools are dynamically discovered from a running server and communicate through a standardized bidirectional protocol. In practice, OpenAPI integration is simpler for existing REST APIs that already have a specification, while MCP is more flexible for tools that need stateful interactions, streaming responses, or complex authentication flows. Both approaches result in the same end-user experience: the LLM can invoke the tool during a conversation.

You should now have a fully functional Onyx deployment with: (1) all services running and healthy, (2) an LLM provider configured, (3) documents uploaded and indexed, (4) a custom agent with scoped knowledge and search enabled, and (5) an understanding of how MCP extends agent capabilities. Test your agent by asking it several questions — some answerable from your documents, some outside its scope — and observe how it handles each case. Pay attention to citation quality and tool invocation patterns.

Your Turn

Create a second custom agent in Onyx that serves a different purpose from your first — for example, a code review assistant, a meeting summarizer, or an onboarding guide. Configure it with distinct instructions, a different Document Set (upload new documents if needed), and a different combination of enabled tools.
The ability to design purpose-specific agents is one of Onyx's most valuable features. Each agent is defined by the intersection of its instructions, knowledge scope, and available tools. A well-designed set of agents transforms a general-purpose AI platform into a suite of specialized assistants, each optimized for a particular workflow. This exercise tests your ability to translate a use case into a concrete agent configuration.
Think about it
  • What specific role should this agent fill — what questions should it answer, and what actions should it perform?
  • How do the system instructions need to differ from your first agent? Consider tone, response format, and handling of edge cases.
  • Should this agent have access to the same Document Set, a different one, or multiple sets? What are the implications of broader versus narrower knowledge scope?
  • Would enabling additional tools (web search, code execution) enhance this agent's utility, or would it introduce unnecessary complexity?
See a sample prompt
One way you could prompt it
I need to create a custom agent in Onyx that serves as a new-employee onboarding assistant. Here is what I need:

1. System instructions that establish the agent as a friendly but precise onboarding guide. It should answer questions about company policies, benefits, and procedures based only on the provided documentation. If a question falls outside the available documents, it should direct the employee to HR rather than speculate.

2. I have uploaded our employee handbook (PDF) and IT setup guide (markdown) as a Document Set called 'Onboarding Docs'. The agent should be scoped to only this set.

3. Enable the Search tool for RAG retrieval. Disable web search, code execution, and image generation — they are not relevant for this use case.

4. Suggest 3 starter messages that a new employee would find immediately useful, such as 'How do I set up my development environment?' or 'What is the PTO policy?'

5. Use our default LLM provider (Claude) for this agent.

Walk me through creating this in the Onyx admin panel and help me refine the system instructions.

Quick Check

You are designing an Onyx agent for a support team that handles both public product documentation and confidential internal incident reports. How should you configure the agent's knowledge scope?
  • Attach both Document Sets to a single agent so it can answer any support question.
  ✗ Not quite. This exposes confidential incident reports to all users of the agent, including those who should not have access to internal data. Document Set scoping is a security boundary, not merely a relevance filter. The LLM will faithfully retrieve and cite confidential documents if they are semantically relevant to the query.
  • Create two agents: a public-facing agent scoped to the product documentation, and a restricted agent scoped to the incident reports.
  ✓ Correct! This approach uses Onyx's access control model correctly. The public agent is available to all team members and customers; the internal agent is restricted to authorized staff. Each agent retrieves only from its scoped Document Set, enforcing a clean separation between public and confidential data. This is the recommended pattern for any deployment where documents have different access levels.
  • Attach both Document Sets to one agent, but instruct it in the system prompt never to reveal confidential information.
  ✗ Not quite. System instructions are guidelines for the LLM, not security controls. They can be circumvented through prompt injection or simply ignored when the model judges the instruction to be in tension with being helpful. Never rely on system instructions as an access control mechanism — use architectural boundaries (separate agents, separate Document Sets, RBAC) instead.

Recap

In this session, we deployed Onyx — a full-featured open-source AI platform — and progressed from understanding its multi-service architecture through to building custom agents with scoped knowledge and tool access.

We began by examining the architectural separation between Onyx's retrieval layer (Vespa hybrid search, embedding model server, background indexing workers) and its generation layer (pluggable LLM providers). This separation is a key design decision: it allows you to change LLM providers without re-indexing documents, and it ensures retrieval quality is controlled by dedicated models optimized for that task.

We deployed the platform, configured an LLM provider, and connected a knowledge source through the file upload connector — the simplest entry point into Onyx's 50+ connector ecosystem. We then built a custom agent defined by four dimensions: instructions, knowledge scope, tools, and model selection. Finally, we examined how MCP integration extends agent capabilities by connecting external tools through a standardized protocol.

The central insight of this session is that a self-hosted AI platform is not merely a private ChatGPT alternative. It is an infrastructure decision — one that gives you control over the data pipeline, the retrieval strategy, the model selection, and the tool ecosystem. Onyx packages these concerns into a coherent, deployable system, but the design decisions about how to configure it — what knowledge to scope, what instructions to write, what tools to enable — remain squarely in your hands.

Where to go next

  • Connect a production data source (Confluence, GitHub, Slack) via its dedicated connector and observe how Onyx handles incremental sync and access control propagation.
  • Deploy a custom MCP server (e.g., one wrapping an internal API or database) and attach it to an Onyx agent to create a tool-augmented assistant.
  • Experiment with Onyx's Deep Research feature, which executes multi-step research flows and produces structured reports — useful for competitive analysis, literature review, and due diligence tasks.
  • Evaluate retrieval quality systematically: prepare a set of test questions with known answers, measure whether the correct document chunks are retrieved and cited, and iterate on chunking and embedding strategies.
  • Explore Onyx's enterprise features (SSO, SCIM provisioning, usage analytics) if you plan to deploy for a team or organization.
