Onyx (formerly Danswer) has emerged as one of the most capable open-source AI platforms available, consolidating retrieval-augmented generation, deep research, code execution, and tool integration into a single deployable stack. Its trajectory reflects a broader shift in the AI tooling landscape: organizations increasingly want to own their AI infrastructure rather than rent it from a SaaS provider. The reasons are straightforward — data sovereignty, cost control, and the ability to customize every layer of the system.
What makes Onyx particularly notable is its architecture. Rather than being a thin wrapper around an LLM API, it is a full application layer: a hybrid search index (combining vector and keyword retrieval via Vespa), a background processing pipeline for ingesting documents from 50+ connectors, a model server for running local embedding and reranking models, and an agent framework that ties it all together. It supports every major LLM provider — from self-hosted options like Ollama to proprietary APIs from Anthropic, OpenAI, and Google — meaning the same platform can serve a privacy-focused team running everything locally and an enterprise using Claude or GPT-4o.
In this session, we will deploy Onyx locally, configure an LLM provider, connect a knowledge source, and build a custom agent that retrieves and reasons over that data. The goal is not merely to follow a procedure, but to understand the architectural decisions that make a platform like this work — and to develop the prompting skills needed to have an AI agent guide us through each stage.
Before running any deployment command, it is worth understanding what we are actually standing up. Onyx is not a single application — it is an orchestrated set of services, each with a distinct role. A deployment command that appears simple on the surface (a single curl | bash) actually provisions an entire distributed system on your machine.
The core services are:

- The API server and web UI, which handle chat, administration, and agent orchestration.
- Vespa, the hybrid search index that combines vector and keyword retrieval.
- The model server, which runs the local embedding and reranking models.
- Background workers, which ingest documents from connectors, chunk them, and index them into Vespa.
This architecture reflects a fundamental design principle in RAG systems: retrieval quality depends on the indexing pipeline as much as the generation model. Onyx runs its own embedding models so that document chunks are encoded in a consistent vector space, and Vespa's hybrid index ensures that both semantic similarity and exact keyword matches contribute to retrieval.
Onyx also offers a Lite deployment mode — a stripped-down configuration that omits the model server, Vespa, and background workers. Lite mode is essentially a chat UI with agent capabilities but without local RAG. It requires under 1 GB of memory and is useful for quickly testing the interface or for teams that only need chat and tool-calling features without document retrieval.
The agent should produce a clear comparison of the two deployment modes, noting that Standard requires approximately 16 GB of RAM (Vespa alone consumes 4–6 GB, the model server another 2–4 GB) while Lite operates comfortably under 1 GB. It should identify Vespa as the most resource-intensive component and recommend Standard for this workshop since RAG is a central learning objective. If your machine has fewer than 16 GB of RAM, the agent should suggest either closing other applications or using Lite mode with the understanding that connector and RAG steps will not apply.
There are two reasons. First, embedding models and chat models serve fundamentally different purposes. Embedding models encode text into fixed-dimensional vectors optimized for similarity search; they are small, fast, and deterministic. Using a large chat model for this task would be orders of magnitude more expensive and slower. Second, by running its own embedding model, Onyx ensures that the vector space is consistent across all indexed documents regardless of which chat LLM the user selects. You can switch from Claude to GPT-4o without re-indexing your entire document corpus — the embeddings remain valid because they were generated by Onyx's own model server.
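This separation can be illustrated with a toy sketch. The hash-based embed function below is a deterministic stand-in for Onyx's real embedding model; the point is that documents are encoded once, and the chat LLM never touches the index:

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy stand-in for Onyx's embedding model: small, deterministic,
    fixed-dimensional, and independent of any chat LLM."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Index documents once, using only the embedding model.
index = {doc: embed(doc) for doc in ["pto policy", "vpn setup guide"]}

# At query time, the same embedding model encodes the query. Swapping
# the chat LLM (Claude, GPT-4o, Ollama...) changes nothing here, so no
# re-indexing is needed.
query_vec = embed("pto policy")
best = max(index, key=lambda doc: cosine(index[doc], query_vec))
```

The invariant to notice: both sides of the similarity comparison come from the same model, so the vector space stays consistent no matter which generation model sits on top.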
With the architecture understood, we can now deploy. Onyx provides a single-command installer that handles pulling Docker images, generating configuration files, and launching all services via Docker Compose.
The deployment itself is straightforward, but the critical configuration step that follows — connecting an LLM provider — is where architectural understanding pays off. Onyx separates the retrieval layer (embeddings, search, reranking) from the generation layer (the chat LLM). This means you must configure at least one LLM provider for the generation layer, but the retrieval layer works independently using Onyx's built-in model server.
Onyx supports three categories of LLM provider: self-hosted local models (such as Ollama), hosted APIs from the major providers (Anthropic, OpenAI, Google), and custom OpenAI-compatible endpoints for anything else.
Provider configuration happens in the Onyx admin panel after the first launch. The admin panel is accessible at the root URL of your deployment (typically http://localhost:80 for Standard or http://localhost:3000 for Lite). On first access, you will create an admin account and be guided through initial setup, which includes selecting an LLM provider.
The agent should provide the one-line install command (curl -fsSL https://onyx.app/install_onyx.sh | bash) and explain what it does: clones the repository, generates environment configuration, and runs docker compose up. For Ollama users, it should note that the Ollama endpoint must be set to http://host.docker.internal:11434 so the Onyx container can reach the host machine's Ollama server, and that a model (e.g., llama3) must already be pulled. For Anthropic or OpenAI, it should explain that you will enter your API key and select a model in the admin panel's setup wizard. The agent should suggest verifying deployment with docker compose ps to confirm all containers show a healthy status.
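As a convenience, the docker compose ps check can be scripted. The sketch below parses the JSON output format supported by recent Docker Compose versions (one JSON object per line, with Name, State, and Health fields); field names may vary slightly across Compose releases:

```python
import json
import subprocess

def parse_compose_ps(output: str) -> dict[str, bool]:
    """Parse `docker compose ps --format json` output (one JSON object
    per line in recent Compose versions) into {container: ready?}."""
    status: dict[str, bool] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        svc = json.loads(line)
        # A container counts as ready if it is running and, when a
        # healthcheck is defined, reports "healthy".
        running = svc.get("State") == "running"
        health = svc.get("Health", "")
        status[svc.get("Name", "?")] = running and health in ("", "healthy")
    return status

def check_onyx_ready() -> dict[str, bool]:
    """Run the compose status check against the current directory's stack."""
    out = subprocess.run(
        ["docker", "compose", "ps", "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compose_ps(out)
```

Calling check_onyx_ready() from the Onyx deployment directory returns a per-container readiness map; any False entry is a container worth inspecting with docker compose logs.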
Configuration beyond the setup wizard is handled by editing the generated .env files directly (though environment variable overrides are supported for automated deployments).
When a user asks a question, Onyx's retrieval pipeline executes two parallel searches against Vespa: a dense vector search (using the query's embedding to find semantically similar document chunks) and a sparse keyword search (using BM25-style term matching). The results from both searches are merged and reranked by a cross-encoder model running on the model server. This hybrid approach addresses a well-known limitation of pure vector search: it struggles with exact terms, proper nouns, and acronyms that carry high information density but may not be semantically close to any training data. By combining both retrieval signals, Onyx achieves substantially higher recall than either method alone.
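The merge step can be sketched with Reciprocal Rank Fusion, a common scheme for combining ranked lists in hybrid retrieval. Onyx's exact merging and reranking weights are not documented here; this only illustrates the principle that a chunk found by both retrievers outranks one found by a single retriever:

```python
def rrf_merge(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Merge two ranked result lists with Reciprocal Rank Fusion.
    Each list contributes 1 / (k + rank + 1) to a chunk's score,
    so agreement between retrievers compounds."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# chunk-2 is ranked by BOTH the vector search and the keyword search,
# so it beats chunk-1, which only the vector search found.
dense_hits = ["chunk-1", "chunk-2", "chunk-3"]
sparse_hits = ["chunk-2", "chunk-4"]
merged = rrf_merge(dense_hits, sparse_hits)
```

In a production pipeline, the fused list would then go to the cross-encoder reranker, which scores each surviving chunk directly against the query text.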
A running Onyx instance without connected knowledge sources is functionally equivalent to a chat UI — useful, but not leveraging the platform's core capability. The power of Onyx lies in its ability to ingest, index, and retrieve information from your organization's actual data sources.
Onyx provides over 50 connectors, organized into categories: knowledge bases and wikis (Confluence, Notion), file storage (Google Drive), messaging (Slack), code hosting (GitHub, GitLab), ticketing and support (Jira, Zendesk), and generic options such as web crawling and direct file upload.
Each connector follows the same lifecycle: authenticate, specify scope (which repositories, channels, or folders to index), set a sync schedule, and let the background worker handle ingestion. The background worker fetches documents, splits them into chunks, generates embeddings via the model server, and indexes the chunks into Vespa. Depending on the volume of data, initial indexing can take minutes to hours.
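The lifecycle above reduces to a fetch, chunk, embed, index loop. The sketch below is purely schematic: fetch_documents, the naive fixed-width chunker, the toy embed function, and the in-memory index are all stand-ins (in Onyx, the model server produces embeddings and Vespa stores the chunks):

```python
def fetch_documents(source: str) -> list[str]:
    """Stand-in for a connector pulling raw documents from a source."""
    return ["Employee handbook text ...", "IT setup guide text ..."]

def chunk(text: str, size: int = 40) -> list[str]:
    """Naive fixed-width chunking, for illustration only."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> list[float]:
    """Stand-in for the model server's embedding call."""
    return [float(sum(map(ord, text)) % 97)]

# Stand-in for the Vespa index: (chunk_text, embedding) pairs.
index: list[tuple[str, list[float]]] = []

def ingest(source: str) -> int:
    """fetch -> chunk -> embed -> index; returns the chunk count."""
    count = 0
    for doc in fetch_documents(source):
        for piece in chunk(doc):
            index.append((piece, embed(piece)))
            count += 1
    return count
```

The background worker runs this loop on the sync schedule you configure, which is why initial indexing time scales with document volume.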
A critical concept here is Document Sets — curated subsets of your indexed documents. After connecting multiple sources, you can create Document Sets that group related content (e.g., "Engineering Documentation" combining GitHub repos and Confluence spaces). These sets become the knowledge boundaries you assign to specific agents.
For this workshop, we will use the simplest connector: file upload. This requires no external credentials and provides immediate results.
The agent should explain the file upload process: navigate to Admin Panel > Connectors > File, upload one or more files (PDF, DOCX, TXT, MD are all supported), and wait for the background worker to process them. It should note that processing time depends on file size and the model server's throughput — a handful of small documents should complete within a minute or two. To verify, navigate to the main chat interface and ask a question that can only be answered from the uploaded documents. A successful response will include citation markers linking back to specific document chunks. The agent should also recommend creating a Document Set for the uploaded files, explaining that this allows you to scope future agents to only this knowledge.
Document chunking is one of the most consequential decisions in any RAG system. Chunks that are too small lose context — a paragraph fragment may not contain enough information to answer a question on its own. Chunks that are too large waste the LLM's context window and dilute the signal with irrelevant text. Onyx uses a hierarchical chunking approach that respects document structure (headings, sections, paragraph boundaries) rather than naively splitting on a fixed character count. During retrieval, the cross-encoder reranker evaluates each chunk's relevance to the query, and only the top-ranked chunks are included in the LLM's prompt. This two-stage process — broad retrieval followed by precision reranking — is what allows Onyx to maintain quality even with large document corpora.
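Structure-aware splitting can be illustrated with a minimal heading-based chunker. This is a simplification: Onyx's hierarchical chunker also handles subsections, paragraph boundaries, and size caps, none of which are shown here:

```python
def split_by_headings(markdown: str) -> list[str]:
    """Split a markdown document at heading boundaries, so each chunk
    keeps a coherent section instead of an arbitrary character window."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # Start a new chunk whenever a heading begins a new section.
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

doc = ("# PTO Policy\nYou accrue 1.5 days per month.\n"
       "# VPN Setup\nInstall the VPN client first.")
sections = split_by_headings(doc)
```

Compare the result with a fixed 40-character split of the same text: the heading-based chunks each answer a question on their own, while the fixed split would cut the PTO sentence in half.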
With documents indexed and searchable, we can now build a custom agent — the component that transforms a generic LLM into a specialized assistant scoped to a particular domain, knowledge base, and set of capabilities.
In Onyx, an agent is defined by four dimensions:

- Instructions: the system prompt that establishes the domain, tone, and behavior for out-of-scope questions.
- Knowledge scope: the Document Sets the agent is allowed to search.
- Tools: the capabilities the agent may invoke (search, web search, code execution, image generation, MCP tools).
- Model: which configured LLM provider and model the agent uses.
Agent creation happens through the admin panel under the Agents section. The interface is straightforward, but the design decisions — what instructions to write, which knowledge to scope, which tools to enable — are where the real work lies.
The agent should help you draft a system prompt that establishes the domain, mandates citation of sources, and defines behavior for out-of-scope questions. It should recommend enabling the Search tool (for RAG retrieval) and disabling web search and code execution unless your use case requires them — fewer tools mean less ambiguity for the LLM about which tool to invoke. It should suggest 2–3 starter messages that demonstrate the agent's intended use, such as 'What does the documentation say about [specific topic]?' The agent should walk you through the admin panel flow: navigate to Agents > Create New Agent, enter the name and instructions, attach your Document Set, select the enabled tools, and save.
Onyx's agent framework uses the LLM itself to determine whether a query requires retrieval. When a user sends a message, the agent evaluates whether the question is likely answerable from its own training data or whether it needs to search the connected knowledge base. This is an agentic decision — the LLM examines the query and the available tools, then decides which tool (if any) to invoke. For a documentation agent with search enabled, most queries will trigger retrieval. However, a greeting like 'Hello' or a meta-question like 'What can you help me with?' will typically be handled directly. This tool-selection behavior is governed by the LLM's reasoning capabilities, which is one reason why model choice matters at the agent level.
The Model Context Protocol (MCP) is a standardization layer that allows LLM-based agents to invoke external tools through a uniform interface. Onyx's MCP integration means that any MCP-compatible server — whether it provides access to a database, an internal API, a CI/CD pipeline, or a third-party service — can be connected to Onyx and made available to specific agents.
The mental model is straightforward: an MCP server exposes a set of tools (each with a name, description, and input schema), and Onyx discovers those tools and presents them to the LLM as callable functions. When the LLM decides to invoke a tool, Onyx handles the protocol communication — sending the request to the MCP server and returning the result to the LLM for incorporation into its response.
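Concretely, an MCP server advertises tools as name, description, and input schema, and the platform maps each one into the function-calling shape chat LLM APIs accept. The tool below (query_orders_db) is invented for illustration; the inputSchema field is the one MCP tool listings actually use, while the target function-calling format varies by LLM provider:

```python
# Hypothetical tool description of the kind an MCP server advertises.
TOOL = {
    "name": "query_orders_db",
    "description": "Run a read-only lookup against the orders database.",
    "inputSchema": {
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},
        "required": ["customer_id"],
    },
}

def to_llm_function(tool: dict) -> dict:
    """Map an MCP tool description into a generic function-calling
    payload: same name and description, schema under 'parameters'."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "parameters": tool["inputSchema"],
    }

fn = to_llm_function(TOOL)
```

Because the schema travels with the tool, the LLM can validate and construct arguments without any Onyx-side code written for that specific tool.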
MCP servers come in two transport flavors: stdio, where Onyx launches the server as a local subprocess and communicates over standard input/output, and SSE (Server-Sent Events), where Onyx connects over HTTP to a server already running at a given URL.
Configuring MCP in Onyx involves two steps: first, register the MCP server in the admin panel (under Tools); second, enable specific MCP tools on the agents that should have access to them.
This architecture is powerful because it decouples the intelligence layer (the LLM) from the capability layer (what the agent can do). You can add new capabilities to an existing agent simply by connecting a new MCP server — no code changes to Onyx itself are required.
The agent should describe the two-step process: first, add the MCP server in the admin panel's Tools section by providing the server URL (for SSE) or command (for stdio), then navigate to the agent configuration and enable the newly discovered tools. It should explain that MCP tools are self-describing — each tool provides a name, description, and JSON schema for its inputs — so the LLM can reason about when and how to use them. For security, it should note that Onyx supports authentication options for MCP connections and that agents with tool access should be carefully scoped (e.g., a read-only database tool is safer than one with write access). Testing should involve asking the agent a question that requires using the MCP tool and verifying that the tool invocation appears in the response metadata.
Onyx supports both MCP and OpenAPI specification-based tool integration. The key difference is in the protocol layer. OpenAPI tools are defined by a static specification document (a JSON or YAML file describing REST endpoints); Onyx translates each endpoint into a callable tool. MCP tools are dynamically discovered from a running server and communicate through a standardized bidirectional protocol. In practice, OpenAPI integration is simpler for existing REST APIs that already have a specification, while MCP is more flexible for tools that need stateful interactions, streaming responses, or complex authentication flows. Both approaches result in the same end-user experience: the LLM can invoke the tool during a conversation.
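The translation from a static OpenAPI document to callable tools can be sketched as follows. The spec fragment is invented, and real converters handle request bodies, multiple parameter locations, and authentication, none of which appear here:

```python
# Invented OpenAPI fragment: one GET operation with a path parameter.
SPEC = {
    "paths": {
        "/orders/{order_id}": {
            "get": {
                "operationId": "getOrder",
                "summary": "Fetch a single order by id.",
                "parameters": [
                    {"name": "order_id", "in": "path", "required": True,
                     "schema": {"type": "string"}},
                ],
            }
        }
    }
}

def openapi_to_tools(spec: dict) -> list[dict]:
    """Translate each OpenAPI operation into a generic tool definition:
    operationId becomes the tool name, parameters become the schema."""
    tools = []
    for path, methods in spec["paths"].items():
        for method, op in methods.items():
            params = op.get("parameters", [])
            tools.append({
                "name": op["operationId"],
                "description": op.get("summary", f"{method.upper()} {path}"),
                "parameters": {
                    "type": "object",
                    "properties": {p["name"]: p["schema"] for p in params},
                    "required": [p["name"] for p in params if p.get("required")],
                },
            })
    return tools

tools = openapi_to_tools(SPEC)
```

Note the asymmetry with MCP: here the tool list is fixed by the document at registration time, whereas an MCP server can change its advertised tools without any re-import.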
I need to create a custom agent in Onyx that serves as a new-employee onboarding assistant. Here is what I need:
1. System instructions that establish the agent as a friendly but precise onboarding guide. It should answer questions about company policies, benefits, and procedures based only on the provided documentation. If a question falls outside the available documents, it should direct the employee to HR rather than speculate.
2. I have uploaded our employee handbook (PDF) and IT setup guide (markdown) as a Document Set called 'Onboarding Docs'. The agent should be scoped to only this set.
3. Enable the Search tool for RAG retrieval. Disable web search, code execution, and image generation — they are not relevant for this use case.
4. Suggest 3 starter messages that a new employee would find immediately useful, such as 'How do I set up my development environment?' or 'What is the PTO policy?'
5. Use our default LLM provider (Claude) for this agent.
Walk me through creating this in the Onyx admin panel and help me refine the system instructions.
In this session, we deployed Onyx — a full-featured open-source AI platform — and progressed from understanding its multi-service architecture through to building custom agents with scoped knowledge and tool access.
We began by examining the architectural separation between Onyx's retrieval layer (Vespa hybrid search, embedding model server, background indexing workers) and its generation layer (pluggable LLM providers). This separation is a key design decision: it allows you to change LLM providers without re-indexing documents, and it ensures retrieval quality is controlled by dedicated models optimized for that task.
We deployed the platform, configured an LLM provider, and connected a knowledge source through the file upload connector — the simplest entry point into Onyx's 50+ connector ecosystem. We then built a custom agent defined by four dimensions: instructions, knowledge scope, tools, and model selection. Finally, we examined how MCP integration extends agent capabilities by connecting external tools through a standardized protocol.
The central insight of this session is that a self-hosted AI platform is not merely a private ChatGPT alternative. It is an infrastructure decision — one that gives you control over the data pipeline, the retrieval strategy, the model selection, and the tool ecosystem. Onyx packages these concerns into a coherent, deployable system, but the design decisions about how to configure it — what knowledge to scope, what instructions to write, what tools to enable — remain squarely in your hands.