Bumblebee is local-first: the default path runs open weights on your hardware via Ollama, without shipping prompts to a third-party inference API. That ethos is unchanged in the open-source project.
This page documents an optional inference mode for a different job: exercising the harness as a product—routing, memory, tools, platforms, and cognition loops—against the most capable hosted models you choose (often called “frontier” or “bleeding edge” APIs). It is a test and evaluation lever, not a replacement for self-hosted inference as the recommended default.
Why this exists
| Goal | Typical setup |
|---|---|
| Day-to-day use, privacy, no per-token bill for core chat | local or remote_gateway → your Ollama |
| Stress-test behavior, compare model families, demo “best case” replies | openrouter or venice → hosted OpenAI-compatible API |
The implementation is intentionally thin: it reuses the same OpenAICompatibleTransport and provider surface used for Ollama and the home gateway. There is no separate “cloud edition” of the harness, only a different BUMBLEBEE_INFERENCE_PROVIDER and API key.
Supported providers
The harness recognizes two named presets (you can override the base URL if a vendor changes endpoints):
| Provider | BUMBLEBEE_INFERENCE_PROVIDER | API key env (default) | Default base URL (root before /v1) |
|---|---|---|---|
| OpenRouter | openrouter | OPENROUTER_API_KEY | https://openrouter.ai/api |
| Venice AI | venice | VENICE_API_KEY | https://api.venice.ai/api |
Both speak the OpenAI-compatible surface the harness already uses: /v1/chat/completions, /v1/embeddings, /v1/models. Health checks use GET /v1/models (not the home gateway’s GET /health on the tunnel root).
Base URL convention
The transport builds API paths as {base_url}/v1/..., same as local Ollama (http://127.0.0.1:11434 + /v1/...). For hosted vendors, BUMBLEBEE_INFERENCE_BASE_URL must be the prefix that ends right before /v1—for OpenRouter that is https://openrouter.ai/api, not a URL that already contains /v1.
If you omit BUMBLEBEE_INFERENCE_BASE_URL, the preset defaults above apply.
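The convention above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: `resolve_base_url` and `api_url` are hypothetical helper names, and the guard that rejects a root already ending in /v1 is an assumption added for clarity rather than documented behavior.

```python
from typing import Optional

# Preset roots copied from the provider table; each ends right before /v1.
PRESET_BASE_URLS = {
    "openrouter": "https://openrouter.ai/api",
    "venice": "https://api.venice.ai/api",
}


def resolve_base_url(provider: str, override: Optional[str] = None) -> str:
    """Return the root that sits before /v1 (hypothetical helper)."""
    # An empty or missing override falls back to the preset default.
    base = (override or "").strip() or PRESET_BASE_URLS[provider]
    base = base.rstrip("/")
    if base.endswith("/v1"):
        # Illustrative guard: a root that already contains /v1 would
        # produce .../v1/v1/... once the endpoint path is appended.
        raise ValueError(f"base URL must end before /v1, got {base!r}")
    return base


def api_url(base_url: str, endpoint: str) -> str:
    """Join the root and an endpoint as {base_url}/v1/{endpoint}."""
    return f"{base_url}/v1/{endpoint.lstrip('/')}"
```

For example, `api_url(resolve_base_url("openrouter"), "chat/completions")` yields `https://openrouter.ai/api/v1/chat/completions`, and passing `http://127.0.0.1:11434` as the override produces the local Ollama form of the same path.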
Bearer token and custom key variable
Requests send Authorization: Bearer <key> using:
- OPENROUTER_API_KEY or VENICE_API_KEY by default, or
- The env var named in BUMBLEBEE_INFERENCE_API_KEY_ENV, or
- Harness inference.api_key_env in YAML when it is not the generic gateway name BUMBLEBEE_INFERENCE_GATEWAY_TOKEN (that name stays reserved for the home tunnel gateway).
This keeps gateway tokens and hosted API keys from colliding conceptually, even though both are Bearer headers.
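One way to picture the key lookup is the sketch below. The function names are hypothetical, and the precedence shown (BUMBLEBEE_INFERENCE_API_KEY_ENV first, then a non-gateway inference.api_key_env from YAML, then the preset default) is an assumption; the text above lists the three sources but does not state their order.

```python
import os
from typing import Dict, Mapping, Optional

# Reserved for the home tunnel gateway; never treated as a hosted API key name.
GATEWAY_TOKEN_ENV = "BUMBLEBEE_INFERENCE_GATEWAY_TOKEN"
DEFAULT_KEY_ENVS = {"openrouter": "OPENROUTER_API_KEY", "venice": "VENICE_API_KEY"}


def resolve_key_env(
    provider: str,
    yaml_api_key_env: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the env var NAME that holds the Bearer key (assumed precedence)."""
    override = env.get("BUMBLEBEE_INFERENCE_API_KEY_ENV", "").strip()
    if override:
        return override
    if yaml_api_key_env and yaml_api_key_env != GATEWAY_TOKEN_ENV:
        return yaml_api_key_env
    return DEFAULT_KEY_ENVS[provider]


def auth_headers(
    provider: str,
    yaml_api_key_env: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Build the Authorization header from the resolved env var."""
    name = resolve_key_env(provider, yaml_api_key_env, env)
    key = env.get(name, "")
    if not key:
        raise RuntimeError(f"{name} is not set")
    return {"Authorization": f"Bearer {key}"}
```

Passing `env` as a plain dictionary keeps the lookup easy to test without touching the real process environment.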
Configuration
Environment (recommended)
BUMBLEBEE_INFERENCE_PROVIDER=openrouter
OPENROUTER_API_KEY=sk-or-...
# Hosted stacks usually reject Ollama-only fields (e.g. options.num_ctx):
BUMBLEBEE_INFERENCE_PASS_NUM_CTX=false
Or for Venice:
BUMBLEBEE_INFERENCE_PROVIDER=venice
VENICE_API_KEY=...
BUMBLEBEE_INFERENCE_PASS_NUM_CTX=false
Optional overrides:
| Variable | Purpose |
|---|---|
| BUMBLEBEE_INFERENCE_BASE_URL | Custom OpenAI-compat root (before /v1) |
| BUMBLEBEE_INFERENCE_TIMEOUT | HTTP timeout (seconds) |
| BUMBLEBEE_INFERENCE_API_KEY_ENV | Env var name that holds the Bearer key |
Full tables: Environment variables. Copy-paste examples also live in the repo .env.example under “Hosted brain”.
Harness YAML
In configs/default.yaml or an overlay, you can set:
inference:
provider: openrouter # or venice — often set via env instead
base_url: "" # empty uses provider default
pass_num_ctx: false # recommended for strict OpenAI-compat hosts
BUMBLEBEE_INFERENCE_PASS_NUM_CTX=true|false still overrides pass_num_ctx when set.
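The override rule can be expressed as a small function. This is a sketch under stated assumptions: `effective_pass_num_ctx` is a hypothetical name, and rejecting values other than true or false is an illustrative choice, since the page only documents those two values.

```python
import os
from typing import Mapping


def effective_pass_num_ctx(
    yaml_value: bool, env: Mapping[str, str] = os.environ
) -> bool:
    """Return the env value when set, otherwise the YAML pass_num_ctx."""
    raw = env.get("BUMBLEBEE_INFERENCE_PASS_NUM_CTX")
    if raw is None or raw.strip() == "":
        return yaml_value  # env unset: the YAML setting decides
    value = raw.strip().lower()
    if value not in ("true", "false"):
        # Assumption: anything else is treated as a configuration error.
        raise ValueError(f"expected true or false, got {raw!r}")
    return value == "true"
```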
Models must match the host
models.reflex, models.deliberate, models.embedding, and any per-entity cognition.*_model overrides must use IDs your host actually serves (OpenRouter model slugs, Venice model ids, etc.). The harness will call /v1/models where supported; if your chosen id is wrong, you will see errors from the API, not from Bumblebee-specific logic.
Embedding models in particular must exist on that provider if you rely on vector memory features.
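Before a long run, it can help to compare your configured ids against what the host lists. The sketch below assumes the host returns the OpenAI-style /v1/models body (`{"data": [{"id": ...}, ...]}`); the helper names and the model ids in the usage example are placeholders, not real slugs.

```python
from typing import Dict, Mapping, Set


def served_model_ids(models_response: Mapping) -> Set[str]:
    """Collect ids from an OpenAI-style /v1/models response body."""
    return {item["id"] for item in models_response.get("data", []) if "id" in item}


def missing_models(
    configured: Mapping[str, str], models_response: Mapping
) -> Dict[str, str]:
    """Return the config slots whose model id the host does not list."""
    served = served_model_ids(models_response)
    return {slot: mid for slot, mid in configured.items() if mid not in served}


# Placeholder ids for illustration only.
sample_response = {"data": [{"id": "vendor/chat-a"}, {"id": "vendor/embed-a"}]}
configured = {
    "models.reflex": "vendor/chat-a",
    "models.deliberate": "vendor/chat-b",
    "models.embedding": "vendor/embed-a",
}
problems = missing_models(configured, sample_response)
```

Here `problems` contains only the `models.deliberate` entry, which is the kind of mismatch that would otherwise surface later as an API error.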
Local vs hybrid Railway
| Deployment | Brain |
|---|---|
| BUMBLEBEE_DEPLOYMENT_MODE=local | Hosted API from your laptop / CI |
| hybrid_railway | Hosted API from the Railway worker (and API service if it runs LLM paths). Set the same BUMBLEBEE_INFERENCE_* and API key on bumblebee-worker / bumblebee-api. No home tunnel or INFERENCE_GATEWAY_TOKEN required for this mode. |
Using hosted inference on Railway does not change Postgres, volumes, or platform behavior—only where /v1/chat/completions is executed.
Setup wizard
After you choose hybrid or local in bumblebee setup, the wizard offers an optional step: configure OpenRouter or Venice, write keys into .env, set BUMBLEBEE_INFERENCE_PASS_NUM_CTX=false, and optionally prompt for a custom base URL. For cloud bodies, mirror those variables in Railway yourself (the wizard explains this).
Structured onboarding: Setup & onboarding.
Relationship to other docs
- Portable inference — mental model: inference as a replaceable endpoint; hosted presets are another endpoint, not a fork.
- Gateway — home appliance for your Ollama; orthogonal to OpenRouter/Venice.
- Hybrid Railway — default hybrid brain is still tunnel + gateway; this page is the alternative brain for evaluation.
Operational notes
- Cost and data handling are between you and the provider; read their terms. This mode is opt-in and key-driven.
- Tool calling and long contexts depend on the model and provider, not on Bumblebee’s local-first defaults.
- If a host returns errors on extra JSON fields, keep pass_num_ctx off and avoid provider-specific assumptions beyond OpenAI-compat chat/embeddings.
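To make the last note concrete, here is a minimal sketch of how a request body might differ with pass_num_ctx on and off. The function name and the exact body shape are assumptions for illustration; the only detail taken from this page is that options.num_ctx is an Ollama-only field that strict hosts may reject.

```python
from typing import Any, Dict, List, Optional


def build_chat_payload(
    model: str,
    messages: List[Dict[str, str]],
    num_ctx: Optional[int],
    pass_num_ctx: bool,
) -> Dict[str, Any]:
    """Build a chat body, adding options.num_ctx only when allowed."""
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    if pass_num_ctx and num_ctx is not None:
        # Ollama-only extension; omitted for strict OpenAI-compat hosts.
        payload["options"] = {"num_ctx": num_ctx}
    return payload
```

With `pass_num_ctx=False` the body carries only `model` and `messages`, which is the shape a strict OpenAI-compatible host expects.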
Summary
Open-source posture: defaults stay local and self-hosted; the repo remains Apache 2.0 and fully runnable without any hosted API.
Product testing posture: flip BUMBLEBEE_INFERENCE_PROVIDER to openrouter or venice, set the key, align model ids, and run the same harness to see how your entity behaves on the strongest models you subscribe to—without changing the core architecture or licensing story.