Hosted inference (testing)

Bumblebee is local-first: the default path runs open weights on your hardware via Ollama, without shipping prompts to a third-party inference API. That ethos is unchanged in the open-source project. This page documents an optional inference mode for a different job: exercising the harness as a product—routing, memory, tools, platforms, and cognition loops—against the most capable hosted models you choose (often called “frontier” or “bleeding edge” APIs). It is a test and evaluation lever, not a replacement for self-hosted inference as the recommended default.

Why this exists

Goal	Typical setup
Day-to-day use, privacy, no per-token bill for core chat	`local` or `remote_gateway` → your Ollama
Stress-test behavior, compare model families, demo “best case” replies	`openrouter` or `venice` → hosted OpenAI-compatible API

The implementation is intentionally thin: it reuses the same OpenAICompatibleTransport and provider surface that already serve Ollama and the home gateway. There is no separate “cloud edition” of the harness, only a different BUMBLEBEE_INFERENCE_PROVIDER value and an API key.

Supported providers

The harness recognizes two named presets (you can override the base URL if a vendor changes endpoints):

Provider	`BUMBLEBEE_INFERENCE_PROVIDER`	API key env (default)	Default base URL (root before `/v1`)
OpenRouter	`openrouter`	`OPENROUTER_API_KEY`	`https://openrouter.ai/api`
Venice AI	`venice`	`VENICE_API_KEY`	`https://api.venice.ai/api`

Both speak the OpenAI-compatible surface the harness already uses: /v1/chat/completions, /v1/embeddings, /v1/models. Health checks use GET /v1/models (not the home gateway’s GET /health on the tunnel root).
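As a rough illustration of that health probe, the sketch below builds a GET /v1/models request with Python's standard urllib. The helper name build_models_request is hypothetical and not part of the harness; the real transport may construct the request differently.

```python
import urllib.request


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET {base_url}/v1/models request carrying a Bearer token.

    Hypothetical helper for illustration only; it does not send anything.
    """
    url = base_url.rstrip("/") + "/v1/models"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# Usage (performs a real network call, so it is left commented out):
# req = build_models_request("https://openrouter.ai/api", "sk-or-...")
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status)
```

Sending the request and checking for a successful status is one plausible way to confirm that the base URL and key are usable before running the harness.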

Base URL convention

The transport builds API paths as {base_url}/v1/..., same as local Ollama (http://127.0.0.1:11434 + /v1/...). For hosted vendors, BUMBLEBEE_INFERENCE_BASE_URL must be the prefix that ends right before /v1—for OpenRouter that is https://openrouter.ai/api, not a URL that already contains /v1. If you omit BUMBLEBEE_INFERENCE_BASE_URL, the preset defaults above apply.
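The convention can be sketched as follows. The function names resolve_base_url and api_url are hypothetical, and the guard that rejects a root already ending in /v1 is an illustrative check, not documented harness behavior. The preset roots are the ones listed in the provider table.

```python
# Preset roots from the provider table; each ends right before /v1.
PRESET_BASE_URLS = {
    "openrouter": "https://openrouter.ai/api",
    "venice": "https://api.venice.ai/api",
}


def resolve_base_url(provider: str, override: str = "") -> str:
    """Return the override if one is set, else the preset root for the provider."""
    base = (override or PRESET_BASE_URLS[provider]).rstrip("/")
    if base.endswith("/v1"):
        # Assumed safety check: a root that already contains /v1 would
        # produce .../v1/v1/... once the transport appends its own prefix.
        raise ValueError("base URL must end right before /v1, not include it")
    return base


def api_url(base_url: str, path: str) -> str:
    """Join a root and an endpoint as {base_url}/v1/{path}."""
    return f"{base_url.rstrip('/')}/v1/{path.lstrip('/')}"
```

With these definitions, api_url(resolve_base_url("openrouter"), "chat/completions") yields https://openrouter.ai/api/v1/chat/completions, which mirrors how the local Ollama root http://127.0.0.1:11434 expands.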

Bearer token and custom key variable

Requests send Authorization: Bearer <key> using:

OPENROUTER_API_KEY or VENICE_API_KEY by default, or
The env var named in BUMBLEBEE_INFERENCE_API_KEY_ENV, or
The env var named by inference.api_key_env in the harness YAML, provided it is not the generic gateway name BUMBLEBEE_INFERENCE_GATEWAY_TOKEN (that name stays reserved for the home tunnel gateway).

This keeps gateway tokens and hosted API keys conceptually separate, even though both are sent as Bearer tokens in the Authorization header.
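The lookup can be pictured with the sketch below. The helper names are hypothetical, and the precedence shown (env override first, then YAML, then the preset default) is an assumption; this page lists the three sources but does not state their order.

```python
from typing import Dict, Mapping, Optional

# Reserved for the home tunnel gateway; never treated as a hosted API key name.
GATEWAY_TOKEN_ENV = "BUMBLEBEE_INFERENCE_GATEWAY_TOKEN"
DEFAULT_KEY_ENVS = {"openrouter": "OPENROUTER_API_KEY", "venice": "VENICE_API_KEY"}


def resolve_key_env_name(
    provider: str,
    env: Mapping[str, str],
    yaml_api_key_env: Optional[str] = None,
) -> str:
    """Pick the name of the env var that holds the hosted API key.

    Assumed precedence: BUMBLEBEE_INFERENCE_API_KEY_ENV, then YAML, then preset.
    """
    override = env.get("BUMBLEBEE_INFERENCE_API_KEY_ENV")
    if override:
        return override
    if yaml_api_key_env and yaml_api_key_env != GATEWAY_TOKEN_ENV:
        return yaml_api_key_env
    return DEFAULT_KEY_ENVS[provider]


def auth_header(
    provider: str,
    env: Mapping[str, str],
    yaml_api_key_env: Optional[str] = None,
) -> Dict[str, str]:
    """Build the Authorization header from whichever env var was selected."""
    name = resolve_key_env_name(provider, env, yaml_api_key_env)
    return {"Authorization": f"Bearer {env[name]}"}
```

Passing os.environ as env would read the live environment; a plain dict works the same way for testing.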

Configuration

Environment (recommended)

BUMBLEBEE_INFERENCE_PROVIDER=openrouter
OPENROUTER_API_KEY=sk-or-...

# Hosted stacks usually reject Ollama-only fields (e.g. options.num_ctx):
BUMBLEBEE_INFERENCE_PASS_NUM_CTX=false

Or for Venice:

BUMBLEBEE_INFERENCE_PROVIDER=venice
VENICE_API_KEY=...
BUMBLEBEE_INFERENCE_PASS_NUM_CTX=false

Optional overrides:

Variable	Purpose
`BUMBLEBEE_INFERENCE_BASE_URL`	Custom OpenAI-compat root (before `/v1`)
`BUMBLEBEE_INFERENCE_TIMEOUT`	HTTP timeout (seconds)
`BUMBLEBEE_INFERENCE_API_KEY_ENV`	Env var name that holds the Bearer key

For the full tables, see Environment variables. Copy-paste examples also live in the repo's .env.example under “Hosted brain”.

Harness YAML

In configs/default.yaml or an overlay, you can set:

inference:
  provider: openrouter   # or venice — often set via env instead
  base_url: ""           # empty uses provider default
  pass_num_ctx: false    # recommended for strict OpenAI-compat hosts

BUMBLEBEE_INFERENCE_PASS_NUM_CTX=true|false still overrides pass_num_ctx when set.
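A minimal sketch of that override rule, assuming a hypothetical helper effective_pass_num_ctx. The strict parsing of true and false is an assumption; the harness may accept other spellings.

```python
from typing import Mapping


def effective_pass_num_ctx(yaml_value: bool, env: Mapping[str, str]) -> bool:
    """Return the env value when it is set, otherwise the YAML value."""
    raw = env.get("BUMBLEBEE_INFERENCE_PASS_NUM_CTX")
    if raw is None or raw.strip() == "":
        return yaml_value
    value = raw.strip().lower()
    if value not in ("true", "false"):
        # Assumed behavior: fail loudly rather than guess at a typo.
        raise ValueError(f"expected true or false, got {raw!r}")
    return value == "true"
```

In this sketch an unset or empty variable leaves the YAML setting in effect, so pass_num_ctx: false in an overlay is enough when no env override exists.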

Models must match the host

models.reflex, models.deliberate, models.embedding, and any per-entity cognition.*_model overrides must use IDs your host actually serves (OpenRouter model slugs, Venice model ids, etc.). The harness will call /v1/models where supported; if your chosen id is wrong, you will see errors from the API, not from Bumblebee-specific logic. Embedding models in particular must exist on that provider if you rely on vector memory features.
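One way to catch a mismatched id early is to compare the configured ids against the host's model list. The sketch below assumes the OpenAI-style response shape for /v1/models (a data array of objects with an id field); the helper name missing_model_ids and the example ids are hypothetical.

```python
from typing import Any, Iterable, List, Mapping


def missing_model_ids(configured: Iterable[str], models_payload: Mapping[str, Any]) -> List[str]:
    """Return configured ids that do not appear in a /v1/models response body."""
    served = {entry["id"] for entry in models_payload.get("data", [])}
    return [model_id for model_id in configured if model_id not in served]


# Example with made-up ids; a real payload would come from GET /v1/models.
payload = {"object": "list", "data": [{"id": "example/chat-model"}, {"id": "example/embed-model"}]}
configured = ["example/chat-model", "example/reasoning-model", "example/embed-model"]
print(missing_model_ids(configured, payload))
```

Some hosts may list only a subset of what they serve, so an id reported as missing here is a hint to double-check rather than proof that the call will fail.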

Local vs hybrid Railway

Deployment	Brain
`BUMBLEBEE_DEPLOYMENT_MODE=local`	Hosted API from your laptop / CI
`hybrid_railway`	Hosted API from the Railway worker (and the API service, if it runs LLM paths). Set the same `BUMBLEBEE_INFERENCE_*` variables and API key on `bumblebee-worker` / `bumblebee-api`. No home tunnel or `INFERENCE_GATEWAY_TOKEN` is required for this mode.

Using hosted inference on Railway does not change Postgres, volumes, or platform behavior—only where /v1/chat/completions is executed.

Setup wizard

After you choose hybrid or local in bumblebee setup, the wizard offers an optional step: configure OpenRouter or Venice, write keys into .env, set BUMBLEBEE_INFERENCE_PASS_NUM_CTX=false, and optionally prompt for a custom base URL. For cloud bodies, mirror those variables in Railway yourself (the wizard explains this). For structured onboarding, see Setup & onboarding.

Relationship to other docs

Portable inference — mental model: inference as a replaceable endpoint; hosted presets are another endpoint, not a fork.
Gateway — home appliance for your Ollama; orthogonal to OpenRouter/Venice.
Hybrid Railway — default hybrid brain is still tunnel + gateway; this page is the alternative brain for evaluation.

Operational notes

Cost and data handling are between you and the provider; read their terms. This mode is opt-in and key-driven.
Tool calling and long contexts depend on the model and provider, not on Bumblebee’s local-first defaults.
If a host returns errors on extra JSON fields, keep pass_num_ctx off and avoid provider-specific assumptions beyond OpenAI-compat chat/embeddings.
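The last note can be illustrated with a sketch of how a chat payload might differ between the two settings. The function build_chat_payload is hypothetical; only the options.num_ctx field name comes from this page.

```python
from typing import Any, Dict, List


def build_chat_payload(
    model: str,
    messages: List[Dict[str, str]],
    num_ctx: int,
    pass_num_ctx: bool,
) -> Dict[str, Any]:
    """Build a /v1/chat/completions body, adding options.num_ctx only on request."""
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    if pass_num_ctx:
        # Ollama-only extension; strict OpenAI-compatible hosts may reject it.
        payload["options"] = {"num_ctx": num_ctx}
    return payload
```

With pass_num_ctx off, the body contains only the standard model and messages fields, which is the shape a strict hosted endpoint is most likely to accept.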

Summary

Open-source posture: defaults stay local and self-hosted; the repo remains Apache 2.0 and fully runnable without any hosted API. Product testing posture: flip BUMBLEBEE_INFERENCE_PROVIDER to openrouter or venice, set the key, align model ids, and run the same harness to see how your entity behaves on the strongest models you subscribe to—without changing the core architecture or licensing story.
