Portable inference

“Portable” here means: the entity and worker can run anywhere, while weights and Ollama stay where you want them — usually your own GPU — without handing prompts to a third-party API.

Layers

Ollama (or any backend the gateway forwards to) runs the models.
Inference gateway exposes a small OpenAI-compatible HTTP API: health, models list, chat completions, embeddings. Everything is bearer-authenticated.
Tunnel or edge (e.g. Cloudflare Tunnel) exposes only that HTTP port to the internet — not your whole LAN.
Worker or laptop sets BUMBLEBEE_INFERENCE_PROVIDER=remote_gateway and points BUMBLEBEE_INFERENCE_BASE_URL at the tunnel URL.

The Bumblebee process then behaves like it is calling a hosted API, but the compute stays on your hardware.

Why a dedicated gateway

The gateway is intentionally narrow:

No shell, filesystem, entity tools, or admin UI on that port.
Tunneled origin should terminate at the gateway, not at a catch-all reverse proxy that also exposes SSH or NAS UIs.

That keeps “inference path” and “host attack surface” separated. See Gateway for env vars, token setup, and bumblebee gateway helpers.

Swap the middle mile

As long as the worker sees a stable HTTPS URL and passes the same bearer token, you can replace pieces of the chain:

Different tunnel (Tailscale Funnel, frp, WireGuard + nginx, corporate egress) — still forward to 127.0.0.1:<gateway_port>.
Different edge auth (Cloudflare Access, mTLS in front of the gateway) — ensure the client still reaches an OpenAI-compatible /v1/chat/completions and /v1/embeddings with a token the gateway accepts (or terminate auth at the edge and forward with a static internal bearer).

The harness does not care how bytes reach your home — only that BUMBLEBEE_INFERENCE_BASE_URL resolves and the token matches.

Worker agents and hybrid deploy

On Railway, the worker agent (bumblebee worker) holds Telegram/Discord sessions, Postgres memory, and the daemon. It does not need a GPU if remote_gateway is configured: every reflex/deliberate/embed call crosses the tunnel to your gateway → Ollama. That pattern makes the social and memory footprint portable while keeping inference sovereignty on a machine you control. Step-by-step: Hybrid Railway.

Local vs remote in one codebase

Setting	Effect
`deployment.mode: local` (default)	`inference.provider` → local Ollama URL unless overridden.
`hybrid_railway` / `remote_gateway`	HTTP client to gateway; same entity code paths.
`openrouter` / `venice`	Optional hosted OpenAI-compatible API (Bearer key) for harness / product testing with frontier models—still the same entity code paths. Not a fork; local-first defaults unchanged. See Hosted inference (testing).

Optional env overrides are documented in Environment variables and the gateway setup wizard: bumblebee gateway setup.

Mental model

Treat inference as a replaceable endpoint (local socket vs tunneled HTTPS) and treat entity state as durable data (SQLite file vs DATABASE_URL). Bumblebee keeps cognition and memory logic the same; you choose where each layer runs.

​Layers

​Why a dedicated gateway

​Swap the middle mile

​Worker agents and hybrid deploy

​Local vs remote in one codebase

​Mental model