Hardware - Bumblebee

Bumblebee uses Ollama for local inference. VRAM requirements depend on which models you run and how large a context window you need.

GPU VRAM guide

Minimum (~8 GB)
Example GPUs: RX 7600 8 GB, RTX 3050 8 GB, Arc A770 8 GB
Smaller or quantized models only. Use aggressive quantization or point reflex at gemma4:e4b in entity YAML to lighten load. CPU-only via Ollama works for experiments but expect slow turns.

Recommended (~16 GB)
Example GPUs: RTX 4060 Ti 16 GB, RTX 4070 Ti Super 16 GB, RX 6800 XT 16 GB
The common target for the default Bumblebee stack. Runs gemma4:26b for both reflex and deliberate reasoning (same weights, one model loaded). The soma noise engine reuses the reflex model at zero extra VRAM cost. Close other GPU-heavy apps if near the limit.

Comfortable (24+ GB)

Spacious (32+ GB)

Default models

Model	Role	Approx. VRAM
`gemma4:26b`	Reflex + deliberate chat	~16 GB
`nomic-embed-text`	Memory similarity embeddings	~274 MB

The embedding model loads on demand alongside the chat model; there is no separate embedding service. Reflex and deliberate reasoning use the same weights with different token budgets, so only one chat model needs to be loaded at a time.

Optional models

Model	Role	When to use
`gemma4:e4b`	Fast reflex layer	Tight VRAM (~8 GB setups). Set as `cognition.reflex_model` in entity YAML.
`gemma3:1b`	Dedicated noise model	Different character of inner voice. Set in `soma.noise.model`. Costs extra VRAM.
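As a rough sketch, the two optional overrides from the table could look like this in an entity YAML file. The key paths `cognition.reflex_model` and `soma.noise.model` come from the table; the surrounding file layout and the choice to set both at once are assumptions for illustration.

```yaml
# Sketch of optional model overrides in an entity YAML file.
# Key paths are from the table above; the rest of the file is assumed.
cognition:
  reflex_model: "gemma4:e4b"   # lighter reflex layer for ~8 GB setups
soma:
  noise:
    model: "gemma3:1b"         # dedicated noise model; costs extra VRAM
```

On a tight VRAM budget, setting only `cognition.reflex_model` is likely the safer choice, since the dedicated noise model adds memory use rather than saving it.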

Context window and VRAM

Larger context windows use more memory. The default, `max_context_tokens: 32768` (32K), is a good balance for 16 GB cards. To raise it, set a larger value under `cognition` in entity YAML:

cognition:
  max_context_tokens: 65536   # 64K — needs more VRAM headroom
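To see why a larger window needs more headroom, here is the usual back-of-envelope estimate for the attention key/value cache in a standard transformer. The symbols are generic, not Gemma-specific values: $n_{\text{layers}}$ is the layer count, $n_{\text{kv}}$ the number of key/value heads, $d_{\text{head}}$ the head dimension, $T$ the context length in tokens, and $b$ the bytes per cached element. Models that use sliding-window attention in some layers grow more slowly than this, so treat it as an upper bound.

```latex
M_{\text{kv}}(T) \approx 2 \cdot n_{\text{layers}} \cdot n_{\text{kv}} \cdot d_{\text{head}} \cdot T \cdot b
\qquad\Longrightarrow\qquad
\frac{M_{\text{kv}}(65536)}{M_{\text{kv}}(32768)} = \frac{65536}{32768} = 2
```

Under this estimate, going from 32K to 64K at most doubles the cache portion of memory, while the model weights stay the same size.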

Ollama settings

For single-GPU setups, these defaults keep Ollama from overcommitting VRAM. The `ollama:reset` npm script sets them automatically.

OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_KEEP_ALIVE=60s
OLLAMA_CONTEXT_LENGTH=16384
OLLAMA_NUM_PARALLEL=1

npm run ollama:reset

MoE note

Gemma 4 uses a Mixture-of-Experts architecture. Active parameters per token are fewer than the total parameter count, which mainly lowers per-token compute. The full set of expert weights generally still has to be loaded, so plan VRAM around the total model size unless part of the model is offloaded to system RAM. Real-world fit depends on context length, thinking budget, quantization level, and concurrent platform activity.
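A rough starting estimate for weight memory multiplies the total parameter count by the bits stored per weight. The figures here are assumptions for illustration: $N_{\text{total}} = 26 \times 10^{9}$ is read off the `gemma4:26b` tag, and $b_w \approx 4.5$ bits per weight stands in for a typical 4-bit quantization with its scaling overhead.

```latex
M_{\text{weights}} \approx \frac{N_{\text{total}} \cdot b_w}{8}\ \text{bytes}
= \frac{26 \times 10^{9} \cdot 4.5}{8}\ \text{bytes}
\approx 14.6\ \text{GB}
```

Context memory and runtime overhead come on top of this, which is in the same range as the ~16 GB listed for the default model.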
