Documentation Index
Fetch the complete documentation index at: https://docs.bumbleagi.com/llms.txt
Use this file to discover all available pages before exploring further.
Bumblebee uses Ollama for local inference. VRAM requirements depend on which models you run and how large a context window you need.
GPU VRAM guide
Minimum (~8 GB)
Recommended (~16 GB)
Comfortable (24+ GB)
Spacious (32+ GB)
Example GPUs: RX 7600 8 GB, RTX 3050 8 GB, Arc A770 8 GBSmaller or quantized models only. Use aggressive quantization or point reflex at gemma4:e4b in entity YAML to lighten load. CPU-only via Ollama works for experiments but expect slow turns.
Example GPUs: RTX 4060 Ti 16 GB, RTX 4070 Ti Super 16 GB, RX 6800 XT 16 GBThe common target for the default Bumblebee stack. Runs gemma4:26b for both reflex and deliberate reasoning (same weights, one model loaded). The soma noise engine reuses the reflex model at zero extra VRAM cost. Close other GPU-heavy apps if near the limit.
Example GPUs: RTX 3090 24 GB, RTX 4090 24 GB, RX 7900 XTX 24 GBComfortable dual-model setups with headroom. Room for larger context windows, higher thinking budgets, or a dedicated noise model without juggling VRAM.
Example GPUs: RTX 5090 32 GB, professional cardsRoom for larger deliberate models, extended context (64K+), or separate deliberate weights alongside the reflex model.
Default models
| Model | Role | Approx. VRAM |
|---|
gemma4:26b | Reflex + deliberate chat | ~16 GB |
nomic-embed-text | Memory similarity embeddings | ~274 MB |
The embedding model loads on demand alongside the chat model — there is no separate embedding service. Both reflex and deliberate use the same weights with different token budgets, so only one model needs to be loaded at a time.
Optional models
| Model | Role | When to use |
|---|
gemma4:e4b | Fast reflex layer | Tight VRAM (~8 GB setups). Set as cognition.reflex_model in entity YAML. |
gemma3:1b | Dedicated noise model | Different character of inner voice. Set in soma.noise.model. Costs extra VRAM. |
Context window and VRAM
Larger context windows use more memory. The default max_context_tokens: 32768 (32K) is a good balance for 16 GB cards.
cognition:
max_context_tokens: 65536 # 64K — needs more VRAM headroom
Ollama settings
For single-GPU setups, these defaults prevent overcommitting. The ollama:reset npm script sets them automatically.
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_KEEP_ALIVE=60s
OLLAMA_CONTEXT_LENGTH=16384
OLLAMA_NUM_PARALLEL=1
MoE note
Gemma 4 uses a Mixture-of-Experts architecture. Active parameters per token are lower than the full model size, so actual VRAM usage during inference can be less than the raw parameter count suggests. Real-world fit depends on context length, thinking budget, quantization level, and concurrent platform activity.