Bumblebee uses Ollama for local inference. VRAM requirements depend on which models you run and how large a context window you need.

GPU VRAM guide

Example GPUs: RX 7600 8 GB, RTX 3050 8 GB, Arc A770 8 GB

Smaller or quantized models only. Use aggressive quantization, or set cognition.reflex_model to gemma4:e4b in entity YAML to lighten the load. CPU-only inference via Ollama works for experiments, but expect slow turns.

Default models

| Model | Role | Approx. VRAM |
| --- | --- | --- |
| gemma4:26b | Reflex + deliberate chat | ~16 GB |
| nomic-embed-text | Memory similarity embeddings | ~274 MB |

The embedding model loads on demand alongside the chat model; there is no separate embedding service. Reflex and deliberate use the same weights with different token budgets, so only one chat model needs to be loaded at a time.

Optional models

| Model | Role | When to use |
| --- | --- | --- |
| gemma4:e4b | Fast reflex layer | Tight VRAM (~8 GB setups). Set as cognition.reflex_model in entity YAML. |
| gemma3:1b | Dedicated noise model | Different character of inner voice. Set in soma.noise.model. Costs extra VRAM. |
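
For an 8 GB setup, the settings named above could be written in entity YAML roughly as follows. This is only a sketch: the nesting is inferred from the dotted key names cognition.reflex_model and soma.noise.model, and the rest of the entity file is omitted.

```yaml
# Sketch of an entity YAML fragment; nesting is assumed from the dotted key names.
cognition:
  reflex_model: gemma4:e4b   # fast reflex layer for tight VRAM

# Optional: a dedicated noise model. It costs extra VRAM,
# so it is left commented out here for a tight-VRAM card.
# soma:
#   noise:
#     model: gemma3:1b
```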

Context window and VRAM

Larger context windows use more memory. The default max_context_tokens: 32768 (32K) is a good balance for 16 GB cards.
cognition:
  max_context_tokens: 65536   # 64K — needs more VRAM headroom
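
Going the other way, a tight-VRAM setup might lower the window instead of raising it. The value below is illustrative rather than a documented recommendation, and it assumes max_context_tokens accepts values below the 32768 default.

```yaml
cognition:
  max_context_tokens: 16384   # 16K, illustrative value to reduce memory use on smaller cards
```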

Ollama settings

For single-GPU setups, these defaults prevent overcommitting VRAM:
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_KEEP_ALIVE=60s
OLLAMA_CONTEXT_LENGTH=16384
OLLAMA_NUM_PARALLEL=1

The ollama:reset npm script sets them automatically:
npm run ollama:reset

MoE note

Gemma 4 uses a Mixture-of-Experts architecture. Fewer parameters are active per token than the full model size, which mainly lowers per-token compute; the weights for all experts generally still have to be resident, so treat the VRAM figures above as the baseline rather than expecting large savings. Real-world fit depends on context length, thinking budget, quantization level, and concurrent platform activity.