Cognition is how the entity thinks. Each turn flows through a decomposed pipeline, with a bounded agent loop for tool use and proactive context compaction for long conversations.
Routing
The router decides between reflex and deliberate before inference begins. Both profiles use the same model weights and tool registry; the difference is token budget and thinking mode.
- Reflex
- Deliberate
Reflex: Fast path. Lower token budget (reflex_max_tokens, default 512), thinking disabled. Used for casual chat, short questions, greetings. The router picks reflex when input is short, non-technical, and low-intensity, or when heuristics match slang, simple reactions, or brief questions under 80 characters.
The router uses a heuristic-first approach with an optional reflex-model classifier. If the heuristic is uncertain, a quick LLM call (8 tokens, temperature 0.2) classifies the input as CHAT, GROUNDED, EXACT, or DEEP.
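The heuristic-first routing described above can be sketched as follows. This is an illustration under stated assumptions, not the actual router: the marker lists, the `classify` callback, the fallback to deliberate when no classifier is available, and the mapping from classifier labels to profiles are all hypothetical. Only the 80-character threshold, the four labels, and the 8-token / temperature 0.2 call come from the text.

```python
import re
from typing import Callable, Optional

LABELS = ("CHAT", "GROUNDED", "EXACT", "DEEP")

# Hypothetical marker lists; the real heuristics are not documented here.
CASUAL_MARKERS = {"lol", "haha", "thanks", "hey", "hi"}
TECHNICAL_MARKERS = ("traceback", "stack trace", "error:", "```")


def heuristic_route(text: str) -> Optional[str]:
    """Return 'reflex', 'deliberate', or None when the heuristic is unsure."""
    lowered = text.lower().strip()
    if any(marker in lowered for marker in TECHNICAL_MARKERS):
        return "deliberate"
    words = set(re.findall(r"[a-z']+", lowered))
    if len(lowered) < 80 and words & CASUAL_MARKERS:
        return "reflex"
    return None


def route(text: str, classify: Optional[Callable[..., str]] = None) -> str:
    """Heuristics first; fall back to a short classifier call if unsure."""
    decision = heuristic_route(text)
    if decision is not None:
        return decision
    if classify is None:
        # Assumption: without a classifier, prefer the more careful profile.
        return "deliberate"
    label = classify(text, max_tokens=8, temperature=0.2).strip().upper()
    # Assumption: only CHAT maps to reflex; the other labels go deliberate.
    return "reflex" if label == "CHAT" else "deliberate"
```

Keeping the classifier behind a callback means the cheap heuristic path never pays for an LLM call, which matches the "heuristic-first" ordering in the text.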
System prompt separation
System prompt (stable)
Identity, personality monologue, voice rules, tool declarations. Cached by hashed fingerprint and only recompiled when emotional state or knowledge changes significantly.
Turn preamble (volatile)
Body state, procedural memory, project context, self-model, desktop session status. Injected as [Turn context] in the user message each turn.
Agent loop
The bounded agent loop runs tool calls in parallel via asyncio.gather when Gemma emits multiple calls in one step.
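A minimal sketch of running one step's tool calls concurrently with asyncio.gather. The registry shape, the call dictionaries, and the two demo tools are assumptions made for illustration; the text only states that multiple calls in one step run in parallel.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

ToolFn = Callable[..., Awaitable[Any]]


async def run_tool_calls(registry: Dict[str, ToolFn], calls: List[dict]) -> List[dict]:
    """Run every tool call from one model step concurrently."""

    async def run_one(call: dict) -> dict:
        tool = registry.get(call["name"])
        if tool is None:
            return {"name": call["name"], "ok": False, "output": "unknown tool"}
        try:
            output = await tool(**call.get("arguments", {}))
            return {"name": call["name"], "ok": True, "output": output}
        except Exception as exc:  # one failing tool should not cancel the rest
            return {"name": call["name"], "ok": False, "output": repr(exc)}

    # gather returns results in the order the awaitables were passed in,
    # so each result lines up with the call that produced it.
    return list(await asyncio.gather(*(run_one(c) for c in calls)))


# Hypothetical demo tools standing in for real registry entries.
async def read_file(path: str) -> str:
    await asyncio.sleep(0.01)
    return f"contents of {path}"


async def web_search(query: str) -> str:
    await asyncio.sleep(0.01)
    return f"results for {query}"


DEMO_REGISTRY = {"read_file": read_file, "web_search": web_search}
```

Catching exceptions inside each wrapper keeps a single failure from aborting the whole round, so failures can be reported back to the model alongside the successful results.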
Tool execution
Each tool call is dispatched to the registry. Results are appended to the conversation. A repeat guard prevents the same tool from being called three or more times in a row. Previews of the last 8 tool outputs (280 characters each) are tracked for the completion gate.
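The repeat guard and preview tracking could look roughly like this. The class and method names are hypothetical, and the sketch assumes that "280 chars, last 8" means the eight most recent outputs, each truncated to 280 characters.

```python
from collections import deque
from typing import Deque, Optional, Tuple

MAX_CONSECUTIVE = 2   # a third consecutive call to the same tool is refused
PREVIEW_CHARS = 280
PREVIEW_COUNT = 8


class ToolTracker:
    """Tracks consecutive tool use and recent output previews (illustrative)."""

    def __init__(self) -> None:
        self.last_tool: Optional[str] = None
        self.streak = 0
        # deque(maxlen=...) silently drops the oldest preview when full.
        self.previews: Deque[Tuple[str, str]] = deque(maxlen=PREVIEW_COUNT)

    def allow(self, name: str) -> bool:
        """Return False if this would be the third call in a row to `name`."""
        if name == self.last_tool:
            if self.streak >= MAX_CONSECUTIVE:
                return False
            self.streak += 1
        else:
            self.last_tool, self.streak = name, 1
        return True

    def record(self, name: str, output: str) -> None:
        """Store a truncated preview of the tool output for the completion gate."""
        self.previews.append((name, output[:PREVIEW_CHARS]))
```

Calling a different tool resets the streak, so the guard only blocks back-to-back repetition rather than total usage.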
Post-tool nudge
After each tool round, a short nudge confirms results are ready and flags any failures. An anti-repetition summary of what the user has already been told prevents the model from echoing itself. The nudge is deliberately minimal — the model has full conversation context and decides its own sequencing.
Strict Structured Output (Tool Forcing)
To prevent models from casually promising work in plain text without actually calling tools (e.g., “I’m on it! Let me search for that…”), Bumblebee’s inference layer enforces tool_choice="required". Because conversational actions like say and end_turn are registered as literal tools, this forces the model into a strict JSON-only mode. The model cannot reply with raw conversational text; if it wants to speak, it must emit a say tool call. If it wants to act, it emits a work tool call. Every step therefore commits the model to concrete actions, which sharply reduces stalling and “teased” deliverables.
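As an illustration, here is what a forced-tool request and the matching dispatch might look like, assuming an OpenAI-style chat-completions payload. The payload shape, the `tool_spec` helper, and the parameter schemas for say and write_file are assumptions; the text only confirms that say and end_turn are registered as tools and that tool_choice="required" is enforced.

```python
import json
from typing import Any, Dict, List, Tuple


def tool_spec(name: str, description: str, properties: dict, required: list) -> dict:
    """Build one function-tool declaration (hypothetical helper)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }


TOOLS = [
    tool_spec("say", "Send a message to the user.", {"text": {"type": "string"}}, ["text"]),
    tool_spec("end_turn", "Finish the current turn.", {}, []),
    tool_spec(
        "write_file",
        "Write text to a file.",
        {"path": {"type": "string"}, "content": {"type": "string"}},
        ["path", "content"],
    ),
]


def build_request(messages: List[dict]) -> Dict[str, Any]:
    # "required" means the model must answer with at least one tool call.
    return {"messages": messages, "tools": TOOLS, "tool_choice": "required"}


def handle_tool_calls(tool_calls: List[dict]) -> Tuple[List[str], List[tuple], bool]:
    """Split one step's calls into spoken text, work calls, and an end flag."""
    spoken, work, ended = [], [], False
    for call in tool_calls:
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"] or "{}")
        if name == "say":
            spoken.append(args["text"])
        elif name == "end_turn":
            ended = True
        else:
            work.append((name, args))
    return spoken, work, ended
```

Because speech arrives as a say call rather than free text, the loop can tell user-facing output apart from work with a simple name check.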
Completion gate
The gate decides whether the agent loop may end for this user turn. It works on all user-visible text for the turn (the final assistant reply plus anything already sent via say() or intermediate delivery), so a hollow mid-turn message cannot “count” while the final slot stays empty.
Work tools vs agency tools: Only work tools count for grounding (filesystem, shell, code, web, MCP, etc.). Agency tools (think, say, wait, end_turn) do not. If work tools ran or the user explicitly demanded tool grounding, a small reflex judge (DONE: / CONTINUE:) checks that the visible reply actually reflects tool results and is not a thin acknowledgement.
No work tools: If the turn used only agency tools (or plain text) and no work tool completed successfully, a second reflex judge, action adequacy, decides whether the user asked for tangible work (code, files, commands, live data, etc.) that was only promised or hand-waved rather than delivered. That check is intent-based (any language or tone), not a list of English catch-phrases, so the loop can continue with a nudge to use write_file, run_command, and the like when appropriate. This check also runs when the model explicitly calls end_turn without having used work tools, preventing premature turn endings where the model teased a deliverable via say() without following through.
Heuristics still catch obvious cases (empty reply, progress-only chatter, token-limit stalls) before the judges run.
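The gate's branching can be sketched as a small decision function. The `TurnState` fields and the returned labels are hypothetical, only the empty-reply heuristic is shown, and the sketch does not distinguish failed work tools from successful ones.

```python
from dataclasses import dataclass, field
from typing import List

# Agency tools named in the text; anything else is treated as a work tool.
AGENCY_TOOLS = {"think", "say", "wait", "end_turn"}


@dataclass
class TurnState:
    final_reply: str = ""
    said: List[str] = field(default_factory=list)        # text sent via say()
    tools_used: List[str] = field(default_factory=list)  # tool names, in order
    user_demanded_grounding: bool = False

    def visible_text(self) -> str:
        """All user-visible text for the turn: say() messages plus final reply."""
        parts = [*self.said, self.final_reply]
        return "\n".join(p for p in parts if p.strip())


def pick_check(turn: TurnState) -> str:
    """Return which check the gate would apply next (illustrative labels)."""
    # Cheap heuristics run before either judge.
    if not turn.visible_text():
        return "heuristic:empty_reply"
    work_ran = any(name not in AGENCY_TOOLS for name in turn.tools_used)
    if work_ran or turn.user_demanded_grounding:
        # Judge 1: does the visible reply actually reflect tool results?
        return "judge:grounding"
    # Judge 2: was tangible work requested but only promised?
    return "judge:action_adequacy"
```

Evaluating `visible_text()` rather than only the final reply is what lets a turn spoken entirely through say() still be judged on what the user actually saw.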
Intermediate delivery
On Telegram and Discord, user-facing messages are delivered via say() during tool rounds. Text content alongside tool calls is treated as internal reasoning and is not forwarded to the user. When the model has communicated entirely through say() and no work tools were used, the final reply text is suppressed to prevent redundant echo messages.
Loop limits
Tool continuation rounds are clamped to [0, 16]. Total agent steps cap at max(6, min(25, 6 + rounds)), which with the clamp applied ranges from 6 steps (0 rounds) to 22 steps (16 rounds). If the model hits length limits 3 consecutive times, it’s told to use write_file for long output.
Context compaction
Long conversations can exceed the model’s context window. The compaction system fires before inference when estimated tokens approach the limit, so the model never hits the ceiling.
When it triggers
Compaction fires when estimated tokens exceed max_context_tokens * compaction_threshold_ratio (default 75%). Up to compaction_max_passes (default 3) rounds run until the context fits.
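A minimal sketch of the trigger and pass loop. The function names and the `estimate` and `compact_once` callbacks are hypothetical, and the sketch assumes "fits" means the estimate has dropped back to or below the trigger threshold.

```python
from typing import Callable, List, Tuple


def needs_compaction(estimated_tokens: int, max_context_tokens: int,
                     threshold_ratio: float = 0.75) -> bool:
    """True when the estimate exceeds the configured share of the window."""
    return estimated_tokens > max_context_tokens * threshold_ratio


def compact_until_fits(
    messages: List[str],
    estimate: Callable[[List[str]], int],
    compact_once: Callable[..., List[str]],
    max_context_tokens: int,
    threshold_ratio: float = 0.75,
    max_passes: int = 3,
) -> Tuple[List[str], int]:
    """Run compaction passes before inference; return (messages, passes_used)."""
    passes = 0
    while passes < max_passes and needs_compaction(
        estimate(messages), max_context_tokens, threshold_ratio
    ):
        # The first pass is flagged so it can also run the memory flush.
        messages = compact_once(messages, first_pass=(passes == 0))
        passes += 1
    return messages, passes
```

Checking the estimate before every pass means a conversation that already fits costs nothing, and the pass cap bounds the work when compaction cannot shrink the context enough.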
The four phases
Memory flush (first pass only)
An LLM reviews middle turns and extracts durable facts into knowledge.md as JSON {title, body} objects. Skips locked sections, deduplicates against existing titles. Best-effort: if it fails, compaction proceeds.
Prune old tool results
No LLM call. Tool outputs older than the protected tail are replaced with [Old tool output cleared to save context space]. Tool results are the biggest token consumers and are usually redundant once interpreted.
Find boundaries
The conversation splits into three regions: a protected head (first 2 messages), a middle (summarized and removed), and a protected tail (by token budget, minimum 12 messages). Boundaries align to avoid splitting tool_call / tool_result groups.
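The boundary search can be sketched as below. It assumes messages are dictionaries with a "role" key, that tool results use the role "tool" and directly follow the assistant message that requested them, and that `estimate` returns a token count per message. The function name and return shape are hypothetical.

```python
from typing import Callable, List, Tuple

HEAD_MESSAGES = 2
MIN_TAIL_MESSAGES = 12


def find_boundaries(
    messages: List[dict],
    tail_token_budget: int,
    estimate: Callable[[dict], int],
) -> Tuple[int, int]:
    """Return (head_end, tail_start); messages[head_end:tail_start] is the middle."""
    n = len(messages)
    head_end = min(HEAD_MESSAGES, n)

    # Grow the tail backwards until the token budget is spent, but always
    # keep at least MIN_TAIL_MESSAGES when that many are available.
    tail_start, used = n, 0
    while tail_start > head_end:
        cost = estimate(messages[tail_start - 1])
        if n - tail_start >= MIN_TAIL_MESSAGES and used + cost > tail_token_budget:
            break
        used += cost
        tail_start -= 1

    # Align the cuts: a region must not begin with a tool result, because
    # that would separate it from the assistant message that requested it.
    while head_end < tail_start and messages[head_end]["role"] == "tool":
        head_end += 1  # pull trailing tool results into the protected head
    while head_end < tail_start < n and messages[tail_start]["role"] == "tool":
        tail_start -= 1  # extend the tail back to the requesting message
    return head_end, tail_start
```

Both alignment steps only ever enlarge a protected region, so a tool group is either kept whole or summarized whole.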