Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.bumbleagi.com/llms.txt

Use this file to discover all available pages before exploring further.

Cognition is how the entity thinks. Each turn flows through a decomposed pipeline, with a bounded agent loop for tool use and proactive context compaction for long conversations.

Routing

The router decides between reflex and deliberate before inference begins. Both profiles use the same model weights and tool registry — the difference is token budget and thinking mode.
Fast path. Lower token budget (reflex_max_tokens, default 512), thinking disabled. Used for casual chat, short questions, greetings. The router picks reflex when input is short, non-technical, and low-intensity — or when heuristics match slang, simple reactions, or brief questions under 80 characters.
The router uses a heuristic-first approach with an optional reflex-model classifier. If the heuristic is uncertain, a quick LLM call (8 tokens, temperature 0.2) classifies the input as CHAT, GROUNDED, EXACT, or DEEP.

System prompt separation

System prompt (stable)

Identity, personality monologue, voice rules, tool declarations. Cached by hashed fingerprint — only recompiled when emotional state or knowledge changes significantly.

Turn preamble (volatile)

Body state, procedural memory, project context, self-model, desktop session status. Injected as [Turn context] in the user message each turn.

Agent loop

The bounded agent loop runs tool calls in parallel via asyncio.gather when Gemma emits multiple calls in one step.
Each tool call is dispatched to the registry. Results are appended to the conversation. A repeat guard prevents the same tool from being called 3+ times consecutively. Tool output previews (280 chars, last 8) are tracked for the completion gate.
After each tool round, a short nudge confirms results are ready and flags any failures. An anti-repetition summary of what the user has already been told prevents the model from echoing itself. The nudge is deliberately minimal — the model has full conversation context and decides its own sequencing.
To prevent models from casually promising work in plain text without actually calling tools (e.g., “I’m on it! Let me search for that…”), Bumblebee’s inference layer enforces tool_choice="required".Because conversational actions like say and end_turn are registered as literal tools, this forces the model into a strict JSON-only mode. The model cannot reply with raw conversational text; if it wants to speak, it must emit a say tool call. If it wants to act, it emits a work tool call. This structural constraint guarantees that the model must decide exactly what actions to take simultaneously, drastically reducing stalling and “teased” deliverables.
The gate decides whether the agent loop may end for this user turn. It works on all user-visible text for the turn — the final assistant reply plus anything already sent via say() or intermediate delivery — so a hollow mid-turn message cannot “count” while the final slot stays empty.Work tools vs agency tools: Only work tools count for grounding (filesystem, shell, code, web, MCP, etc.). Agency tools (think, say, wait, end_turn) do not. If work tools ran or the user explicitly demanded tool grounding, a small reflex judge (DONE: / CONTINUE:) checks that the visible reply actually reflects tool results and is not a thin acknowledgement.No work tools: If the turn used only agency tools (or plain text) and no work tool completed successfully, a second reflex judge — action adequacy — decides whether the user asked for tangible work (code, files, commands, live data, etc.) that was only promised or hand-waved rather than delivered. That check is intent-based (any language or tone), not a list of English catch-phrases, so the loop can continue with a nudge to use write_file, run_command, and the like when appropriate. This check also runs when the model explicitly calls end_turn without having used work tools — preventing premature turn endings where the model teased a deliverable via say() without following through.Heuristics still catch obvious cases (empty reply, progress-only chatter, token-limit stalls) before the judges run.
On Telegram and Discord, user-facing messages are delivered via say() during tool rounds. Text content alongside tool calls is treated as internal reasoning and is not forwarded to the user. When the model has communicated entirely through say() and no work tools were used, the final reply text is suppressed to prevent redundant echo messages.
Tool continuation rounds are clamped to [0, 16]. Total agent steps cap at max(6, min(25, 6 + rounds)). If the model hits length limits 3 consecutive times, it’s told to use write_file for long output.

Context compaction

Long conversations exceed the model’s context window. The compaction system fires before inference when estimated tokens approach the limit — the model never hits the ceiling.

When it triggers

Compaction fires when estimated tokens exceed max_context_tokens * compaction_threshold_ratio (default 75%). Up to compaction_max_passes (default 3) rounds run until the context fits.

The four phases

1

Memory flush (first pass only)

An LLM reviews middle turns and extracts durable facts into knowledge.md as JSON {title, body} objects. Skips locked sections, deduplicates against existing titles. Best-effort — if it fails, compaction proceeds.
2

Prune old tool results

No LLM call. Tool outputs older than the protected tail are replaced with [Old tool output cleared to save context space]. Tool results are the biggest token consumers and are usually redundant once interpreted.
3

Find boundaries

The conversation splits into three regions: a protected head (first 2 messages), a middle (summarized and removed), and a protected tail (by token budget, minimum 12 messages). Boundaries align to avoid splitting tool_call / tool_result groups.
4

Structured summary

The middle turns are summarized with a fixed template: Goal, Constraints, Progress, Decisions, Emotional Context, Critical Context, Next Steps. On re-compression, the previous summary is updated rather than rewritten — information accumulates across compactions.
Token estimation uses a character-based heuristic: len(text) // 4 + 10 per message. Fast, dependency-free, slightly over-estimates for English — compaction triggers early rather than late. No external tokenizer required.

Configuration

cognition:
  max_context_tokens: 32768
  history_compression:
    enabled: true
    compaction_threshold_ratio: 0.75
    compaction_target_ratio: 0.20
    compaction_protect_last_n: 12
    compaction_protect_first_n: 2
    compaction_max_passes: 3
    compaction_flush_to_knowledge: true