Operational checklist
| Area | Track this |
|---|---|
| Freshness | Ingest accepted-to-done latency |
| Retrieval | Query and context latency, weak-evidence rates |
| Learning loop | Reflection volume, outcome recording coverage, surfaced strategies |
| Memory quality | memory_health results, contradictions, stale entries |
| Compaction safety | Checkpoint cadence and checkpoint failures |
| Coordination | Handoff and feedback visibility across agents |
Consistency model
- Keep deterministic run_id/session_id mapping across writes and reads.
- Use getContext rather than reconstructing large prompts manually.
- Treat checkpoints as explicit lifecycle boundaries.
- Use diagnose and memory_health before changing retrieval prompts or weights.
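
One way to keep the run_id/session_id mapping deterministic is to derive the run_id from the session_id instead of generating it at random, so the write path and the read path always compute the same value. The sketch below is only an illustration under that assumption: `RUN_NAMESPACE` and `run_id_for` are hypothetical names, not part of MuBit, and MuBit may accept any string as a run_id.

```python
import uuid

# Hypothetical fixed namespace (an assumption for this sketch). Any constant UUID
# works, as long as every writer and reader uses the same one.
RUN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "runs.example.invalid")


def run_id_for(session_id: str, agent: str = "default") -> str:
    """Derive a stable run_id from a session_id and an agent name.

    uuid5 is a name-based UUID, so the same inputs always give the same output.
    """
    return str(uuid.uuid5(RUN_NAMESPACE, f"{agent}:{session_id}"))


# The write side and the read side compute the id independently and still agree.
write_side = run_id_for("session-42", agent="support")
read_side = run_id_for("session-42", agent="support")
assert write_side == read_side
```

Because the id is a pure function of its inputs, no lookup table is needed to find the run that a given session wrote to.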
LLM telemetry
MuBit tracks all internal LLM calls (ingestion routing, query synthesis, reflection, snapshots) and exposes them as Prometheus metrics at the /metrics endpoint.
| Metric | Labels | Description |
|---|---|---|
| mubit_llm_calls_total | task, provider, model, success | Total LLM call count by task and outcome |
| mubit_llm_call_duration_seconds | task, provider | Call latency histogram (buckets: 0.1s–30s) |
| mubit_llm_tokens_total | task, provider, token_type | Token consumption (prompt vs completion) |
| mubit_llm_retries_total | task, provider | Retry attempts due to rate limits or errors |
| mubit_agent_degraded_total | agent, reason | Agent fallbacks to heuristic mode |
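
As an illustration of how these counters might be consumed outside a Prometheus server, the sketch below parses text in the Prometheus exposition format and computes a failed-call ratio per task from mubit_llm_calls_total. It rests on several assumptions: the success label is taken to carry the string "true" for successful calls, the task, provider, and model values in the sample are made up, and `llm_failure_ratio` is a hypothetical helper rather than anything MuBit ships.

```python
import re
from collections import defaultdict

# Matches one sample line: name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def llm_failure_ratio(metrics_text: str) -> dict[str, float]:
    """Return failed calls / total calls per task from mubit_llm_calls_total."""
    totals: dict[str, float] = defaultdict(float)
    failures: dict[str, float] = defaultdict(float)
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != "mubit_llm_calls_total":
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        value = float(match.group("value"))
        task = labels.get("task", "unknown")
        totals[task] += value
        # Assumption: success="true" marks a successful call; anything else counts as failed.
        if labels.get("success") != "true":
            failures[task] += value
    return {task: failures[task] / totals[task] for task in totals if totals[task] > 0}


# Made-up sample in the shape a /metrics response could take.
sample = """\
# TYPE mubit_llm_calls_total counter
mubit_llm_calls_total{task="reflection",provider="p1",model="m1",success="true"} 90
mubit_llm_calls_total{task="reflection",provider="p1",model="m1",success="false"} 10
mubit_llm_calls_total{task="query_synthesis",provider="p1",model="m1",success="true"} 5
"""
ratios = llm_failure_ratio(sample)
```

Counters are cumulative, so a ratio computed from a single scrape covers the whole process lifetime. For a recent window, take the difference between two scrapes first.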
| Metric | Description |
|---|---|
| mubit_storage_compaction_pending | Pending compaction work (0 = healthy) |
| mubit_storage_write_stall | Whether writes are being throttled (0 or 1) |
| mubit_disk_total_bytes | Total disk capacity |
| mubit_disk_used_bytes | Disk space used |
| mubit_disk_usage_pct | Disk usage percentage |
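
The storage gauges above lend themselves to a simple health check. The following is a minimal sketch, assuming the values have already been scraped and that mubit_disk_usage_pct is reported on a 0 to 100 scale. `StorageSnapshot`, `storage_warnings`, and the 85 percent threshold are hypothetical choices for illustration, not MuBit defaults.

```python
from dataclasses import dataclass


@dataclass
class StorageSnapshot:
    """One scrape of the storage gauges (field names mirror the metric names)."""
    compaction_pending: float  # mubit_storage_compaction_pending
    write_stall: float         # mubit_storage_write_stall
    disk_usage_pct: float      # mubit_disk_usage_pct, assumed to be 0-100


def storage_warnings(snapshot: StorageSnapshot, disk_pct_limit: float = 85.0) -> list[str]:
    """Return human-readable warnings; an empty list means nothing looks wrong."""
    warnings: list[str] = []
    if snapshot.compaction_pending > 0:
        warnings.append("compaction work is pending")
    if snapshot.write_stall >= 1:
        warnings.append("writes are being throttled")
    if snapshot.disk_usage_pct >= disk_pct_limit:
        warnings.append(f"disk usage is at or above {disk_pct_limit:g}%")
    return warnings


healthy = storage_warnings(StorageSnapshot(0, 0, 40.0))
degraded = storage_warnings(StorageSnapshot(3, 1, 91.5))
```

Pending compaction on a single scrape may be transient, so an alert would usually require the condition to hold across several consecutive scrapes.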
Failure modes and troubleshooting
| Symptom | Root cause | Fix |
|---|---|---|
| Learning appears inactive | Reflections exist but outcomes are never recorded | Record outcomes against reflected lessons/rules |
| Important details disappear after long runs | No checkpoint before compaction | Save checkpoints before summarization or window resets |
| Debugging memory quality is slow | No memory diagnostics in the workflow | Add memory_health and diagnose to incident review |
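
The second row of the table (details lost because no checkpoint preceded compaction) can be guarded against by making the checkpoint a precondition of compaction. The sketch below shows the ordering only. `save_checkpoint` and `summarize` are hypothetical callables supplied by the caller, standing in for whatever checkpoint and summarization calls your integration actually uses.

```python
from typing import Callable


def compact_with_checkpoint(
    messages: list[str],
    save_checkpoint: Callable[[list[str]], None],
    summarize: Callable[[list[str]], str],
    keep_last: int = 4,
) -> list[str]:
    """Checkpoint the full history, then replace the older part with a summary."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(messages) <= keep_last:
        return messages  # nothing to compact, so no checkpoint is needed

    # Checkpoint first: if this raises, the history is left untouched.
    save_checkpoint(messages)

    head, tail = messages[:-keep_last], messages[-keep_last:]
    return [summarize(head)] + tail


# Usage with in-memory stand-ins for the two callables.
saved: list[list[str]] = []
history = [f"turn {i}" for i in range(1, 8)]
compacted = compact_with_checkpoint(
    history,
    save_checkpoint=lambda msgs: saved.append(list(msgs)),
    summarize=lambda msgs: f"summary of {len(msgs)} turns",
    keep_last=3,
)
```

Because the checkpoint runs before any message is dropped, a failed checkpoint aborts compaction instead of silently losing context.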
Next steps
- Review route-level contracts at Control HTTP reference.
- Apply the learning loop at Support agent memory loop.