Prompt Optimization Lifecycle

Capture outcomes → optimize → diff → activate. A human-in-the-loop workflow for evolving agent prompts from real execution data.

Prompts drift. An agent that was well-tuned two weeks ago now misses cases, misroutes escalations, or over-hedges. The MuBit control plane ships an optimization loop that uses recorded outcomes to propose a better prompt, a diff view to review it, and a one-click approval to activate it — without touching deployed SDK code.

This recipe shows the end-to-end flow. Where an SDK step has a console equivalent, it is noted inline: use the console when you want human-in-the-loop review and the SDK when you want to automate or schedule. Both paths call the same control-plane endpoints and produce identical PromptVersion rows.

The loop at a glance

Run agents → Record outcomes → Optimize → Review diff → Activate
                                     ↑                        │
                                     └────── (next cycle) ────┘

Apart from the optional shadow test, every step is a single control-plane call. You can wire the loop into CI or a cron job, or trigger it manually from the console's Agent Card.

1. Record outcomes while agents run

Every interaction that ends with a judgeable result should call record_outcome (run-level) or record_step_outcome (per-step, for dense feedback). This is the signal the optimizer reads.

client.record_outcome(
    session_id=run_id,                  # falls back to the client's run_id if omitted
    reference_id=evidence_id,          # the specific fact / lesson / archive block the outcome is about
    outcome="success",                  # "success" | "failure" | "partial" | "neutral"
    signal=0.8,                         # -1.0..1.0
    rationale="Customer confirmed the refund was processed correctly",
    agent_id="triage",
)

For multi-step agents, also record per-step signal:

client.record_step_outcome(
    run_id=run_id,
    step_id="2026-04-17T09-12-route",
    step_name="routing",
    outcome="partial",
    signal=0.3,
    rationale="Routed to billing but should have gone to compliance",
    directive_hint="Check billing AND compliance scopes before routing",
    agent_id="triage",
)

💡Tip

The optimizer weighs failures (signal < 0) and the rationale / directive_hint fields heavily. Invest in writing short, specific rationales — they become the material the synthesised candidate is built from.
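Because the optimizer leans on these fields, it can help to validate them before the call. The sketch below is not part of the MuBit SDK: build_outcome and the 10-character minimum are hypothetical, and it only checks the value ranges documented in the snippets above. Note that directive_hint is shown above only for record_step_outcome.

```python
VALID_OUTCOMES = {"success", "failure", "partial", "neutral"}


def build_outcome(outcome, signal, rationale, directive_hint=None):
    """Validate outcome fields and return them as keyword arguments.

    Hypothetical helper: the returned dict is meant to be merged into a
    record_outcome / record_step_outcome call, e.g.
    client.record_step_outcome(run_id=..., step_id=..., **payload).
    """
    if outcome not in VALID_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    if not -1.0 <= signal <= 1.0:
        raise ValueError(f"signal out of range [-1.0, 1.0]: {signal}")
    rationale = rationale.strip()
    if len(rationale) < 10:  # assumed threshold; tune to taste
        raise ValueError("rationale is too short to be useful to the optimizer")
    payload = {"outcome": outcome, "signal": float(signal), "rationale": rationale}
    if directive_hint:
        payload["directive_hint"] = directive_hint.strip()
    return payload
```

Rejecting empty or one-word rationales at the call site is cheaper than discovering later that a candidate was synthesised from uninformative feedback.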

ℹ️Note

Console equivalent: outcomes are recorded from your agent code, not the console — the console reads them back under Agents → your agent → Runs (/app/projects/<pid>/agents/<aid>/runs). Even when you drive optimization entirely from the UI, the record_outcome / record_step_outcome call in your agent loop is still the signal source.

2. Trigger an optimization

When you have enough outcomes to form an opinion (empirically: ~10–20 outcomes with at least a few negatives), ask the control plane to propose a candidate.

resp = client.optimize_prompt(
    agent_id="triage",
    project_id=project_id,
)

candidate = resp["candidate"]
print(resp["optimization_summary"])   # human-readable rationale
print(resp["confidence"])              # 0..1
print(resp["activated"])               # False by default — human review first
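The rule of thumb above (roughly 10 to 20 outcomes, including a few negatives) can be encoded as a pre-check that runs before optimize_prompt. This is a minimal sketch under stated assumptions: ready_to_optimize is a hypothetical helper, "a few negatives" is read as three, and the list of signals is assumed to come from whatever your agent code already recorded locally.

```python
def ready_to_optimize(signals, min_outcomes=10, min_negatives=3):
    """Return True when there is enough outcome data to justify an optimization run.

    signals: iterable of floats in -1.0..1.0, one per recorded outcome.
    min_outcomes / min_negatives: assumed thresholds, not SDK defaults.
    """
    signals = list(signals)
    negatives = sum(1 for s in signals if s < 0)
    return len(signals) >= min_outcomes and negatives >= min_negatives
```

Skipping the call when this returns False avoids candidates built from too little evidence.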

ℹ️Note

Steering the synthesis model: you can override which model writes the candidate via the llm field (an LlmOverride), but only over the gRPC transport — the HTTP optimize endpoint (the SDK's default transport) ignores any override and uses the instance's default optimizer model, exactly like the console. To pass an override, construct a gRPC client and supply llm:

client = mubit.Client(transport="grpc")   # override is dropped on the default HTTP transport
resp = client.optimize_prompt(
    agent_id="triage",
    project_id=project_id,
    llm={
        "provider": "anthropic",
        "model": "claude-sonnet-4-6",
        "temperature": 0.2,
    },
)

The response includes:

candidate — a new PromptVersion row with status="candidate" and source="optimization".
optimization_summary — what the optimizer changed and why.
confidence — the optimizer's self-reported confidence.
activated — whether the candidate was auto-activated (default: false).

ℹ️Note

Console equivalent: open the agent's Prompts tab (/app/projects/<pid>/agents/<aid>/prompts) and click Suggest Optimization on the Active System Prompt card. A new row appears in the Version History table with status: candidate and source: optimization, auto-expanded to show the candidate prompt, and a pending-candidate banner appears at the top of the page. The console uses the instance's default optimizer model — as does the SDK over its default HTTP transport. To pick a different synthesis model, use the gRPC transport with an llm override (see above).

3. Review the diff

Never promote a candidate blind. Fetch the diff against the currently active version:

active = client.get_prompt(agent_id="triage")
diff = client.get_prompt_diff(
    agent_id="triage",
    version_a_id=active["version"]["version_id"],
    version_b_id=candidate["version_id"],
)
print(diff["diff_text"])   # unified diff format

ℹ️Note

Console equivalent: click Review on the pending-candidate banner, or Compare in the Version History row. That opens /app/projects/<pid>/agents/<aid>/compare/<vid> with the same diff_text rendered in a split view, the optimization_summary in a muted caption above the diff, and an Approve & Activate button at the top.

What to check:

Does the summary match the diff? If the summary says "tightened escalation criteria" but the diff rewrites the tone, the optimizer hallucinated.
Are edits localized? Small, targeted edits ship safely. A full rewrite needs a canary.
Does the outcome count justify the change? The optimizer can synthesize a confident-looking candidate from 3 outcomes. Wait for more data.
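
The "are edits localized?" check lends itself to a rough automated pre-screen. The sketch below is not part of the MuBit SDK: diff_stats, needs_canary and the 0.3 ratio are hypothetical, and it assumes diff_text is an ordinary single-file unified diff like the one fetched above.

```python
def diff_stats(diff_text):
    """Count added lines, removed lines and hunks in a single-file unified diff."""
    added = removed = hunks = 0
    in_hunk = False  # the ---/+++ file headers appear before the first hunk
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            hunks += 1
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            added += 1
        elif in_hunk and line.startswith("-"):
            removed += 1
    return {"added": added, "removed": removed, "hunks": hunks}


def needs_canary(diff_text, active_line_count, max_changed_ratio=0.3):
    """Flag a candidate whose edits touch a large share of the active prompt.

    max_changed_ratio is an assumed threshold; pick one that fits your prompts.
    """
    stats = diff_stats(diff_text)
    changed = max(stats["added"], stats["removed"])
    return changed / max(active_line_count, 1) > max_changed_ratio
```

A pre-screen like this only sizes the edit. Whether the summary matches the diff still needs a human reading both.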

4. Shadow test (optional but recommended)

Before activating, run the candidate side-by-side with the active prompt on a known replay set. Use branching for reversibility:

# Snapshot current run so we can compare before / after
checkpoint = client.checkpoint(run_id=run_id, label="pre-candidate-evaluation")

# Run replay traffic. Capture outcomes for both branches.
# (Your replay harness, not shown.)

Or, for a controlled canary, activate the candidate for a fraction of traffic by routing some runs to a duplicated agent with agent_id="triage-canary" whose prompt is the candidate.
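
One way to split traffic for such a canary is deterministic hashing on the run ID, so a given run always lands on the same agent. This is a sketch under stated assumptions: pick_agent_id and the 10% default are hypothetical, and the routing happens in your own dispatch code, not in the control plane.

```python
import hashlib


def pick_agent_id(run_id, canary_fraction=0.1, base="triage", canary="triage-canary"):
    """Route a stable fraction of runs to the canary agent.

    The run ID is hashed to a number in [0, 1); runs whose bucket falls
    below canary_fraction go to the canary, the rest to the base agent.
    """
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return canary if bucket < canary_fraction else base
```

Pass the returned value as agent_id when recording outcomes so that the two prompts accumulate separate outcome histories.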

5. Activate the winner

Once you're satisfied, promote the candidate:

client.activate_prompt_version(
    agent_id="triage",
    version_id=candidate["version_id"],
)

Activation is atomic — in-flight runs continue with the old prompt; new runs see the new one. The previously active version transitions to retired and remains available for rollback.

ℹ️Note

Console equivalent: click Approve & Activate on the compare page, or Approve on the pending-candidate banner in the Prompts tab. The console flips the status badges, retires the prior active version, and returns you to the Prompts tab — no further confirmation step.

6. Rollback if something breaks

If the new prompt regresses, every prior version is still addressable. List versions, pick one, and reactivate:

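versions = client.list_prompt_versions(agent_id="triage")
# Take the first retired version that was not itself created by a rollback.
prior_active = next(
    (v for v in versions["versions"]
     if v["status"] == "retired" and v["source"] != "rollback"),
    None,  # avoid StopIteration when nothing matches
)
if prior_active is None:
    raise RuntimeError("No retired prompt version available for rollback")
client.activate_prompt_version(
    agent_id="triage",
    version_id=prior_active["version_id"],
)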

The newly activated version takes source="rollback" so your audit log reflects intent.

ℹ️Note

Console equivalent: every retired version stays in Version History on the Prompts tab. Click Compare on a retired row to confirm the diff, then Approve & Activate. The activation is recorded with source: rollback just like the SDK path.

Skill optimization

Exactly the same loop works for skills — swap the method names:

optimize_skill(project_id, skill_id) — like optimize_prompt, an llm override only applies over the gRPC transport
list_skill_versions(skill_id)
get_skill_diff(skill_id, version_a_id, version_b_id)
activate_skill_version(skill_id, version_id)

Skills include both parameters_schema and instructions in the diff, so review both sections of the unified diff.

ℹ️Note

Console equivalent: open a project's Skills tab → pick a skill (/app/projects/<pid>/skills/<sid>). The Active Definition card has separate editable fields for Description, Parameters Schema, and Instructions. Suggest Optimization creates a candidate; the compare page at .../compare/<vid> renders a unified diff across all three fields; Approve & Activate promotes it.

Automating the loop

A common pattern: run the optimize step nightly, but never auto-activate. Post the diff into a Slack channel or create a ticket for a human to approve.

# Cron: nightly per agent
import os

import mubit

# Assumption: the project ID is supplied via an environment variable;
# the variable name here is illustrative.
PROJECT = os.environ["MUBIT_PROJECT_ID"]

client = mubit.Client()

for agent_id in ("triage", "billing", "escalation"):
    resp = client.optimize_prompt(agent_id=agent_id, project_id=PROJECT)
    if resp["confidence"] < 0.6:
        continue                      # too speculative; skip
    candidate = resp["candidate"]
    active = client.get_prompt(agent_id=agent_id)
    diff = client.get_prompt_diff(
        agent_id=agent_id,
        version_a_id=active["version"]["version_id"],
        version_b_id=candidate["version_id"],
    )
    # notify_slack is your own notifier (not shown); nothing is activated here.
    notify_slack(agent_id, resp["optimization_summary"], diff["diff_text"])
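
The notify_slack helper is left to the reader. One possible shape, assuming a Slack incoming webhook whose URL lives in a SLACK_WEBHOOK_URL environment variable (both the variable name and the 3000-character cap are illustrative choices), is sketched below using only the standard library.

```python
import json
import os
import urllib.request

MAX_DIFF_CHARS = 3000  # assumed cap to keep the message readable


def build_slack_payload(agent_id, summary, diff_text, max_diff_chars=MAX_DIFF_CHARS):
    """Build the JSON body for a Slack incoming webhook message."""
    if len(diff_text) > max_diff_chars:
        diff_text = diff_text[:max_diff_chars] + "\n... (diff truncated)"
    fence = "`" * 3  # Slack renders triple backticks as a code block
    text = f"*Prompt candidate for `{agent_id}`*\n{summary}\n{fence}\n{diff_text}\n{fence}"
    return {"text": text}


def notify_slack(agent_id, summary, diff_text, webhook_url=None):
    """POST the candidate summary and diff to Slack; returns the HTTP status."""
    url = webhook_url or os.environ["SLACK_WEBHOOK_URL"]
    body = json.dumps(build_slack_payload(agent_id, summary, diff_text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the payload builder separate from the network call makes the message format easy to test without hitting Slack.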

Related pages

Projects, Agents, Skills, Prompts: the resource model behind the lifecycle.
Step-Level Outcomes: dense reward signal that feeds better optimizations.
Activity & Audit Trail: inspect what outcomes were available when the optimizer ran.