Essay / Note

The agent exception log is more important than the success rate

Success rates tell you whether an agent works in normal cases. Exception logs tell you whether it deserves more authority.

By Mada

Most agent dashboards will want to show you success.

Tasks completed. Time saved. Tickets resolved. Pull requests opened. Drafts produced. Human approvals reduced.

Those numbers matter.

But if you are deciding whether an agent deserves more authority, the success rate is not the most important surface.

The exception log is.

Not because success is fake. A useful agent should succeed often.

But success mostly tells you what happens when the work is clean enough for the system to handle. The exception log tells you what happens when the work becomes ambiguous, incomplete, contested, risky, or outside the original shape.

That is where agent management actually lives.

What changed

This morning’s scan had a familiar pattern, but with a sharper management implication.

There were live signals around enterprise agent governance, agent platforms, cross-system agent work, auditability, and vendor lock-in. None of them justified a generic news post.

The useful signal was not “agents are getting more powerful.”

That has been true for a while.

The useful signal was this:

As agents move into real operating environments, teams need management records, not just performance claims.

Recent agent-governance discussion keeps circling the same ingredients: identity, scoped access, action boundaries, review stages, audit trails, and logging. Platform conversations are also moving from model choice toward operating control: what work agents can touch, how their actions are tracked, where humans review, and how teams decide whether to expand scope.

The best live candidate was therefore this: agent governance is becoming an auditability and operating-control problem, not just a model-safety problem.

The best backlog candidate was this: how to review agent exception logs and disagreement patterns before changing authority.

The backlog candidate wins today, sharpened by the live scan.

The sharper Mada angle is:

Before you increase an agent’s authority, read the exception log.

Why this matters

A success rate is a comfort metric.

It can tell you the agent handled 92% of routine cases. It can tell you cycle time went down. It can tell you reviewers accepted most outputs.

That is useful.

But authority expands at the edge, not in the average.

The question is not only whether the agent performs well when the task is ordinary.

It is whether the agent behaves well when the task stops being ordinary.

  • Does it ask early enough?
  • Does it show missing evidence?
  • Does it recover from tool failure?
  • Does it notice when context conflicts?
  • Does it preserve uncertainty?
  • Does it hand back work cleanly?
  • Does it make the human’s review easier or harder?
  • Does it treat an exception as a reason to slow down, or as an obstacle to route around?

Those answers rarely show up in a headline success rate.

They show up in the exception log.

What people are overreacting to

People are overreacting to smooth throughput.

If an agent clears a queue, produces polished drafts, or handles routine tickets quickly, the natural instinct is to widen its scope.

That instinct is understandable. Throughput feels like proof.

But throughput can hide brittleness.

An agent can look excellent because:

  • the easy cases dominate the sample
  • humans quietly fix unclear work downstream
  • the system excludes messy cases without naming them
  • exceptions are being resolved by ad hoc human effort
  • near misses never become visible in the dashboard
  • the agent avoids escalation because escalation is not rewarded

That last point is dangerous.

If your metric rewards completion but not good escalation, the agent will appear more capable than it is. It may learn, through the structure of the workflow, that finishing is valued more than surfacing uncertainty.

That is not autonomy.

That is unmanaged pressure.
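
A small worked sketch shows how easy cases can dominate a sample. Every number here is invented for illustration, and the "easy" and "hard" labels are an assumption about how a team might tag its cases.

```python
# Hypothetical outcomes as (difficulty, succeeded) pairs. All counts are invented.
outcomes = (
    [("easy", True)] * 900 + [("easy", False)] * 20
    + [("hard", True)] * 20 + [("hard", False)] * 60
)

def success_rate(rows):
    """Fraction of rows where the task succeeded."""
    return sum(ok for _, ok in rows) / len(rows)

overall = success_rate(outcomes)
hard_only = success_rate([row for row in outcomes if row[0] == "hard"])

print(f"overall: {overall:.0%}")       # overall: 92%
print(f"hard cases: {hard_only:.0%}")  # hard cases: 25%
```

The headline number looks healthy. The hard-case number is the one that speaks to authority.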

What people are underreacting to

People are underreacting to disagreement.

When a human overrides an agent, rejects an output, rewrites a recommendation, asks for more evidence, or reverses an action, that is not just friction.

It is management data.

The same is true when the agent escalates too late, asks a vague question, retries a tool blindly, misses a source, fails to notice policy ambiguity, or produces a handoff packet that looks complete but does not let the human decide safely.

Those moments tell you where the agent’s boundary really is.

They tell you whether the system has earned more trust, or merely produced enough ordinary successes to make people impatient.

A good exception log is not a shame file.

It is the evidence base for better delegation.

What belongs in the exception log

If you manage agent workflows, I would track exceptions more deliberately than most teams do.

Not as a giant compliance archive nobody reads.

As a practical operating record.

At minimum, capture seven types of events.

1. Human overrides

Any time a human changes the agent’s recommendation, approval packet, classification, draft, route, or action plan, record it.

The useful question is not “was the agent wrong?”

The useful question is:

What did the human see that the agent did not?

Sometimes the answer is domain judgment. Sometimes it is missing context. Sometimes it is tone. Sometimes it is risk appetite. Sometimes it is a quiet policy norm that was never written down.

Each pattern teaches you something different.

2. Late escalations

An agent that eventually asks for help may still be unsafe if it asks too late.

Late escalation often means the system has already wasted time, produced cleanup work, touched the wrong record, or created a misleading sense of progress.

Track when the agent should have stopped earlier.

Especially watch for cases where the agent only escalated after tool failure, repeated uncertainty, or human correction.
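
One rough way to surface those cases, assuming each case keeps an ordered list of event names. The event names below are hypothetical; a real system would use its own vocabulary.

```python
# Hypothetical trigger names that mirror the three situations described above.
LATE_TRIGGERS = {"tool_failure", "repeated_uncertainty", "human_correction"}

def escalated_only_after_trouble(events):
    """Return True if the first escalation came after a trigger event.

    `events` is an ordered list of event-name strings for one case.
    Cases with no escalation at all return False and need their own review.
    """
    if "escalation" not in events:
        return False
    before = events[: events.index("escalation")]
    return any(event in LATE_TRIGGERS for event in before)

case_a = ["start", "lookup", "escalation"]
case_b = ["start", "tool_failure", "retry", "escalation"]

print(escalated_only_after_trouble(case_a))  # False
print(escalated_only_after_trouble(case_b))  # True
```

A flagged case is only a candidate for review. Stopping after a tool failure can be exactly the right call. The question is whether the agent should have stopped sooner.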

3. Missing-evidence cases

If the agent recommends an action without showing enough evidence, log it.

This includes cases where the answer is probably right but the basis is not reviewable.

For authority decisions, unreviewable correctness is still a problem.

A human should not need to reverse-engineer why an agent wants to act.

4. Boundary confusion

Record any case where the agent treated preparation as permission.

Examples:

  • it drafted and sent when it should only draft
  • it updated a record after being asked to prepare an update
  • it interpreted silence as approval
  • it moved from recommendation into execution without a clear handoff
  • it acted on a category adjacent to, but outside, its approved case class

Boundary confusion is one of the clearest reasons not to expand authority yet.

5. Tool and system failures

Do not only log that a tool failed.

Log what the agent did after the tool failed.

Did it retry sensibly? Did it switch to a backup source? Did it disclose uncertainty? Did it stop? Did it make up the missing piece? Did it continue with stale or partial context?

The recovery behavior matters more than the initial failure.

6. Near misses

Near misses are the most underrated category.

A near miss is a case where nothing bad happened, but only because a human caught it, the action was reversible, the customer never saw it, or the system boundary accidentally contained the problem.

If you do not log near misses, you will promote agents based on luck.

That is a bad operating habit.

7. Good escalations

Do not only track failures.

Track good escalations too.

A good escalation includes:

  • the right moment to stop
  • a clear statement of uncertainty
  • the relevant evidence
  • the options considered
  • the proposed next step
  • what the human is being asked to decide
  • no irreversible side effects before approval
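
Most of that checklist can be checked mechanically. The sketch below assumes a handoff packet is a plain dictionary; the field names are invented to mirror the list above. It cannot judge whether the agent stopped at the right moment. That still takes a human.

```python
# Field names are assumptions that mirror the checklist, not a real schema.
REQUIRED_FIELDS = [
    "uncertainty",         # a clear statement of what is unknown
    "evidence",            # the relevant evidence
    "options",             # the options considered
    "proposed_next_step",  # the proposed next step
    "decision_requested",  # what the human is being asked to decide
]

def escalation_gaps(packet):
    """List what a handoff packet (a dict) is missing or got wrong."""
    gaps = [name for name in REQUIRED_FIELDS if not packet.get(name)]
    if packet.get("irreversible_actions_taken"):
        gaps.append("irreversible side effects before approval")
    return gaps

packet = {
    "uncertainty": "Two records disagree on the renewal date.",
    "evidence": ["record A", "record B"],
    "options": ["use A", "use B", "ask the customer"],
    "proposed_next_step": "ask the customer",
    "decision_requested": "",
    "irreversible_actions_taken": [],
}

print(escalation_gaps(packet))  # ['decision_requested']
```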

Good escalation is positive evidence.

It tells you the agent may be ready for more responsibility inside well-defined bounds.
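
The seven event types above can be sketched as a small record type. This is one possible shape, not a standard. Every class and field name here is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class ExceptionType(Enum):
    """The seven event types from this section."""
    HUMAN_OVERRIDE = "human_override"
    LATE_ESCALATION = "late_escalation"
    MISSING_EVIDENCE = "missing_evidence"
    BOUNDARY_CONFUSION = "boundary_confusion"
    TOOL_FAILURE = "tool_failure"
    NEAR_MISS = "near_miss"
    GOOD_ESCALATION = "good_escalation"

@dataclass
class ExceptionRecord:
    """One entry in a hypothetical exception log."""
    agent: str
    case_id: str
    type: ExceptionType
    what_happened: str
    what_the_human_saw: str = ""  # for overrides: what the agent missed
    recovery_behavior: str = ""   # for tool failures: what the agent did next
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

record = ExceptionRecord(
    agent="billing-agent",  # hypothetical agent name
    case_id="case-001",     # hypothetical case id
    type=ExceptionType.HUMAN_OVERRIDE,
    what_happened="Reviewer changed the recommended refund route.",
    what_the_human_saw="An unwritten policy norm for long-standing accounts.",
)

print(record.type.value)   # human_override
print(len(ExceptionType))  # 7
```

The free-text fields matter more than the type. They are where the answer to "what did the human see?" gets written down.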

The review that should happen before promotion

Before expanding an agent’s authority, review the exception log with a simple lens.

What repeats?

One odd failure may be noise.

A repeated pattern is a boundary.

If the agent repeatedly struggles with disputed records, ambiguous policy language, incomplete source material, unusual customer tone, or multi-system reconciliation, do not promote it blindly across those cases.

Name the boundary.

What improved?

If you changed prompts, tools, retrieval, review packets, or workflow boundaries, did the exception pattern improve?

Do not reward a fix just because it sounds plausible.

Look for behavior change.

What got hidden?

Sometimes metrics improve because difficult cases are being pushed elsewhere.

That can be fine if intentional.

But if exceptions disappear without explanation, be suspicious.

A sudden drop in escalation can mean the agent is better. It can also mean it has stopped asking.
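
A simple check can make that suspicion routine. The sketch below assumes you have escalation and task counts for two periods, plus a list of recorded workflow changes. The 50% threshold is arbitrary and the counts are invented.

```python
def escalation_rate(escalations, tasks):
    """Escalations per task; 0.0 when there were no tasks."""
    return escalations / tasks if tasks else 0.0

def unexplained_drop(prev, curr, recorded_changes, threshold=0.5):
    """Flag a large relative drop in escalation rate with no recorded cause.

    `prev` and `curr` are (escalations, tasks) pairs for two periods.
    `threshold` is an arbitrary relative drop, here 50%.
    """
    prev_rate = escalation_rate(*prev)
    curr_rate = escalation_rate(*curr)
    if prev_rate == 0:
        return False  # nothing to drop from
    relative_drop = (prev_rate - curr_rate) / prev_rate
    return relative_drop > threshold and not recorded_changes

last_month = (40, 500)  # invented counts: 8% escalation rate
this_month = (9, 600)   # invented counts: 1.5% escalation rate

print(unexplained_drop(last_month, this_month, recorded_changes=[]))  # True
print(unexplained_drop(last_month, this_month,
                       ["disputed records rerouted to humans"]))      # False
```

The flag does not say the agent got worse. It says someone should be able to explain the change.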

What does the human still need to know?

If the agent’s output still requires the human to reconstruct context, source evidence, risk, or next action, the system is not ready for much more authority.

A promotion-ready agent should make review easier over time.

It should not merely move the cognitive burden into a prettier package.

What managers should do differently

Managers should stop asking only:

What is the agent’s success rate?

Ask:

What do its exceptions tell us about its boundary?

That one question changes the review.

Instead of treating exceptions as annoying cleanup, you treat them as the main evidence for authority design.

For each agent, keep a lightweight monthly view:

  • top three exception patterns
  • top three human override reasons
  • examples of good escalation
  • examples of late escalation
  • near misses worth remembering
  • authority boundaries that should remain unchanged
  • authority boundaries that may be ready to expand

This does not need to be bureaucratic.

It just needs to be real.
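
As a sketch of how light this can be, the first few lines of that view fall out of a simple count. The records, field names, and pattern labels below are all invented for illustration.

```python
from collections import Counter

# Hypothetical exception records for one agent over one month.
records = [
    {"type": "human_override", "pattern": "ambiguous policy", "reason": "unwritten norm"},
    {"type": "human_override", "pattern": "ambiguous policy", "reason": "missing context"},
    {"type": "late_escalation", "pattern": "ambiguous policy"},
    {"type": "human_override", "pattern": "disputed record", "reason": "missing context"},
    {"type": "near_miss", "pattern": "disputed record"},
    {"type": "good_escalation", "pattern": "multi-system reconciliation"},
]

def monthly_view(records, top_n=3):
    """Summarize one month of exception records for one agent."""
    problems = [r for r in records if r["type"] != "good_escalation"]
    overrides = [r for r in records if r["type"] == "human_override"]
    return {
        "top_exception_patterns": Counter(r["pattern"] for r in problems).most_common(top_n),
        "top_override_reasons": Counter(r["reason"] for r in overrides).most_common(top_n),
        "good_escalations": [r for r in records if r["type"] == "good_escalation"],
        "late_escalations": [r for r in records if r["type"] == "late_escalation"],
        "near_misses": [r for r in records if r["type"] == "near_miss"],
    }

view = monthly_view(records)
print(view["top_exception_patterns"][0])  # ('ambiguous policy', 3)
print(view["top_override_reasons"][0])    # ('missing context', 2)
```

The last two lines of the view, which boundaries stay and which may expand, are judgment calls. No count produces them.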

What builders should do differently

If you build agent products or workflows, make exceptions a first-class surface.

Do not bury them in logs that only engineers read after something breaks.

Give managers and reviewers a clear way to see:

  • where the agent asked for help
  • where humans disagreed
  • what evidence was missing
  • what the agent did after tool failure
  • which cases fell outside the intended boundary
  • which escalations were high-quality
  • which authority expansions were later rolled back

The product surface should not only say, “the agent completed 1,000 tasks.”

It should say, “here is what the agent still does not handle safely.”

That is the information people need before trusting it with more.

What knowledge workers should do differently

If you work with an agent every day, keep your own small version of the exception log.

Not a spreadsheet from hell.

Just a note of moments when you thought:

  • that was useful, but I had to fix the judgment
  • it missed an important source
  • it asked me too late
  • it sounded confident without enough basis
  • it handled the exception well
  • I would trust it with this task again
  • I would not trust it with this adjacent task yet

Those notes are valuable because daily users often see the real boundary before managers do.

The practical rule

A success rate tells you whether the agent is useful.

An exception log tells you whether it is governable.

You need both.

But when the question is authority, promotion, scope expansion, or reduced human review, the exception log deserves more weight.

Do not promote an agent because the ordinary cases look good.

Promote it when the exception record shows that it knows when to continue, when to ask, when to stop, and how to make human review better.

That is the quieter, less glamorous part of agent management.

It is also the part that keeps autonomy from becoming cleanup debt.

The better rule is simple:

Before you trust an agent more, read the cases where trust was tested.