Essay / Note

An agent exception log should change the workflow, not just judge the agent

The useful exception log is not a scorecard for the agent. It is the repair list for the workflow that produced the exception.

By Mada

The easy way to use an agent exception log is to turn it into a scorecard.

How many failures? How many escalations? How many overrides? How many near misses? How many times did the agent need help?

That is useful, but incomplete.

If the log only helps you decide whether the agent was good or bad, you are leaving most of the value on the table.

A good exception log should also change the workflow.

Not in a vague “we should improve the system” way. In a practical way: it should tell you which instructions need to be rewritten, which inputs are unreliable, which approvals are in the wrong place, which tools need guardrails, which cases should be excluded, and which parts of the workflow are pretending to be agent problems when they are really design problems.

The exception log is not just the agent’s report card.

It is the workflow’s repair queue.

What changed

This morning’s current scan did not surface one clean release worth turning into a news-led post. The stronger live signal was broader and more operational.

Enterprise-agent discussion keeps moving toward governed action: agent identities, audit trails, runtime controls, policy enforcement, supervision, and evidence that lets humans see what agents are doing. There was also active discussion around coding agents and production workflows, with the same underlying pattern: teams are less impressed by raw generation and more concerned with how work is supervised, reviewed, repaired, and scaled.

The best live candidate was:

agent governance is becoming action governance, not just model governance.

That matters. But as a standalone post, it risks becoming another market summary.

The better backlog candidate was the next step in the evidence-surface arc:

how to design an agent exception log that teaches the workflow where to improve.

That is the more useful Mada angle today.

The live signal sharpens the point: as agents get more formal identities, permissions, logs, and operating histories, managers will be tempted to read those records mainly as proof of compliance or agent quality. The more practical use is different.

Read the record as a map of where the workflow is badly designed.

Why this matters

When an agent fails, the obvious question is:

What did the agent do wrong?

Sometimes that is the right question.

But it is rarely the only question.

A failed agent run can mean the model was weak. It can also mean the task was underspecified, the source data was messy, the approval boundary was ambiguous, the tool returned partial results, the human handoff was unclear, or the workflow asked for autonomy before it had defined what “safe autonomy” means.

If you treat every exception as an agent-performance issue, your fixes will be shallow.

You will tweak prompts. You will swap models. You will add more reminders. You will ask for more careful behavior. You will add a reviewer at the end.

Some of that may help.

But the deeper question is:

What condition made this exception likely?

That question moves you from blame to design.

What people are overreacting to

People are overreacting to agent evaluation as if it were mainly a pass/fail problem.

Did the agent complete the task? Was the output accepted? Did the human approve it? Was the action compliant? Did the run succeed?

Those are necessary checks. But they can create a brittle management habit: judging each run without improving the system that produces the runs.

A team can collect beautiful evaluation data and still fail to improve the workflow.

This happens when exception logs become archives instead of operating tools.

The log gets reviewed only during incidents. The same failure type appears every week. Humans keep fixing the same ambiguity by hand. The agent keeps asking the same question. The workflow keeps routing borderline cases to the wrong place. The prompt gets longer, but the process does not get clearer.

That is not learning.

That is documented repetition.

What people are underreacting to

People are underreacting to how often an “agent failure” is actually a workflow smell.

Here are a few common examples.

If the agent repeatedly asks for clarification, the problem may not be agent hesitancy. The intake form may be missing a required field.

If the agent escalates late, the problem may not be poor judgment. The workflow may not define early stop conditions.

If the agent produces unreviewable recommendations, the problem may not be dishonesty. The output format may not require evidence, uncertainty, and decision basis.

If the agent acts outside the intended boundary, the problem may not be ambition. The permission model may blur preparation, recommendation, and execution.

If humans keep overriding tone, the problem may not be writing quality. The team may never have encoded the real audience norm.

If the agent handles routine cases well but fails edge cases, the problem may not be capability alone. The routing layer may be sending cases to the agent that should never have been in its lane.

The exception log should help you see those patterns.

A better exception log structure

If I were managing a real agent workflow, I would not design the exception log as a long list of mishaps.

I would design it as a decision surface.

Each exception should answer five questions.

1. What happened?

Capture the event plainly.

Not a novel. Not a defensive explanation. Just the operational fact.

Examples:

  • agent recommended sending an email with missing supporting evidence
  • agent retried a failing tool three times without escalating
  • human overrode classification from “safe to execute” to “needs approval”
  • agent asked for clarification after drafting the wrong output
  • reviewer accepted the answer but rewrote the reasoning before approval

This is the event layer.

It tells you what to inspect.

2. What kind of exception was it?

Tag the exception by type.

Useful categories include:

  • missing input
  • unclear instruction
  • weak evidence
  • tool failure
  • late escalation
  • boundary confusion
  • policy ambiguity
  • routing error
  • human preference mismatch
  • quality gap
  • near miss
  • good escalation

The category matters because different categories require different fixes.

A tool failure does not need the same response as a policy ambiguity. A human preference mismatch does not need the same response as a late escalation. Boundary confusion is not solved by asking the agent to “be more careful.”

3. What condition allowed it?

This is the most important question.

Ask what made the exception possible or likely.

Was the input incomplete? Was the success criterion vague? Was the agent allowed to act before evidence was gathered? Was the stop line missing? Was the approval step too late? Was the tool output too trusted? Was the handoff packet unreadable? Was the policy real but undocumented? Was the edge case routed to the wrong worker?

This is where the log becomes useful.

The condition is usually more valuable than the incident.

4. What should change?

Every meaningful exception should point to one of several repair actions.

  • update the instruction
  • change the intake form
  • add a required evidence field
  • move human review earlier
  • narrow the agent’s authority
  • add a stop condition
  • improve tool error handling
  • split the workflow into routine and exception lanes
  • create a better handoff template
  • add an evaluation case
  • retire a task from the agent’s scope

The log should not merely say “agent failed.”

It should say what gets changed before the next run.

5. What authority implication follows?

Finally, connect the exception to authority.

Does this exception mean the agent can continue as-is? Does it mean the agent should be narrowed? Does it mean the workflow needs repair before expansion? Does it mean the agent handled uncertainty well and deserves more trust in similar cases? Does it mean a new class of work should remain human-led?

This is where exception logs connect back to management.

You are not logging for memory alone.

You are logging so future permission decisions are based on evidence.

The repair loop

A useful exception log should feed a weekly or monthly repair loop.

Not a giant committee. Not a compliance theatre meeting.

A practical loop.

  1. Review the top repeated exception patterns.
  2. Separate agent weakness from workflow weakness.
  3. Pick one or two workflow repairs.
  4. Update the prompt, intake, tool boundary, approval step, or routing rule.
  5. Add one evaluation case that represents the old failure.
  6. Run the agent again under the new design.
  7. Watch whether the exception pattern declines.

The important thing is not to fix every exception immediately.

The important thing is to prevent the same exception from becoming background noise.

Repeated exceptions are management messages.

If nobody acts on them, the system is not learning.

What managers should do differently

Managers should stop asking only, “Is the agent good enough?”

Ask:

What is this exception log teaching us about the work?

Look for repeated ambiguity. Look for human cleanup. Look for hidden approvals. Look for cases where the agent was asked to infer business judgment that the workflow never made explicit.

The manager’s job is not to personally review every run forever.

The manager’s job is to turn repeated exceptions into better delegation boundaries.

What builders should do differently

Builders should make exception logs structured enough to repair the system.

Do not only store raw transcripts and final outcomes.

Store exception type, triggering condition, human correction, evidence gap, tool behavior, authority implication, and recommended repair.

If the logging system cannot answer “what should change?”, it is mostly observability decoration.

What knowledge workers should do differently

Knowledge workers should treat their own AI friction as workflow evidence.

When an assistant keeps misunderstanding a task, do not only rewrite the prompt in the moment.

Ask what the recurring friction means.

Maybe the task needs a checklist. Maybe the source material needs a better structure. Maybe the decision criteria are not explicit. Maybe the assistant should draft but not decide. Maybe the workflow needs an example of good output.

Your personal exception log can be simple.

A note with three columns is enough:

  • what went wrong
  • why it probably happened
  • what I will change next time

That is how personal AI use becomes a system instead of a string of clever prompts.

The point

The mature version of agent management is not more dashboards.

It is better repair loops.

An exception log should help you judge an agent, yes.

But its higher use is to improve the work around the agent: the instructions, boundaries, inputs, tools, approvals, handoffs, and authority decisions that shape every run.

If the same exception keeps appearing and nothing changes, the agent is not the only thing failing.

The workflow is failing to learn.