The agent evidence ledger is the missing companion to the authority map
An authority map says what an agent may do. An evidence ledger says why it has earned that authority, where trust is still provisional, and what should change after real operating evidence appears.
Yesterday’s post argued that a process map becomes useful for AI when it becomes an authority map.
The next question is what keeps that authority map honest after deployment.
This morning’s scan did not produce a single release worth chasing as news. The stronger live signal was a pattern: enterprise-agent discussion keeps moving toward governed workflow automation, audit trails, finance workflows, process intelligence, and orchestration across existing business systems.
That is the right direction.
But there is a trap in the way teams talk about this.
They say they need governance, observability, auditability, and controls. All true. Then they often turn those words into a compliance layer that sits beside the agent instead of an operating instrument that changes what the agent is allowed to do next.
The useful question is sharper:
What evidence should change the agent’s authority?
An authority map says what the agent may observe, prepare, recommend, execute, escalate, or never touch.
An evidence ledger says why those permissions are deserved, where they are still provisional, and what real-world operating data should cause them to expand, shrink, pause, or be redesigned.
For managers and builders, that ledger is becoming one of the most important artifacts in agent work.
What changed
The best live candidate this morning was:
Enterprise AI is moving from agent demos toward governed workflow automation with audit trails.
The scan surfaced the same pattern from several angles:
- agentic workflows in finance and reporting are being discussed through risk, governance, controls, and assurance rather than only productivity
- workflow-automation vendors increasingly frame agents as one participant inside a governed process, not as a magic worker floating above the process
- process intelligence and process mining keep showing up as ways to make enterprise work legible enough for AI to operate inside it
- coding-agent and enterprise-agent discussions are shifting from model selection toward operating discipline, handoffs, and review surfaces
That is useful, but as a standalone live post it would become too generic: “governance matters for agentic AI.”
True, but not enough.
The best backlog candidate was:
How to build an evidence ledger for each authority-map step.
That wins today because it gives the live governance discussion a concrete operating artifact.
If an authority map is the promise, the evidence ledger is the proof system.
Why authority needs evidence
The most dangerous version of agent deployment is not the one where an agent gets too much authority on day one.
That is obviously risky.
The subtler danger is giving an agent provisional authority, watching it work for a while, and then changing permissions based on vibes.
A few successful runs create confidence.
A few impressive outputs make the team relax.
A few urgent requests tempt the manager to widen scope.
A stakeholder says, “It seems to be working. Can we just let it handle the next step too?”
That is how authority creep happens.
Not through one reckless decision, but through a series of small permission changes that are not tied to evidence.
The agent is allowed to draft.
Then it is allowed to send with approval.
Then it is allowed to send low-risk cases automatically.
Then the low-risk definition quietly expands.
Then exceptions are treated as edge cases rather than design feedback.
By the time something breaks, nobody can point to the exact evidence that justified the authority increase.
That is a management failure, not only a technical failure.
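One way to make that failure harder to commit, as a minimal sketch rather than a prescribed design: treat every permission change as a record that cannot be saved without pointing at evidence. The schema, the field names, and the `UnjustifiedChange` error are all hypothetical, invented here for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AuthorityChange:
    """One permission change for one workflow step (hypothetical schema)."""
    step: str
    old_level: str
    new_level: str
    decided_on: date
    decided_by: str
    # Pointers to review cycles, reports, or tickets that justify the change.
    evidence_refs: tuple[str, ...] = ()


class UnjustifiedChange(ValueError):
    """Raised when a permission change cites no evidence."""


def record_change(history: list[AuthorityChange], change: AuthorityChange) -> None:
    """Append the change to the history only if it cites at least one piece of evidence."""
    if not change.evidence_refs:
        raise UnjustifiedChange(
            f"{change.step}: {change.old_level} -> {change.new_level} cites no evidence"
        )
    history.append(change)
```

The check is deliberately crude. It does not judge whether the evidence is good, only that someone had to name it, which is the part that goes missing when authority creeps.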
What people are overreacting to
People are overreacting to audit logs as if logs alone create control.
Logs are necessary. They are not sufficient.
A log tells you what happened.
An evidence ledger tells you what that history means for future authority.
There is a big difference.
A system can record every tool call, every prompt, every output, every approval, and every exception, while still leaving managers with no clear answer to the questions that matter:
- Should this agent keep its current authority?
- Which step is safe to automate further?
- Which step should be demoted back to preparation or recommendation?
- Which failure was a one-off and which failure reveals a bad workflow design?
- Which human approval is still valuable and which one has become rubber-stamping?
- Which input-quality problem should be fixed before the agent gets more autonomy?
Raw observability is a memory of events.
Governance needs interpretation.
The evidence ledger is where interpretation becomes operational.
What people are underreacting to
People are underreacting to the idea that agent authority should be earned continuously.
Not granted once.
Not trusted because the model is better.
Not expanded because the vendor shipped a new orchestration feature.
Earned, step by step, with evidence tied to the specific work being delegated.
This matters because agent performance is not a single global trait.
An agent may be excellent at gathering evidence but weak at interpreting policy exceptions.
It may draft clear customer replies but fail when the source record is incomplete.
It may classify routine tickets well but become brittle when two workflows overlap.
It may execute one system update reliably but mishandle the downstream communication.
A single pass/fail score hides this texture.
An evidence ledger preserves it.
It says:
At this step, for this kind of case, under these conditions, we have enough evidence to allow this kind of authority, and no more.
That is the granularity responsible agent management needs.
The evidence ledger version of an authority map
For every meaningful step in the authority map, I would add an evidence ledger with seven fields.
Not because every small workflow needs heavy bureaucracy.
Because without these fields, teams will eventually make authority decisions from scattered impressions.
1. Current authority level
Start by stating what the agent is currently allowed to do.
Use plain language.
For example:
- may observe tickets and source documents
- may prepare a summary and evidence packet
- may recommend a category and next action
- may draft a customer reply but not send it
- may update the internal status field after human approval
- may execute routine updates below a defined risk threshold
- must escalate exceptions, missing evidence, conflicting instructions, or high-value cases
This sounds obvious, but many teams cannot answer it crisply.
They know the agent is “in the workflow.”
They do not know exactly what authority it has at each point.
If the current authority level cannot be written in one or two sentences, it is probably not designed clearly enough to govern.
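For builders, the plain-language statement can sit next to a machine-checkable one. The sketch below assumes the levels can be put on a single ordered scale, which is a simplification of my own, and the step names are hypothetical.

```python
from enum import IntEnum


class AuthorityLevel(IntEnum):
    """Ordered so that a higher value means more autonomy (assumed ordering)."""
    NEVER_TOUCH = 0
    OBSERVE = 1
    PREPARE = 2
    RECOMMEND = 3
    EXECUTE_WITH_APPROVAL = 4
    EXECUTE = 5


# Hypothetical authority map for a ticket workflow: step -> current level.
authority_map = {
    "read_ticket": AuthorityLevel.OBSERVE,
    "summarize_evidence": AuthorityLevel.PREPARE,
    "categorize": AuthorityLevel.RECOMMEND,
    "update_status_field": AuthorityLevel.EXECUTE_WITH_APPROVAL,
    "send_customer_reply": AuthorityLevel.PREPARE,  # may draft, may not send
}


def is_allowed(step: str, requested: AuthorityLevel) -> bool:
    """Allow an action only if the step is mapped and the request stays within its level."""
    current = authority_map.get(step, AuthorityLevel.NEVER_TOUCH)
    return AuthorityLevel.NEVER_TOUCH < requested <= current
```

Unmapped steps default to `NEVER_TOUCH`, so an agent that wanders into a step nobody designed gets no authority there by default.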
2. Evidence required to keep this authority
Authority should not only have expansion criteria.
It should have maintenance criteria.
What evidence proves the current permission level is still safe and useful?
Examples:
- the agent consistently cites the correct source records
- human reviewers accept the prepared packet without major missing evidence
- recommendation disagreement rates stay below a defined threshold
- exceptions are escalated cleanly rather than forced through the happy path
- errors are recoverable without customer, financial, legal, or operational harm
- the agent’s work reduces review time without reducing decision quality
The key is not to invent fake precision.
The key is to make the standard explicit enough that permission is not preserved by inertia.
3. Evidence required to expand authority
This is where many teams get sloppy.
They expand authority because the agent looks good, not because it has cleared a specific operating test.
A better ledger asks:
What would we need to see before this agent earns the next permission level?
For a preparation agent, expansion might require evidence that human reviewers rarely need to add missing records.
For a recommendation agent, expansion might require evidence that disagreements are understood, categorized, and concentrated in known edge cases.
For an execution agent, expansion might require evidence that rollback works, exception routing is reliable, and impact stays within agreed limits.
Expansion evidence should be tied to the next authority level, not to generic satisfaction.
The evidence required to move from observe to prepare is not the same as the evidence required to move from recommend to execute.
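A rough sketch of what "tied to the next authority level" could look like in code: one test per transition, and no expansion when no test exists. The level names, evidence keys, and the numeric cutoff are placeholders I made up, not recommended values.

```python
from typing import Callable

# Evidence observed over a review period; all keys are hypothetical.
Evidence = dict[str, float]

# One criterion per transition. The numbers are placeholders, not advice.
EXPANSION_CRITERIA: dict[tuple[str, str], Callable[[Evidence], bool]] = {
    # Preparation agent: reviewers rarely need to add missing records.
    ("prepare", "recommend"): lambda e: e["missing_record_rate"] <= 0.02,
    # Recommendation agent: every disagreement has been understood and categorized.
    ("recommend", "execute_with_approval"): lambda e: e["uncategorized_disagreements"] == 0,
    # Execution agent: rollback works and no exception slipped past routing.
    ("execute_with_approval", "execute"): lambda e: (
        e["rollback_success_rate"] >= 1.0 and e["missed_escalations"] == 0
    ),
}


def may_expand(current: str, target: str, evidence: Evidence) -> bool:
    """Return True only if a criterion exists for this exact transition and it passes."""
    criterion = EXPANSION_CRITERIA.get((current, target))
    if criterion is None:
        return False  # no defined test means no expansion
    try:
        return criterion(evidence)
    except KeyError:
        return False  # the required evidence was never collected
```

Missing evidence and missing criteria both resolve to "no". That is the point: generic satisfaction cannot unlock a transition that nobody wrote a test for.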
4. Evidence that should reduce authority
A serious ledger includes demotion criteria before trouble arrives.
This is uncomfortable, which is why it is useful.
Demotion evidence might include:
- repeated missing-source errors
- confident recommendations with weak evidence
- failure to escalate ambiguous cases
- rising human override rates
- repeated corrections in the same workflow step
- hallucinated system state
- use of stale documents or outdated policy
- customers, operators, or downstream teams discovering errors before the agent or reviewer does
The point is not to punish the agent.
The point is to protect the work.
If the ledger has no demotion criteria, the only realistic response to deteriorating performance is ad hoc debate.
That is too slow for real operations and too vague for good management.
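Demotion criteria can be written down before trouble arrives in a form as small as the sketch below. The `StepStats` fields are hypothetical, and treating "repeated" as three or more is my assumption. A real team would agree its own numbers.

```python
from dataclasses import dataclass


@dataclass
class StepStats:
    """Counts for one workflow step over one review period (hypothetical fields)."""
    cases: int
    human_overrides: int
    missing_source_errors: int
    missed_escalations: int
    errors_found_downstream: int


def demotion_reasons(prev: StepStats, curr: StepStats) -> list[str]:
    """List every pre-agreed demotion trigger that fired; an empty list means none did."""
    reasons = []
    if curr.missed_escalations > 0:
        reasons.append("failed to escalate ambiguous cases")
    if curr.errors_found_downstream > 0:
        reasons.append("errors discovered downstream before agent or reviewer")
    if curr.missing_source_errors >= 3:  # "repeated" is assumed to mean 3 or more
        reasons.append("repeated missing-source errors")
    # Compare override rates, not raw counts, and skip empty periods.
    if prev.cases and curr.cases:
        if curr.human_overrides / curr.cases > prev.human_overrides / prev.cases:
            reasons.append("rising human override rate")
    return reasons
```

Returning the reasons rather than a bare yes or no keeps the conversation on evidence: the meeting starts from which trigger fired, not from whether anyone feels worried.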
5. Exception patterns
Exceptions are not just incidents.
They are curriculum.
A good evidence ledger tracks not only how many exceptions occurred, but what kind.
Useful categories might include:
- missing input
- conflicting records
- unclear policy
- unsupported request
- tool failure
- permissions failure
- unusual customer context
- high financial or reputational risk
- human disagreement with agent reasoning
- downstream correction after completion
This matters because different exception patterns imply different fixes.
Missing inputs may mean the intake process is broken.
Conflicting records may mean the system of record is unreliable.
Unclear policy may mean the agent is exposing a management ambiguity that humans were previously absorbing silently.
Tool failures may mean the automation layer is brittle.
Human disagreement may mean the agent is wrong, the reviewer is inconsistent, or the decision rule is underspecified.
An exception log records the problem.
An evidence ledger connects the pattern to authority decisions.
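As an illustration of connecting the pattern to a decision, the sketch below counts exceptions by category and looks up where the dominant one points. It covers only a subset of the categories listed above, and the mapping simply restates the reasoning in this section. The names are hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class ExceptionKind(Enum):
    MISSING_INPUT = "missing input"
    CONFLICTING_RECORDS = "conflicting records"
    UNCLEAR_POLICY = "unclear policy"
    TOOL_FAILURE = "tool failure"
    HUMAN_DISAGREEMENT = "human disagreement"


# Where each pattern most likely points, following the reasoning above.
LIKELY_FIX = {
    ExceptionKind.MISSING_INPUT: "repair the intake process",
    ExceptionKind.CONFLICTING_RECORDS: "fix the system of record",
    ExceptionKind.UNCLEAR_POLICY: "resolve the policy ambiguity",
    ExceptionKind.TOOL_FAILURE: "harden the automation layer",
    ExceptionKind.HUMAN_DISAGREEMENT: "check the agent, reviewer consistency, and the decision rule",
}


def dominant_pattern(exceptions: list[ExceptionKind]) -> Optional[tuple[ExceptionKind, str]]:
    """Return the most frequent exception kind and the fix it suggests, or None if empty."""
    if not exceptions:
        return None
    kind, _count = Counter(exceptions).most_common(1)[0]
    return kind, LIKELY_FIX[kind]
```

The lookup is a starting hypothesis, not a diagnosis. Its use is to stop a cluster of "unclear policy" exceptions from being filed as agent errors when the fix belongs to management.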
6. Human review quality
Human review is often treated as a safety blanket.
It is not automatically one.
If reviewers are overloaded, unclear on criteria, or rubber-stamping outputs, the approval step may create the appearance of control without the substance.
So the ledger should track review quality too.
Ask:
- Did the reviewer have enough evidence to make a real decision?
- How often did reviewers change the agent’s work?
- Were corrections about facts, judgment, tone, policy, risk, or missing context?
- Did different reviewers disagree with each other?
- Did review time go down because the packet was better, or because people stopped looking carefully?
- Are reviewers catching issues before downstream users do?
This is especially important when teams plan to remove a human checkpoint.
Before removing review, inspect whether review was actually adding value.
If review was valuable, removing it adds risk.
If review was low-value, the better question is whether the checkpoint should be redesigned before it is removed.
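Some of these questions can be approximated from data the workflow already produces. The sketch below is one hedged way to do it: the `Review` fields, the 30-second cutoff, and the rubber-stamping heuristic are all assumptions of mine, and a flag from it is a prompt to look closer, not a verdict.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Review:
    """One human review of one agent output (hypothetical fields)."""
    seconds_spent: float
    changed_output: bool
    issue_found_downstream: bool  # an error surfaced after approval


def review_signals(reviews: list[Review]) -> dict[str, float]:
    """Summarise how much work the checkpoint is actually doing."""
    n = len(reviews)
    if n == 0:
        return {"change_rate": 0.0, "escape_rate": 0.0, "median_seconds": 0.0}
    return {
        "change_rate": sum(r.changed_output for r in reviews) / n,
        "escape_rate": sum(r.issue_found_downstream for r in reviews) / n,
        "median_seconds": median(r.seconds_spent for r in reviews),
    }


def looks_like_rubber_stamping(signals: dict[str, float], min_seconds: float = 30.0) -> bool:
    """Heuristic only: fast reviews that change nothing while errors still escape."""
    return (
        signals["median_seconds"] < min_seconds
        and signals["change_rate"] == 0.0
        and signals["escape_rate"] > 0.0
    )
```

Fast reviews that change nothing are not a problem on their own, since the packet may simply be good. The signal worth acting on is the combination with errors escaping downstream.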
7. Next authority decision
The ledger should end with a decision, not a dashboard.
For each step, name the current management choice:
- keep authority unchanged
- expand authority under specific limits
- reduce authority temporarily
- redesign the workflow step
- fix input quality before reassessment
- improve review criteria
- add a stop line
- retire the agent from this step
This is where governance becomes management.
The goal is not to admire metrics.
The goal is to decide what the system should be allowed to do next.
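Put together, the seven fields fit in a structure small enough to live in a spreadsheet or a few lines of code. This is a minimal sketch, assuming free-text evidence descriptions. The class and field names are illustrative, not a standard.

```python
from dataclasses import dataclass, field, fields
from enum import Enum
from typing import Optional


class NextDecision(Enum):
    KEEP = "keep authority unchanged"
    EXPAND = "expand authority under specific limits"
    REDUCE = "reduce authority temporarily"
    REDESIGN_STEP = "redesign the workflow step"
    FIX_INPUTS = "fix input quality before reassessment"
    IMPROVE_REVIEW = "improve review criteria"
    ADD_STOP_LINE = "add a stop line"
    RETIRE = "retire the agent from this step"


@dataclass
class LedgerEntry:
    """One ledger row per authority-map step; numbers match the seven fields above."""
    step: str
    current_authority: str                                       # 1
    keep_evidence: list[str]                                     # 2
    expand_evidence: list[str]                                   # 3
    reduce_evidence: list[str]                                   # 4
    exception_patterns: list[str] = field(default_factory=list)  # 5
    review_quality: str = ""                                     # 6
    next_decision: Optional[NextDecision] = None                 # 7

    def missing_fields(self) -> list[str]:
        """Names of fields that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

An entry with a non-empty `missing_fields()` is a useful agenda item in itself: it shows which part of the authority decision is still resting on impressions.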
A simple example
Imagine an agent inside a procurement workflow.
The process map says:
- intake purchase request
- check policy
- compare vendor options
- prepare recommendation
- route for approval
- update procurement system
- notify requester
The authority map says the agent may observe the request, prepare a policy evidence packet, recommend a vendor option, and draft the approval note, but may not update the procurement system or notify the requester.
The evidence ledger for the policy-check step might say:
- current authority: may prepare a policy evidence packet and recommend whether the request appears compliant
- keep-authority evidence: correct policy citation in at least 95% of reviewed routine cases; missing-evidence rate below a defined threshold; all ambiguous policy cases escalated
- expansion evidence: three review cycles showing low disagreement on routine cases, clear exception clustering, and no missed high-risk requests
- reduce-authority evidence: any missed high-risk policy exception, repeated stale-policy references, or rising reviewer overrides
- exception pattern: most issues come from ambiguous budget-owner approval rules
- review quality: reviewers frequently correct approval-owner interpretation, not factual sourcing
- next decision: do not expand execution authority; fix the approval-owner rule and update the evidence packet template first
Notice what changed.
The team did not simply ask, “Is the agent accurate?”
It asked whether the evidence supports a specific authority decision at a specific workflow step.
That is a much better management question.
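The keep-authority line in that example is specific enough to check mechanically. A minimal sketch follows, with hypothetical parameter names. The 95% figure comes from the example, while the missing-evidence threshold is left as a required argument because the example only says it is "defined" and does not give a value.

```python
def keep_authority(
    reviewed: int,
    correct_citations: int,
    missing_evidence: int,
    ambiguous: int,
    ambiguous_escalated: int,
    *,
    max_missing_rate: float,           # the team's own threshold; not given in the example
    min_citation_rate: float = 0.95,   # from the procurement example
) -> bool:
    """Keep-authority test for the policy-check step in the procurement example."""
    if reviewed == 0:
        return False  # no reviewed cases means no evidence either way
    citation_ok = correct_citations / reviewed >= min_citation_rate
    missing_ok = missing_evidence / reviewed < max_missing_rate
    escalation_ok = ambiguous_escalated == ambiguous  # all ambiguous cases escalated
    return citation_ok and missing_ok and escalation_ok
```

Note that passing this check only preserves the current permission. In the example the next decision is still "do not expand", because the expansion evidence is a separate test.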
What managers should do differently
If you manage AI work, stop asking only for demos, accuracy rates, or adoption updates.
Ask for the evidence ledger.
For any agent that participates in real work, ask:
- What authority does it currently have at each workflow step?
- What evidence justifies that authority?
- What evidence would let us expand authority?
- What evidence would force us to reduce authority?
- What exception patterns are teaching us about the workflow?
- What does human review still catch?
- What is the next authority decision?
This turns agent governance from a policy document into an operating rhythm.
It also changes the meeting.
Instead of asking, “How is the agent doing?” you ask:
What has the agent earned, what has it not earned, and what did the workflow teach us this week?
That is harder to fake.
It is also more useful.
What builders should do differently
If you build agent systems, do not treat evidence as an afterthought.
Design the ledger into the workflow from the start.
That means the system should make it easy to capture:
- the source records used
- the recommendation made
- the action taken or withheld
- the human decision
- the reason for disagreement
- the exception category
- the downstream correction, if any
- the authority decision that followed
This does not require a giant platform at first.
A well-structured table is better than a vague dashboard.
A lightweight review packet is better than a beautiful observability screen that no manager knows how to act on.
The artifact should support decisions about authority.
If it does not, it is probably monitoring theatre.
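To show how little is needed for a first version of that table, here is a sketch that writes the capture list above as CSV rows using only the Python standard library. The column names are my own rendering of the list, and the in-memory buffer stands in for whatever file or store a team would actually use.

```python
import csv
import io

# One column per item in the capture list above; names are illustrative.
COLUMNS = [
    "source_records", "recommendation", "action_taken", "human_decision",
    "disagreement_reason", "exception_category", "downstream_correction",
    "authority_decision",
]


def append_row(buffer: io.StringIO, row: dict[str, str]) -> None:
    """Append one ledger row; unknown columns raise ValueError, gaps become ''."""
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    if buffer.tell() == 0:
        writer.writeheader()  # write the header once, on the first row
    writer.writerow(row)


ledger = io.StringIO()
append_row(ledger, {"source_records": "PO-1", "recommendation": "compliant"})
append_row(ledger, {"source_records": "PO-2", "human_decision": "rejected"})
```

Rejecting unknown columns is a small design choice with a large effect: the table stays comparable from week to week, which is what lets it support an authority decision rather than just accumulate notes.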
The practical test
Here is the test I would use before expanding any agent’s authority:
Can we point to the evidence that says this agent has earned the next permission level at this exact workflow step?
If the answer is no, do not expand authority yet.
Keep the agent useful, but keep it bounded.
Improve the evidence layer.
Fix the input problem.
Clarify the review criteria.
Redesign the stop line.
Run another cycle.
The aim is not to be timid.
The aim is to make autonomy earned instead of assumed.
That is how agent systems become more than impressive demos.
They become managed workers inside real workflows.
And managed workers need records of trust, not just records of activity.