Taint Tracking for Agent Tool-Calls

Kendall Clark · Pentad Labs · 2 June 2026 · PLRN-008

Abstract

Tool-call argument integrity in an agent system is an information-flow taint problem, and most of it is solvable by a deterministic, zero-LLM evaluator. The part that resists deterministic treatment is goal-alignment across a trajectory, and that part is the open boundary. Naming it precisely is most of our contribution.

The note starts from a published defense, AgentVisor, which interposes a trusted Visor between an agent and tool execution and audits each proposed tool call with a three-check protocol called STI: Suitability, Taint, and Integrity. AgentVisor runs STI with a second model acting as judge. The WunderOS interceptor architecture rules model-in-the-loop evaluators out of the policy path, since evaluators there are pure functions with byte-for-byte replay determinism.

The question is whether STI survives delamination from the model judge into deterministic, replay-safe evaluators. Two of the three checks survive. Suitability delaminates cleanly into a capability gate. Integrity is tractable as an argument-level provenance check. Taint, the check on goal-alignment across a trajectory, is the one that degrades, and the honest result is that Taint may belong as a deferral to human review rather than an inline block.

1. The STI protocol and where it runs

AgentVisor treats the agent as untrusted and the tool boundary as the place of enforcement. Before any proposed tool call executes, a trusted Visor audits it against three checks.

Suitability asks whether the tool is permitted under the system role the agent was given.
Taint asks whether the implied goal of the call aligns with the user’s request, or whether a new goal has been smuggled in through content the agent ingested.
Integrity asks whether the tool arguments are consistent with the entities and constraints established by the user query and the prior history.

On failure the Visor raises a structured exception, and the agent gets one chance to self-correct before the call is refused.

AgentVisor performs all three checks with a second model acting as judge. The published protocol reports low attack-success rates while largely preserving task utility, which is the result that makes STI worth adopting. The judge is a language model, and the audit is a model inference on the hot path of every tool call.

The WunderOS interceptor architecture forbids that. The interceptor commits the policy path to pure functions with byte-for-byte replay determinism, and some of those functions are pushed down into bounded native kernels. A model inference is not replayable byte-for-byte and is not a pure function of its declared inputs, so a model judge cannot sit in the policy path. The protocol is attractive and the implementation is ruled out, which sets the question for the rest of the note: can STI be delaminated from the model judge into deterministic, replay-safe evaluators without losing what made it work?

The three checks delaminate unequally. Our contribution is saying which, and why, and at what cost.

2. Suitability delaminates cleanly

Suitability is a capability gate. The system role assigned to an agent fixes the set of tools that role may invoke, and that set is known at session start. A lookup against a role-to-allowed-tool whitelist, parsed once at session start, is a pure function from (role, proposed tool) to a boolean, and it admits or refuses the call with no reference to the call’s arguments, the conversation, or any external content.

No model sits on this path. The whitelist is static configuration, the lookup is constant-time, and the decision replays byte-for-byte because it depends on nothing that varies between runs. A model judge contributes nothing to Suitability that the whitelist does not already capture, since the question is closed: either the role permits the tool or it does not. This is the cleanest of the three, and it is clean because the question has a finite, enumerable answer set established before the agent runs.
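A minimal sketch of such a gate, assuming a whitelist expressed as a mapping from role to tool names. The role names, tool names, and the functions load_whitelist and suitable are hypothetical and are not taken from WunderOS or AgentVisor.

```python
from types import MappingProxyType

# Hypothetical static configuration; roles and tools are illustrative only.
RAW_POLICY = {
    "mail_assistant": ["read_inbox", "send_email"],
    "researcher": ["web_search", "read_file"],
}

def load_whitelist(raw):
    """Parse the configuration once at session start into a read-only mapping."""
    return MappingProxyType({role: frozenset(tools) for role, tools in raw.items()})

WHITELIST = load_whitelist(RAW_POLICY)

def suitable(role, tool, whitelist=WHITELIST):
    """Pure function of (role, tool). It never sees arguments or history."""
    return tool in whitelist.get(role, frozenset())
```

An unknown role maps to the empty set, so the gate fails closed. That default is a design choice of this sketch, not something the note specifies.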

3. Integrity is an argument-level provenance check

Integrity asks whether the arguments of a proposed tool call are consistent with the entities and constraints the user established. The canonical failure makes the shape concrete: a user asks the agent to send an email to a named recipient, the agent ingests some content during the task, and the recipient address in the proposed send_email call has been silently swapped for another. Nothing in the request authorized that recipient, so the argument is inconsistent with the entities the user established. That is an entity-consistency violation.

This delaminates onto argument-level provenance, the mechanism PACT develops for agent security. Each entity that appears in a tool argument is checked against the entities established by the user query and by prior turns of the structured session record. An argument value that traces to a user-established entity passes. An argument value that appears for the first time inside content the agent ingested, with no user-side origin, is flagged. The session record already holds the structured history, so the lookup is a provenance query over data the substrate maintains, and it runs as a pure function with no model on the path.
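The lookup can be sketched as follows, assuming the session record exposes the user-established entities as a set of normalized strings. SessionRecord, check_integrity, and the field name "to" are hypothetical illustrations, not PACT's or WunderOS's interfaces.

```python
from dataclasses import dataclass
from typing import FrozenSet, Mapping, Sequence, Tuple

@dataclass(frozen=True)
class SessionRecord:
    # Entities established by the user query or prior user-side turns,
    # stored lower-cased and stripped (an assumption of this sketch).
    user_entities: FrozenSet[str]

def check_integrity(
    record: SessionRecord,
    args: Mapping[str, str],
    entity_fields: Sequence[str],
) -> Tuple[str, ...]:
    """Return the argument fields whose value has no user-side origin."""
    flagged = []
    for name in entity_fields:
        value = args.get(name)
        if value is None:
            continue
        if value.strip().lower() not in record.user_entities:
            flagged.append(name)
    return tuple(flagged)
```

Under these assumptions the recipient-swap case looks like this: with alice@example.com as the only user-established entity, a send_email call addressed to any other recipient returns ("to",), and a call addressed to Alice returns an empty tuple.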

One risk is real and worth naming. The check assumes the entities in the session history are typed and addressable. If the sanitized history is natural-language prose rather than typed structure, the recipient address is buried in a sentence rather than sitting in a field, and the provenance lookup has nothing to key on.

The fix is a small local entity-extraction step that lifts addresses, names, and identifiers out of the prose before the lookup. That step is local and deterministic under a fixed configuration, and it does not reintroduce a judge. It does add a component, and it is where Integrity is most likely to miss an entity, since extraction recall bounds the check. Integrity is tractable, but it is tractable on typed history and only approximately tractable on prose.
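A deliberately narrow sketch of the extraction step, restricted to email addresses and using a simplified regular expression. The pattern and the name extract_emails are assumptions for illustration; a real extractor would also need to cover names and other identifiers.

```python
import re

# Simplified address pattern; it does not implement the full email grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(prose: str) -> frozenset:
    """Lift email addresses out of prose, normalized to lower case."""
    return frozenset(match.lower() for match in EMAIL_RE.findall(prose))
```

The same sketch shows the recall bound. An address written as "alice at example dot com" yields no entity at all, so the provenance lookup downstream has nothing to compare against.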

4. Taint is the open boundary

Taint is the research bet. It asks whether the goal implied by a tool call still aligns with the user’s request, or whether the agent has adopted a goal injected through content it ingested. Goal alignment across a trajectory has no clean typed analog the way a recipient address does. A goal is not an entity in a field. It is a relation between what the user asked for and what the sequence of calls is now pursuing, and that relation is the thing a model judge was actually evaluating when it performed the Taint check.

The candidate deterministic treatment is goal-provenance. The WunderOS session keeps a causal record, a graph in which each tool call is a node linked to the nodes that caused it. The Taint check becomes a reachability query: every tool call must trace back through the causal record to a user-goal node, and a call whose causal ancestry leads only to ingested content, never to a user goal, is tainted. This is feasible because the causal record already exists, so the check adds a graph traversal rather than a new model. Goal-provenance catches the clear cases, where an injected instruction spawns a call with no user-goal ancestor at all.

It under-covers the hard cases. An injection that rephrases or extends a legitimate user goal produces a call whose causal ancestry does pass through a user-goal node, because the agent threaded the injected instruction through its reasoning about the real task, and the reachability query then returns true on a call that a human would judge tainted.

The indirect-injection case is where the residual gap concentrates. Goal-provenance answers “does this call descend from a user goal,” and the Taint check needs “does this call serve the user goal,” and those two questions diverge precisely when the attack is adaptive.
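The reachability query and the case it misses can be sketched together. The record layout, the node kinds, and the function name are hypothetical, and the sketch assumes that ingested content is a traversal boundary, meaning the query does not follow an ingested node back to the call that fetched it. The note does not say how WunderOS records that edge.

```python
from typing import Dict, Tuple

# Hypothetical causal record: node id -> (kind, ids of the nodes that caused it).
# Kinds are "user_goal", "tool_call", and "ingested".
CausalRecord = Dict[str, Tuple[str, Tuple[str, ...]]]

def descends_from_user_goal(record: CausalRecord, node_id: str) -> bool:
    """Return True if some causal ancestor of node_id is a user-goal node."""
    stack = list(record[node_id][1])
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        kind, parents = record[current]
        if kind == "user_goal":
            return True
        if kind == "ingested":
            # Assumption: ingested content is a boundary and is not expanded.
            continue
        stack.extend(parents)
    return False

record: CausalRecord = {
    "goal": ("user_goal", ()),
    "fetch": ("tool_call", ("goal",)),
    "page": ("ingested", ("fetch",)),
    "exfil": ("tool_call", ("page",)),            # spawned by the page alone
    "threaded": ("tool_call", ("goal", "page")),  # injection threaded through the task
}
```

In this record the query refuses "exfil", whose only ancestor is ingested content, and admits "threaded", which has a user-goal ancestor even though the page shaped it. The second result is the under-coverage described above: descent from a user goal is established, service to it is not.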

Two options remain for the gap.

One is a fixed-seed small local entailment model that judges whether the proposed call’s implied goal is entailed by the user request, run deterministically under a pinned seed and weights so that it replays. This keeps the check inline but reintroduces a model, a smaller and bounded one, into a path the architecture would rather keep model-free, and a pinned-seed model is replayable only as long as the seed, weights, and kernel are themselves pinned and versioned.
The other is to accept a residual gap, let goal-provenance handle the clear cases inline, and route the calls that pass provenance but carry alignment uncertainty to a human reviewer rather than blocking them in the loop.

The second option is a finding, not a failure. If goal-provenance covers the clear cases and the residual cases are the adaptive injections that defeat deterministic reachability, then Taint is telling you where it belongs. Suitability and Integrity are inline blocks because their questions have deterministic answers. Taint’s question does not fully reduce to a deterministic answer, so Taint belongs as a deferral to human review for the cases that survive goal-provenance, rather than as an inline block that either over-refuses or silently passes adaptive injections. The protocol does not have to run identically at every check. The right placement of a check follows from whether its question delaminates, and Taint’s does not delaminate all the way.
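One way to picture the placement is a three-valued verdict in which only Taint can defer. The note does not define how alignment uncertainty is detected, so the sketch uses a stand-in signal, the presence of an ingested-content ancestor alongside a user-goal ancestor. That signal, the Verdict enum, and place_verdict are all assumptions.

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    DEFER = "defer"  # route to human review

def place_verdict(
    suitable: bool,
    integrity_ok: bool,
    has_user_goal_ancestor: bool,
    has_ingested_ancestor: bool,
) -> Verdict:
    """Combine the three checks; only the Taint residue is deferred."""
    # Suitability and Integrity have deterministic answers, so they block inline.
    if not suitable or not integrity_ok:
        return Verdict.BLOCK
    # Clear Taint case: no user-goal ancestry at all.
    if not has_user_goal_ancestor:
        return Verdict.BLOCK
    # Stand-in for alignment uncertainty: mixed ancestry (assumption).
    if has_ingested_ancestor:
        return Verdict.DEFER
    return Verdict.ALLOW
```

The stand-in signal is conservative and would defer many benign calls, since most useful calls depend on something the agent read. Choosing a sharper deterministic signal is part of the open problem rather than something this sketch settles.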

5. Context isolation is the free defense

A fourth ingredient comes for free and carries more weight against adaptive attacks than any of the three checks. The evaluator reads only the immutable, structured session snapshot. It never reads raw external tool output. An attacker who controls the content a tool returns controls what the agent ingests, but does not control what the evaluator sees, because the evaluator’s inputs are the typed session record and the proposed call, not the untrusted payload.

This is the dominant defense against adaptive attacks, since an adaptive attacker writes content designed to manipulate whatever audits the call, and context isolation removes the audit from the attacker’s reach.

It is independent of any model. It holds whether Taint runs as goal-provenance, as a pinned entailment model, or as a deferral, because it is a property of what the evaluator is allowed to read rather than of how the evaluator decides. The isolation is what keeps the deterministic checks honest under attack, and it costs nothing beyond the architectural commitment WunderOS already makes to immutable structured session state.
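Context isolation is a property of the evaluator's signature, which a short sketch can make visible. The types and function names below are hypothetical. The point is only that the raw tool output is never a parameter of the audit, so changing the payload cannot change the verdict.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class SessionSnapshot:
    role: str
    user_entities: FrozenSet[str]

@dataclass(frozen=True)
class ProposedCall:
    tool: str
    args: Tuple[Tuple[str, str], ...]  # immutable (name, value) pairs

def audit(snapshot: SessionSnapshot, call: ProposedCall) -> bool:
    """Reads only the typed snapshot and the proposed call."""
    recipients = [value for name, value in call.args if name == "to"]
    return all(value in snapshot.user_entities for value in recipients)

def harness_step(snapshot: SessionSnapshot, call: ProposedCall, raw_tool_output: str) -> bool:
    """The harness holds the raw payload for the agent and does not forward it."""
    del raw_tool_output  # deliberately unused by the audit path
    return audit(snapshot, call)
```

Because both inputs are frozen dataclasses, an attempt to mutate the snapshot after construction raises an error instead of silently changing what the evaluator sees.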

Self-correction stays where AgentVisor put it, as a retry. When a check fails, the harness returns the structured exception to the agent and lets it propose a corrected call once before refusing. This is a loop in the harness, not a new decision type in the interceptor. The interceptor’s job is to audit a proposed call and emit a verdict. Giving the agent a second attempt is the harness reacting to the verdict, and it requires no new evaluator and no new check.
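The retry can be sketched as a two-iteration loop in the harness. AuditFailure, run_with_one_retry, and the three callables are hypothetical names, and the audit is assumed to signal failure by raising the structured exception.

```python
class AuditFailure(Exception):
    """Structured exception carrying the failed check and a reason."""

    def __init__(self, check: str, detail: str):
        super().__init__(f"{check}: {detail}")
        self.check = check
        self.detail = detail

def run_with_one_retry(propose, audit, execute):
    """Run one proposal plus at most one correction.

    propose(feedback) returns a call; feedback is None on the first attempt
    and the previous AuditFailure on the second.
    audit(call) returns normally on success and raises AuditFailure on failure.
    execute(call) runs the tool. Returns None if both attempts are refused.
    """
    feedback = None
    for _ in range(2):
        call = propose(feedback)
        try:
            audit(call)
        except AuditFailure as failure:
            feedback = failure
            continue
        return execute(call)
    return None
```

The audit function is unchanged between attempts, which matches the point above: the second attempt is the harness reacting to a verdict, and the interceptor gains no new decision type.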

6. What delaminates and what does not

Two of STI’s three checks survive delamination from a model judge into deterministic, replay-safe evaluators. Suitability becomes a static capability gate. Integrity becomes an argument-level provenance check over the structured session record, tractable on typed history and approximately tractable on prose given a local extraction step. Both run as pure functions with no model on the policy path, and both replay byte-for-byte.

Taint does not delaminate all the way. Goal-provenance, a reachability query over the causal record that already exists, handles the cases where an injected goal has no user-goal ancestor, and it under-covers the adaptive cases where the injection threads through a legitimate goal. The two repairs are a pinned-seed local entailment model that keeps the check inline at the cost of a bounded model, or a deferral that routes alignment-uncertain calls to human review. The deferral is the honest placement, because Taint’s question, whether a call serves the user’s goal, does not reduce to a deterministic answer the way the other two questions do. Context isolation, which restricts the evaluator to the immutable session snapshot, is the dominant defense against adaptive attacks and is independent of all three checks.

Our design claim is that STI is mostly a deterministic information-flow problem and that the part that is not is identifiable in advance. The open problem is Taint’s residual gap under adaptive indirect injection, and the claim about that gap is qualitative, since the WunderOS deterministic evaluators are a prototype with no published measurement and the only quantitative evidence in the literature is from the model-judge formulation this note delaminates away from.

AgentVisor (arXiv:2604.24118) is the anchor: semantic virtualization against prompt injection, with the STI protocol and a model judge auditing each proposed tool call. The present note adopts STI’s structure and replaces its judge.

PACT (arXiv:2605.11039) develops argument-level provenance for agent security, and it is the load-bearing anchor for the Integrity check. The recipient-swap failure is an argument-level provenance violation in PACT’s terms.

AgentArmor (arXiv:2508.01249) runs program analysis over the agent’s runtime trace, building a program-dependence graph and reasoning about flows on it. The causal record that goal-provenance queries is a close relative, specialized to goal reachability rather than general dependence.

RTBAS (arXiv:2502.08966) attaches information-flow-control labels to data and propagates them for taint and integrity. It is the closest formal lineage for treating both Integrity and Taint as label propagation, and the deterministic checks here are a substrate-specific realization of that idea.

Solver-aided policy compliance (arXiv:2603.20449) uses an SMT solver for numeric-constraint integrity, which is the deterministic-evaluator family the WunderOS checks belong to, applied to a different fragment of the constraint space.

The intellectual lineage is classical information-flow control and taint analysis. Treating tool-argument integrity as taint, and goal-alignment as a flow that can be tainted by ingested content, is the same move those literatures made for data and control flow in conventional programs.

PLRN 004 concerns itself with kinetic control of unmodified agents, the controllable surface being the agent’s environment. This note is concerned with the information-flow integrity of the tool calls an agent emits. The two are separate claims with separate mechanisms: one shapes what the agent does, the other audits what the agent’s calls carry.

A note on method

Drafted in conversation with Claude Opus 4.8 (Anthropic) as structured interlocutor and prose editor. The framework, claims, and architectural commitments are mine.

Kendall Clark · k@pentad.ai
—Great Falls, Virginia
June 2026