Abstract
An LLM agent can recite a behavioral rule on demand and still violate it moments later. We study reflex-class violations: failures that are not knowledge gaps but control gaps, in which the correct behavior is known yet not active at the moment of decision. The evidence base is a long-running, instrumented agent-orchestration deployment whose antipattern registry documents 62 such rules, each traced to a dated incident. We make four contributions. (1) We characterize reflex-class violations empirically and show that they cluster in the rule categories an agent's assertions fall into (communication, verification, honesty) rather than in resource-access safety; the enforcement gradient tracks gateability, not importance. (2) We locate the operative variable through six pilots: prompt strength is not it; prompt presence is, but only for rule-sensitive models. A capable model fabricates a needed value in every trial when no rule is present, complies whenever the rule appears anywhere in context, even buried behind distractor turns, and is unmoved by escalating emphasis. A reproducible compaction probe on open-weight models then shows that this sensitivity is itself model-dependent: for some capable-enough models the in-context rule carries no enforcement weight at any position; one fabricates regardless, another abstains regardless. Whether an instruction controls anything is therefore a per-model quantity, an enforcement effect, to be measured rather than assumed. An exploratory follow-up then tunes a model to the threshold where the rule does act and finds the effect to be a cliff: it collapses within fifty tokens of distance, and the agent's own context compaction erases it outright.
A final contrast isolates the cause: holding compaction fixed and varying only the fidelity of the rule returned to context, a verbatim rule still leaks while a vague post-compaction paraphrase fabricates as often as no rule at all. Fidelity at the decision point, not capability or the rule's prior presence, governs the reflex. We report these open-weight findings as hypothesis-generating, pending pre-registered replication. (3) We give a reproducible reforging-audit method, exhibit the deployed enforcement stack it produced, and quantify the deployment's conversion pipeline: incident to advisory rule to mechanical gate, promoted by a three-strike rule whose depth is set to a measured 0.1 percent over-block budget. (4) We argue that in multi-agent systems reliability comes not from adding verifier agents, each of which is another reflex surface with correlated failures, but from provenance-typed handoff contracts that a non-agent mediator checks mechanically at every seam. The unifying frame is classical (complete mediation and the reference monitor) applied to a new object. The mediated object is the agent's own output, so the agent is at once the subject the monitor controls and part of the object it inspects. That reflexivity is why an agent may author a gate but must never be one, and, one level up, why an agent cannot be its own auditor, a constraint we make operational with a tamper-evident measurement instrument whose numbers a recipient re-derives rather than trusts. Scope: the empirical results and the enforcement mechanism concern assertions that carry a mechanically checkable type, such as a URL, a path, or a resource handle; open-ended generative output, which has no such anchor, is out of scope.
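To make the idea of a non-agent mediator concrete, the following is a minimal sketch of a provenance check on typed assertions, not the deployed enforcement stack. It assumes that "provenance" can be approximated as "the token appeared verbatim in a tool output the agent observed"; the names `Handoff`, `extract_typed`, and `mediate` and the token pattern are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for mechanically checkable tokens:
# http(s) URLs and absolute POSIX-style paths. Resource handles are omitted.
TYPED_TOKEN = re.compile(
    r"https?://[^\s\"'<>)]+"              # URL
    r"|(?<![\w:/])/[\w.\-]+(?:/[\w.\-]+)*"  # absolute path not inside a URL or word
)


@dataclass(frozen=True)
class Handoff:
    """An agent's outgoing message plus the raw evidence it observed (assumed shape)."""
    text: str
    evidence: tuple[str, ...]


def extract_typed(text: str) -> set[str]:
    """Collect URL and path tokens, dropping trailing sentence punctuation."""
    return {m.rstrip(".,;:") for m in TYPED_TOKEN.findall(text)}


def mediate(handoff: Handoff) -> list[str]:
    """Return typed tokens asserted without provenance; an empty list means pass.

    The check is plain string logic with no model call, so it cannot
    share the agent's reflex failures.
    """
    observed: set[str] = set()
    for item in handoff.evidence:
        observed |= extract_typed(item)
    return sorted(extract_typed(handoff.text) - observed)


if __name__ == "__main__":
    h = Handoff(
        text="Docs are at https://example.com/docs and config is in /etc/app/conf.yaml.",
        evidence=("GET https://example.com/docs -> 200",),
    )
    # The path was never observed in any evidence, so the mediator flags it.
    print(mediate(h))
```

Because the check is a set difference over extracted tokens, it only applies to output that carries such a checkable type, which matches the scope restriction stated above.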
Bullet Summary
- The paper investigates reflex-class violations in Large Language Model (LLM) agents, where an agent knows a behavioral rule but fails to follow it at decision time: a control gap rather than a knowledge gap.
- An empirical study of a long-running agent-orchestration deployment draws on a registry of 62 such rules, each traced to a dated incident; violations cluster in categories like communication, verification, and honesty rather than resource-access safety, and enforcement aligns with gateability rather than rule...
- Through six pilot studies, the authors find that prompt presence influences rule compliance in rule-sensitive models, but prompt strength does not; compliance varies by model, with some models ignoring context rules regardless of position or emphasis.
- A reproducible compaction probe on open-weight models shows that the enforcement effect is model-dependent; an exploratory follow-up finds that, where the rule does act, the effect collapses within fifty tokens of distance and depends on how faithfully the rule is presented at decision time.
- They develop a reproducible reforging-audit methodology, implement an enforcement stack with a three-strike policy balancing blocking depth against a 0.1% over-block budget, and quantify the pipeline from incidents to mechanical gates.
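The promotion step in the last bullet can be sketched as follows. This is one possible reading, not the authors' pipeline: it assumes a rule moves from advisory to mechanical gate after three recorded violations, and only if the candidate gate's over-block rate, measured on a sample of benign outputs, stays within the 0.1 percent budget. The source does not spell out how the budget and the strike count interact, so the `Rule` type, the function names, and the benign-sample inputs are all hypothetical.

```python
from dataclasses import dataclass

STRIKES_TO_PROMOTE = 3       # "three-strike" threshold, from the text
OVER_BLOCK_BUDGET = 0.001    # 0.1 percent, from the text


@dataclass
class Rule:
    """A registry entry (assumed shape): starts advisory, may become a gate."""
    name: str
    stage: str = "advisory"  # "advisory" or "gate"
    strikes: int = 0


def over_block_rate(blocked_benign: int, total_benign: int) -> float:
    """Fraction of known-benign outputs that the candidate gate would block."""
    if total_benign <= 0:
        raise ValueError("need a non-empty benign sample")
    return blocked_benign / total_benign


def record_violation(rule: Rule, blocked_benign: int, total_benign: int) -> Rule:
    """Count one incident and promote the rule if both conditions hold."""
    rule.strikes += 1
    within_budget = over_block_rate(blocked_benign, total_benign) <= OVER_BLOCK_BUDGET
    if rule.stage == "advisory" and rule.strikes >= STRIKES_TO_PROMOTE and within_budget:
        rule.stage = "gate"
    return rule


if __name__ == "__main__":
    rule = Rule("no-unverified-urls")
    for _ in range(3):
        # Hypothetical measurement: 1 of 2000 benign outputs would be blocked.
        record_violation(rule, blocked_benign=1, total_benign=2000)
    print(rule.stage, rule.strikes)
```

Under this reading, a rule that keeps being violated but whose gate would over-block stays advisory until the gate is narrowed.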