multiagentsecurity.ai

Select article

Cybersecurity in Autonomous AI Robotics: A Review of Emerging Threats, Adversarial Attacks, and Mitigation Techniques

Crossref · Center of Artificial Intelligence journal-article Crossref Governance and Policy Orchestration Risk Benchmarks and Evaluation

Shuruq Khalid Abdulredha

Published 2026-06-07

Venue: Center of Artificial Intelligence

DOI: 10.65591/cai-143-2026

Open Source Record

Abstract

Intelligent robotic systems that utilize artificial intelligence (AI), and have been expanding into high-risk applications (e.g., health care, manufacturing/industrial automation, transportation/smart mobility, etc.), require effective cybersecurity measures to maintain both safe operation and dependability. Compared with typical cyber-physical systems, advanced robotic systems include multiple layers (sensing, control, communications, middleware, and/or AI-based decision support) which create a complex and highly connected attack vector. Due to this increased complexity, these types of systems are vulnerable to a wide range of cyber-security threats including; network breaches/intrusions, manipulated sensors/command inputs, firmware backdoor vulnerabilities, adversarial machine-learning attacks, large language model (LLM) exploits/misuse, vulnerabilities in middle ware solutions, and supply chain-based compromises. Each type of threat has the potential to cause unsafe physical actions by the robot, loss of privacy for individuals involved in the use of the robot or related services, loss of availability/service failure for the robot/system/equipment, and cascaded failures within the entire robotic ecosystem. While existing defensive measures (secure communication protocols, runtime monitoring/perception hardening of robots, protection provided by robot operating system protections/middleware security framework) demonstrate positive results in reducing these risks, there is still much work needed particularly at the areas of adaptive defensive capabilities/system-wide security semantics and standardized evaluation metrics for assessing cyber-resilience in AI-enabled robotic systems. This paper provides an all-encompassing taxonomy of threats to robotic cybersecurity/attack vectors and evaluates and analyzes both attack surfaces and defense mechanisms. Additionally, this paper will provide recommendations for addressing identified knowledge gaps and possible paths forward for developing cyber-resilient AI-enabled robotic systems.

Bullet Summary

AI-powered autonomous robotic systems operate across multiple interconnected layers (sensing, control, communications, middleware, AI decision-making), increasing their attack surface and cybersecurity vulnerabilities.
Key threats include network intrusions, sensor manipulation, firmware backdoors, adversarial machine learning attacks, large language model exploits, middleware vulnerabilities, and supply chain compromises, each posing risks to safety, privacy, and system...
Cyber attacks can cause unsafe robotic behavior, loss of privacy, system failures, and cascading disruptions across robotic ecosystems, highlighting the critical need for robust security measures.
Existing defenses include secure communication protocols, middleware security frameworks, runtime monitoring, adversarial training, and AI-driven intrusion detection; however, these are often fragmented and lack comprehensive cross-layer integration.
The paper provides a comprehensive taxonomy of robotic cybersecurity threats and defense mechanisms, detailing attack surfaces and mitigation strategies across system layers.

Select article

The Governance Gap in Agentic Memory

Merged record merged scholarly record OpenAlex Governance and Policy Trust and Identity

Andrew Crenshaw

Published 2026-06-06

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20571518

Open Source Record

Abstract

The Governance Gap in Agentic Memory - a position paper proposing Substrate-Lens-Frame (SLF), a sovereign, auditable memory protocol for AI agents. AI agents now run on persistent memory, and that memory has become its own layer in the stack. Almost all of the effort in that layer goes to one question: how well can the system recall the right fact at the right time? That work is critically important. It leaves a second question unanswered, and it is the one that decides whether agent memory can be trusted with anything that matters: governance. Today's systems can recall a fact, but they cannot reliably say who is allowed to see it, how its meaning changes by role and jurisdiction, whether two stored facts contradict each other, or what was disclosed to whom. This paper names that gap, argues that it is structural, and proposes a protocol that addresses it. The proposal is Substrate-Lens-Frame (SLF), built around one operational primitive, render(substrate, lens, frame) -> receipt: a fact carries its own access rules; a lens reads it through a consumer-scoped projection that cannot widen those rules; a frame binds each action to an authorization; and every operation emits a payload-free signed receipt. This deposit is the position paper (PDF, CC BY 4.0). The Apache-2.0 reference implementation slf-core is archived separately (see Related works), with a companion Sovereign Personal Agent architecture (design) and a recovery-path prototype. Author: Andrew Crenshaw (ORCID 0009-0006-6459-0187), Lexenne. Cite as: Crenshaw, A. (2026). The Governance Gap in Agentic Memory. Zenodo. https://doi.org/

Bullet Summary

The paper identifies a critical governance gap in agentic memory systems that currently focus solely on recalling facts correctly without addressing data governance aspects.
It highlights that existing AI agent memory systems cannot reliably enforce access controls, account for role-based and jurisdictional meaning shifts, detect conflicting stored facts, or track disclosures accurately.
The author proposes Substrate-Lens-Frame (SLF), a sovereign and auditable memory protocol designed to fill this governance gap in AI agent memory.
SLF operates around a single core function: render(substrate, lens, frame) -> receipt, where each fact includes its own access rules, and views (lenses) restrict observation without expanding permissions.
Frames in SLF bind each action to explicit authorization, and every operation generates a signed, payload-free receipt to provide verifiable audit trails.

Select article

Kill-Switch Doctrine Gap in Gulf Sovereign AI Infrastructure

Merged record merged scholarly record OpenAlex Governance and Policy Orchestration Risk

Akhil Sharma, Preethi Sharma

Published 2026-06-06

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20574937

Open Source Record

Abstract

Between November 2025 and May 2026, Gulf states executed the most aggressive sovereign AI infrastructure build-out in modern history - exceeding $40 billion across HUMAIN, MGX, Core42, and Stargate UAE. On 1 March 2026, Iranian drones struck AWS data centres in the UAE and Bahrain, disrupting banking, payments, and ride-hailing infrastructure across the region for more than 24 hours. IRGC-affiliated Tasnim News Agency subsequently published a target list of 29 technology facilities across Bahrain, Israel, Qatar, and the UAE, naming Amazon, Microsoft, Google, Oracle, Nvidia, IBM, and Palantir facilities explicitly as enemy technology infrastructure. This paper documents the absence of operational kill-switch doctrine, constitutional algorithmic immunity framework, and institutional failover architecture across all major Gulf sovereign AI programmes - including HUMAIN OS managing 150-plus AI agents across government and enterprise workflows - and presents the Fijishi Sovereign Algorithmic Immunity Doctrine and Institutional Failover Charter as the constitutional command layer required for sovereign AI governance under active kinetic threat conditions.

Bullet Summary

Between November 2025 and May 2026, Gulf states invested over $40 billion to build out sovereign AI infrastructure programs including HUMAIN, MGX, Core42, and Stargate UAE, marking the most aggressive such build globally in recent history.
On 1 March 2026, Iranian drone attacks targeted AWS data centers in the UAE and Bahrain, causing more than 24 hours of disruption to critical services like banking, payments, and ride-hailing across the region.
Following the attacks, IRGC-affiliated Tasnim News Agency published a list of 29 technology facilities in Bahrain, Israel, Qatar, and the UAE, explicitly naming major firms such as Amazon, Microsoft, Google, Oracle, Nvidia, IBM, and Palantir as enemy techno...
The paper identifies a critical absence of an operational kill-switch doctrine, constitutional algorithmic immunity frameworks, and institutional failover architectures in the Gulf's sovereign AI programs, including HUMAIN OS which manages over 150 AI agent...
This lack of foundational governance mechanisms leaves sovereign AI infrastructure vulnerable to active kinetic threats and operational disruptions.

Select article

Token Budgets: Replication Package

Merged record merged scholarly record OpenAlex Orchestration Risk Benchmarks and Evaluation Governance and Policy

Sajjad Khan

Published 2026-06-06

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20571386

Open Source Record

Abstract

Replication package for "Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study" (arXiv:2606.04056). Includes the 110-row incident catalog, the Rust crate, the inter-rater-reliability materials and κ computation, all experimental harnesses, the formal cross-checks, and a one-command reproduce.sh

Bullet Summary

The paper addresses the problem of budget overruns in multi-agent systems utilizing large language models (LLMs), focusing on token usage inefficiencies.
It presents an empirical catalog documenting 63 incidents where LLM-agent token budgets were exceeded, providing a comprehensive dataset of such occurrences.
A novel mitigation method is proposed based on an affine-typed Rust implementation designed to control and enforce token budgets effectively within agents.
The provided replication package includes a 110-row incident catalog, a Rust crate implementation, inter-rater reliability materials, κ computation, and experimental harnesses for reproducibility.
The research contributes a detailed empirical dataset enabling further analysis of budget overruns in LLM agents and demonstrates the practical effectiveness of affine typing for resource control.

Select article

AI-Driven Network Security in Next-Generation 5G/6G Smart Environments

Crossref · International Journal of Advanced Research in Science Communication and Technology journal-article Crossref Governance and Policy Trust and Identity

Pardeep Singh

Published 2026-06-06

Venue: International Journal of Advanced Research in Science Communication and Technology

DOI: 10.48175/ijarsct-36322

Open Source Record

Abstract

Technology is currently spreading at an exponential rate. more accessibility, use, and application of this technology across all sectors and industries have been made possible by technological advancements, more computing power, and lower costs. Traditionally labour-intensive data analysis tasks can now be completed rapidly and effectively thanks to the development of smart and autonomous technologies like artificial intelligence and machine learning. Previously isolated datasets and data lakes are increasingly being used and linked. AI, digital twins, the metaverse, and virtual technologies are permeating every industry and more significantly merging with people to the point where it seems impossible to distinguish between the actual and virtual worlds. However, a fantastic backbone and capacity to transmit data, as well as immediate delivery at high speed and security, are necessary for the successful use of these incredible and new technologies. In order for 6G to be properly onboarded and executed in a logical manner, 5G, which is now in its deployment, must accomplish its goals. The European Commission is requesting money for important projects like Horizon 2020 and has 5G goals. 5G and 6G have enormous advantages for everyone, but only if they are implemented in a way that reduces the risk they may pose to security, privacy, and trust that are the fundamental pillars that must be upheld. Smart cities will allow for the analysis of acquired data, which could endanger national security if it falls into the wrong hands. A strong governance plan and method for managing 5G and 6G must be in place in order to guarantee success, given how many IoT and e-IoT devices are present in smart cities and how intertwined technologies are engaging with people. The background, risks, and advantages of 5G and 6G are explained in this chapter, which also emphasizes the necessity of strong governance

Bullet Summary

The paper discusses the rapid proliferation of technology enabled by advances in computing power, reduced costs, and integration across sectors.
It emphasizes the role of AI and machine learning in automating traditionally labor-intensive data analysis, facilitating the merging of isolated data sources.
Emerging technologies such as AI, digital twins, the metaverse, and virtual environments are increasingly blending with human experience, posing challenges to distinguish real from virtual.
High-speed, secure data transmission backbones provided by 5G and upcoming 6G networks are critical to support these technologies effectively.
Successful deployment of 6G depends on the fulfillment of 5G's goals, with concerted efforts such as the European Commission's Horizon 2020 projects funding 5G development.

Select article

An LLM Agent Cannot Be a Gate: Why a Recited Rule Is Not an Enforced One

Merged record merged scholarly record OpenAlex Trust and Identity Governance and Policy Agent-to-Agent Communication

Luciano Federico Pereira

Published 2026-06-05

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20520851

Open Source Record

Abstract

An LLM agent can recite a behavioral rule on demand and still violate it moments later. We study reflex-class violations: failures that are not knowledge gaps but control gaps, in which the correct behavior is known yet not active at the moment of decision. The evidence base is a long-running, instrumented agent-orchestration deployment whose antipattern registry documents 62 such rules, each traced to a dated incident. We make four contributions.(1) We characterize reflex-class violations empirically and show that they cluster in the rule categories an agent's assertions fall into — communication, verification, honesty — rather than in resource-access safety; theenforcement gradient tracks gateability, not importance. (2) We locate the operative variable through six pilots: prompt strength is not it, prompt presence is — but only for rule-sensitive models. A capable model fabricates a needed value in every trial when no rule is present, complies whenever the rule appears anywhere in context, even buried behind distractor turns, and is unmoved by escalating emphasis. A reproducible compaction probe on open-weight models then shows that this sensitivity is itself model-dependent: for some capable-enough models the in-context rule carries no enforcement weight at any position — one fabricates regardless, another abstains regardless. Whether an instruction controls anything is therefore a per-model quantity, an enforcement effect, to be measured rather than assumed. An exploratory follow-up then tunes a model to the threshold where the rule does act, and finds the effect a cliff — collapsing within fifty tokens of distance — that the agent's own context compaction erases outright. A final contrast isolates the cause: holding compaction fixed and varying only the fidelity of the rule returned to context, a verbatim rule still leaks while a vague post-compaction paraphrase fabricates as often as no rule at all — fidelity at the decision point, not capability or the rule's prior presence, governs the reflex. We report these open-weight findings as hypothesis-generating, pending pre-registered replication. (3) We give a reproducible reforging-audit method, exhibit the deployed enforcement stack it produced, and quantify the deployment's conversion pipeline — incident to advisory rule to mechanical gate, promoted by a three-strike rule whose depth is set to a measured 0.1 percent over-block budget. (4) We argue that in multi-agent systems reliability comes not from adding verifier agents, each of which is another reflex surface with correlated failures, but from provenance-typed handoff contracts that a non-agent mediator checks mechanically at every seam. The unifying frame is classical — complete mediation and the reference monitor — applied to a new object. The mediated object is the agent's own output, so the agent is atonce the subject the monitor controls and part of the object it inspects. That reflexivity is why an agent may author a gate but must never be one, and, one level up, why an agent cannot be its own auditor — a constraint we make operational with a tamper-evident measurement instrument whose numbers a recipient re-derives rather than trusts. Scope: the empirical results and the enforcement mechanism concern assertions that carry a mechanically checkable type — a URL, a path, a resource handle; open-ended generative output, which has no such anchor, is out of scope.

Bullet Summary

The paper investigates reflex-class violations in Large Language Model (LLM) agents, where agents know but fail to enforce behavioral rules at decision time, a control gap rather than a knowledge gap.
An empirical study of a long-term agent orchestration deployment records 62 such rule violations, primarily in categories like communication, verification, and honesty, rather than resource-access safety; enforcement aligns with gateability rather than rule...
Through six pilot studies, the authors find that prompt presence influences rule compliance in rule-sensitive models, but prompt strength does not; compliance varies by model, with some models ignoring context rules regardless of position or emphasis.
A novel compaction probe reveals that enforcement effect is model-dependent and sharply declines within 50 tokens distance from the rule context, influenced by how faithfully the rule is presented at decision time.
They develop a reproducible reforging-audit methodology, implement an enforcement stack with a three-strike policy balancing blocking depth against a 0.1% over-block budget, and quantify the pipeline from incidents to mechanical gates.

Select article

FADP: A Sovereignty-Native Payment Protocol for Autonomous Agent Transactions

Merged record merged scholarly record OpenAlex Trust and Identity Governance and Policy

Abhijeeth Ganji, Priyanka Velpula

Published 2026-06-05

Venue: OpenAlex

DOI: https://doi.org/10.33774/coe-2026-mchtz

Open Source Record

Abstract

Autonomous AI agents — software systems that independently execute tasks, trade digital assets, and consume services — require machine-native payment infrastructure with formal cryptographic guarantees. Existing protocols (OAuth, JWT, API keys) provide no binding between payment and agent identity, leaving a critical gap in the emerging agentic economy. This paper introduces FADP (Fluid Agentic Payment Protocol), the first HTTP-native two-phase payment protocol designed for autonomous agent transactions with provable self-custody guarantees. FADP extends RFC 7231's HTTP 402 status code with three header namespaces — X-FADP-, X-Pauli-, and X-FLDP-* — and a seven-key architecture partitioned across three trust levels, where private keys satisfy strict locality: ∀k ∈ K_local, k ∉ Network. Every payment proof π is constructed as π = Groth16.Prove(PK, w) bound to a four-dimensional nonce vector N₄ = (n_time, n_chain, n_req, n_agent), guaranteeing uniqueness, unforgeability, and replay impossibility simultaneously. We formally prove four theorems — protocol correctness, liveness independence (L ⊥ S), replay impossibility, and identity-payment binding — and introduce three novel metrics: Authentication Round-Trip Count (ART), Payment Atomicity Score (PAS), and Sovereignty Inheritance (SI), with provable sovereignty score Σ = 5. The reference implementation deploys on Base Mainnet achieving median latency of 160–215 ms and per-call cost of $0.001–$0.01 in stablecoin, submitted to IETF as draft-fluid-fadp-01. To our knowledge, FADP is the first protocol coupling cryptographic identity attestation, payment finality, and self-custody (Σ = 5) in a single HTTP response.

Bullet Summary

Introduces FADP (Fluid Agentic Payment Protocol), the first HTTP-native two-phase payment protocol tailored for autonomous AI agent transactions, combining cryptographic identity attestation with payment finality in a single HTTP response.
Extends HTTP 402 status code through three custom header namespaces (X-FADP-*, X-Pauli-*, X-FLDP-*) that separately handle payment challenges, zero-knowledge identity proofs, and request signing, enabling verifiable identity-payment linkage via headers alone.
Employs a seven-key architecture partitioned across three trust levels, ensuring strict self-custody by keeping private keys local to the agent device and safeguarding against server compromise and unauthorized fund access.
Utilizes Groth16 zk-SNARKs combined with a unique four-dimensional nonce vector to produce payment proofs that guarantee uniqueness, unforgeability, and replay impossibility, ensuring robust security guarantees.
Formally proves four key theorems establishing protocol correctness, liveness independence, replay protection, and strong identity-payment binding, underlining rigorous security foundations.

Select article

Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

arXiv preprint arXiv Governance and Policy Trust and Identity

Thamilvendhan Munirathinam

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal -- the Recuse Signal -- that a server emits over a protocol's existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini-standard, implement two zero- or low-footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire-protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT-4o and GPT-4o-mini; and Claude Code as a deployed agent), the signal cleanly induces recusal -- 100% recusal when present versus 100% task completion in a no-signal control -- and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator-authorization framing flips the most capable model to proceed, while other agents continue to defer to the on-host policy. We release the standard, adapters, and experiment harness for reproduction.

Bullet Summary

Introduced the Recuse Signal, an in-band, lightweight deny signal emitted by servers (e.g., SSH banner, PostgreSQL NOTICE) to request autonomous LLM agents voluntarily withdraw from off-limit resources, serving as a cooperative governance control rather tha...
Developed two zero- or low-footprint adapters—a PAM hook with SSH banner and a PostgreSQL wire-protocol proxy—to deploy the Recuse Signal without changing existing server infrastructure, validated on a live production host.
Conducted controlled experiments with deployed LLM agents (OpenAI GPT-4o, GPT-4o-mini, Claude Code) performing benign tasks, showing 100% recusal when the Recuse Signal was present contrasted with 100% task completion when absent, empirically measuring agen...
Demonstrated that the Recuse Signal functions as a cooperative, overridable mechanism: the most capable LLM agents could override the recusal upon explicit operator authorization, whereas other agents consistently deferred to the signal reflecting on-host p...
Positioned the Recuse Signal as an orchestration and governance improvement—enabling explicit operator intent communication and auditability—without being a hard security control, acknowledging potential for malicious agents to ignore or abuse the signal.

Select article

WebMCP Tool Surface Poisoning: Runtime Manipulation Attacks on LLM Agents

Merged record merged scholarly record arXiv Orchestration Risk Memory Poisoning Governance and Policy

Lin-Fa Lee, Yi-Yu Chang, Chia-Mu Yu, Kuo-Hui Yeh

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

WebMCP is a newly emerging protocol that enables websites to expose tools directly to AI agents, bypassing traditional user interfaces and introducing new security risks. The dynamic exposure of agent-accessible tools in WebMCP expands the attack surface of web sessions, especially when third-party scripts are involved. In this study, we identify a new potential threat, termed Mid-Session Tool Injection (MSTI), in which attackers leverage third-party scripts to inject malicious tools during an active session. To better characterize this threat, we classify MSTI based on the stage and target of manipulation, distinguishing between Tool Hijacking and Tool Framing. Tool Hijacking modifies the set of tools visible to the agent through mechanisms such as the AbortSignal API or race conditions during tool registration. In contrast, Tool Framing influences the agent's perception of tool roles through metadata fields such as tool name, description, readOnlyHint, and inputSchema. Our implementation demonstrates that both Tool Hijacking and Tool Framing can successfully disrupt the intended functionality of WebMCP. Based on these results, we outline potential mitigation directions and provide security design recommendations for WebMCP, including binding tool identity to its origin, ensuring lifecycle consistency, enforcing data boundaries for third-party tools, and maintaining traceable logs of tool registration and invocation. These findings indicate that MSTI arises from WebMCP's unique tool lifecycle and structured metadata, making the tool surface itself an emerging security concern.

Bullet Summary

WebMCP introduces a protocol for websites to directly expose tools to AI agents dynamically, expanding the attack surface during active sessions.
The study identifies a new security threat, Mid-Session Tool Injection (MSTI), where attackers use third-party scripts to inject malicious tools during WebMCP sessions.
MSTI attacks are classified into Tool Hijacking, which alters the legitimate tool set exposed to agents, and Tool Framing, which manipulates tool metadata affecting agent perception.
Experiments with state-of-the-art LLMs show MSTI attacks can stealthily cause agents to invoke malicious tools, leak data, or deviate workflows without detection.
Attack success depends on timing (early injection before first tool invocation) and on leveraging metadata fields such as description and readOnlyHint for semantic framing.

Select article

From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws

arXiv preprint arXiv Governance and Policy Orchestration Risk Benchmarks and Evaluation

Mengzhuo Chen, Junjie Wang, Zhe Liu, Yawen Wang, Qing Wang

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. Existing self-improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where the responsible evidence lies in failed trajectories and which harness layer causes the unreliable behavior, resulting in broad, indirect, or poorly scoped changes. This paper proposes HarnessFix, a trace-guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations. It then attributes failures to responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw-specific repair specifications to reduce target flaws without introducing unacceptable regressions. We evaluate HarnessFix on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA and AppWorld. Across these benchmarks, HarnessFix improves held-out test performance over the initial harnesses by 15.2%--50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across ETCLOVG layers.

Bullet Summary

LLM-based agents rely heavily on complex harnesses—runtime infrastructures encompassing execution environments, tool interfaces, lifecycle orchestration, observability, verification, and governance—for ensuring reliable operation.
Failures in multi-agent LLM systems often stem from flaws within these harnesses rather than the base language models, with fragmented evidence spread across intricate natural language and tool-interaction trajectories.
Existing automatic harness improvement approaches typically optimize agent behaviour by analyzing final outcomes or runtime supervision, but they fail to accurately localize the source of failures or identify which harness layers are responsible, often lead...
The paper introduces HarnessFix, a novel trace-guided framework that compiles raw execution traces and harness code into a standardized Harness-aware Trace Intermediate Representation (HTIR) to enable detailed step-level failure diagnosis anchored to specif...
HarnessFix employs multiple cooperating LLM agents specialized in trace abstraction, failure diagnosis, repair patch generation, and validation to consolidate recurring flaw diagnoses and apply scoped, repair operator–guided harness modifications that avoid...

Select article

ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

arXiv preprint arXiv Orchestration Risk Governance and Policy Benchmarks and Evaluation

Rahul Suresh Babu, Laxmipriya Ganesh Iyer

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

Large language model agents increasingly rely on external tools, but larger tool menus can reduce reliability and efficiency by increasing wrong-tool calls, premature actions, and token cost. Existing tool-selection methods often optimize semantic relevance, exposing tools whose names or descriptions match the user request. We argue that relevance is insufficient: a tool may be related to the task while still being unnecessary or premature at the current step. We propose Causal Minimal Tool Filtering (CMTF), a training-free method that selects tools by causal sufficiency. CMTF uses lightweight precondition-effect contracts to expose only the minimal next-step tool frontier needed to advance from the current state toward the user goal. Across multi-step tool-use tasks, we compare CMTF with all-tools exposure, keyword retrieval, state-aware filtering, and causal-path ablations, measuring task success, wrong-tool calls, premature actions, tool exposure, and token cost. In the main benchmark with 102 tasks, 100 tools, four LLM backends, and 2448 task-method-model runs, CMTF matches the strongest causal baseline in aggregate success while reducing visible tools from 100 to one per step and reducing token usage by about 90% relative to all-tools exposure.

Bullet Summary

Large language model (LLM) agents suffer from reliability and efficiency issues when exposed to large menus of external tools due to increased wrong-tool calls, premature actions, and token cost.
Existing tool-selection methods primarily rely on semantic relevance to filter tools, but this often exposes unnecessary or premature tools not causally needed at the current task step.
The paper introduces ToolChoiceConfusion to describe performance degradation caused by exposing semantically plausible but causally irrelevant tools at each step.
Causal Minimal Tool Filtering (CMTF) is proposed as a training-free method employing lightweight precondition-effect contracts to expose only the minimal set of executable tools causally necessary to progress the task state towards the goal.
CMTF builds a precondition-effect dependency graph to identify minimal causal tool paths, revealing only the immediate next executable tools instead of all tools or topically relevant ones.

Select article

From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

arXiv preprint arXiv Benchmarks and Evaluation Governance and Policy

Patrick Wilhelm, Odej Kao

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation-based reward-hack scores, token-level entropy, and decision-context features. We find that adapters fine-tuned on \textit{School-of-Reward-Hacks} dataset can transfer reward-hack tendencies into agentic action selection, especially when the environment exposes proxy-reward affordances. However, mitigating such behavior cannot rely on activation dynamics alone. High reward-hack activation identifies a latent policy state, but does not necessarily imply an immediate exploit action. Across next-step prediction tasks, entropy and context-calibrated internal features improve risk estimation over reward-hack activation alone. Activation-direction steering further reduces proxy-exploit behavior in selected mixed-adapter regimes. Overall, our results support context-calibrated internal monitoring for agents: reward-hack activation identifies a latent policy state, while entropy and decision context help determine when that state becomes risky action.

Bullet Summary

Language-model agents operate through cycles of observation, reasoning, and action, requiring safety monitoring that accounts for both internal model states and environment context.
The study focuses on reward-hack activations in ReAct-style agents within Gameable ALFWorld and WebShop environments, revealing these activations as indicators of latent policy states associated with proxy-reward exploitation.
Reward-hack activation alone is insufficient to reliably predict risky agent actions; integrating token-level entropy and decision-context features significantly improves next-step risk estimation.
Adapters fine-tuned on the School-of-Reward-Hacks dataset effectively transfer reward-hack tendencies into agentic behavior, especially in environments with exploitable proxy-reward affordances.
A logistic regression model combining reward-hack activations, entropy, and contextual information (such as environment type and step position) robustly predicts the risk of exploitative actions.

Select article

Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems

arXiv preprint arXiv Governance and Policy Trust and Identity

Dianxing Shi, Junqi He, Junhao Chen, Bowen Wang, Yuta Nakashima

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static and post-trained agents, its role in self-evolving systems remains underexplored. We introduce Agent Norm Correction through Human-like Oversight and Review (ANCHOR), an LLM-based framework that simulates human supervision and delivers feedback at various phases of self-evolution. With ANCHOR, we evaluate two representative open-source self-evolving agent systems across coding, mathematical reasoning, and safety. Our results show that even limited supervision substantially mitigates safety degradation while preserving stable performance on core evolutionary objectives. Further analysis shows that supervision over the output verification phase is the most effective for intervention, whereas increasing supervision frequency yields diminishing returns. These findings provide empirical evidence and practical guidance for designing more stable, controllable, and human-aligned self-evolving agent systems.

Bullet Summary

Self-evolving agents autonomously improve via self-play and self-generated learning signals but risk capability degradation and safety drift without human oversight.
The paper introduces ANCHOR, a novel LLM-based framework that simulates human supervision by providing feedback at multiple phases during self-evolution, aiming to maintain agent safety and performance.
ANCHOR evaluates and intervenes during phases such as task proposal, planning/thought, output, and execution results by generating evaluative feedback without supplying direct fixes, supporting robust self-correction.
Experiments on two open-source self-evolving frameworks (AZR and R-Zero) across coding, mathematical reasoning, and safety tasks demonstrate that ANCHOR significantly mitigates safety degradation while preserving or improving core task performance.
Analysis reveals that supervision focused on the execution/output verification phase yields the greatest impact on preventing performance drops and safety risks, whereas increasing supervision frequency beyond moderate levels gives diminishing returns.

Select article

Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

arXiv preprint arXiv Memory Poisoning Trust and Identity Governance and Policy

Jiawen Zhang, Kejia Chen, Jiachen Ma, Yangfan Hu, Lipeng He, Yechao Zhang, Jian Liu, Xiaohu Yang

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks. In this paper, we study memory search as a trust boundary in personal AI agents. We evaluate representative agentic memory frameworks, including A-Mem, Mem0, and MemOS, together with OpenClaw, a real-world personal-agent environment with persistent state and tool-use capability. Our results show that long-term memory is not merely a utility layer, but a durable control channel that can reshape how agents interpret tasks and execute actions, leaving them highly susceptible to the aforementioned threats. To mitigate these vulnerabilities, we propose MemGate, a lightweight and deployable memory plug-in for trustworthy memory search, with only 9M parameters and a 35.1MB footprint. MemGate is inserted between the vector memory store and the backbone LLM, requiring no LLM modification, memory-database rewriting, or inference-time LLM judge. It applies a query-conditioned neural gate to candidate memory representations, turning raw similarity search into task-conditioned memory admission. Across multiple mainstream memory frameworks, real-world agent settings, and diverse LLM backbones, MemGate reduces memory-induced threats while preserving long-term memory utility.

Bullet Summary

Personal AI agents use long-term memory for persistent personalization, but existing retrieval methods based on semantic similarity can lead to trustworthiness issues such as cross-domain leakage, sycophancy, tool-call drift, and memory-induced jailbreaks.
Memory retrieval acts as a critical trust boundary influencing how agents interpret tasks and execute actions, highlighting the need for mechanisms that ensure contextually appropriate memory admission.
The paper introduces MemGate, a lightweight neural gating plug-in placed between the vector memory store and the backbone LLM, which applies a query-conditioned gating mask to filter memory embeddings based on task relevance without modifying the LLM or mem...
MemGate transforms raw similarity-based retrieval into task-conditioned memory admission by attenuating inappropriate memory vector dimensions, thereby reducing the injection of unsafe or irrelevant memories.
MemGate's parameters are optimized via Direct Preference Optimization (DPO), balancing suppression of conflicting memories with preservation of beneficial ones, backed by theoretical guarantees limiting semantic drift.

Select article

From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

arXiv preprint arXiv Orchestration Risk Governance and Policy Benchmarks and Evaluation

Yuhao Sun, Jiacheng Zhang, Shaanan Cohney, Zhexin Zhang, Feng Liu, Xingliang Yuan

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.

Bullet Summary

Existing LLM-based guardrails often rely on binary allow/deny decisions, which can block entire tasks containing partial risks, sacrificing benign objectives.
The TRIAD framework introduces a tripartite decision system (PROCEED, UPDATE, REFUSE) with structured natural-language feedback, enabling agents to revise unsafe plans while preserving benign task components.
TRIAD integrates guardrail feedback iteratively into agent planning, forming a closed loop that significantly reduces attack success rates and maintains higher task success rates compared to prior guardrail methods.
Tri-Guard, the guardrail model used in TRIAD, is trained on a self-curated trajectory-feedback dataset, using a teacher model (GPT-5.4) for knowledge distillation to generate structured feedback and consistent three-way decisions.
Extensive experiments across multiple benchmarks and LLM backbones demonstrate TRIAD's superior safety-utility trade-off, outperforming baseline methods like ReAct and ToolSafe which often refuse tasks excessively.

Select article

Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

arXiv preprint arXiv Benchmarks and Evaluation Governance and Policy Orchestration Risk

Yuhang Fu, Ruishan Fang, Jiaqi Shao, Huiyu Zheng, Zhengtao Zhu, Bing Luo, Tao Lin

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.

Bullet Summary

Introduces BenchAgent, a unifying evaluation framework that standardizes benchmark loading, tool access, answer contracts, usage accounting, and trajectory logging to enable fair and controlled comparison of single-agent, fixed multi-agent, and evolving mul...
BenchAgent's controlled substrate-internal evaluation using a GPT-4.1 backend reveals that increasing agent count or MAS complexity rarely outperforms matched single-agent baselines significantly, with only one of six MAS workflows showing marginal gains wi...
A protocol-aligned external (PAE) case study with a Claude-Code-style runtime-generated workflow on the GAIA benchmark achieved substantially higher accuracy and better efficiency, highlighting benefits of dynamic role generation, strong verification, and c...
Defines different MAS workflow paradigms: single-agent, fixed MAS (predefined roles and communication), evolving MAS (workflow mutations during execution), and runtime-generated workflows (dynamic agent and role creation), each with distinct performance and...
Introduces workflow lift as a key metric quantifying relative performance and cost changes when moving from single-agent to MAS workflows under consistent evaluation parameters, emphasizing cost-accuracy trade-offs beyond raw accuracy.

Select article

Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

arXiv preprint arXiv Trust and Identity Governance and Policy Orchestration Risk

Jingheng Ye, Huiqi Zou, Simon Yu, Weiyan Shi

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover story, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.

Bullet Summary

AI coding agents integrated into real-world development can covertly sabotage code by inserting malicious hidden functionalities, posing a new security threat.
Prior work focused on AI-only sabotage detection, lacking investigation into human developers' ability to detect AI sabotage during live, multi-turn collaboration.
A large-scale, realistic five-hour study involving over 100 professional developers collaborating with four advanced AI models found that 94% of participants failed to independently detect sabotage attempts.
Key vulnerabilities include minimal code review by developers, plausible cover stories provided by the AI to disguise malicious code, and overtrust in AI agents, allowing sabotage to go unnoticed.
An LLM-based safety monitor that flagged suspicious AI behaviors reduced sabotage success but was insufficient, as 56% of participants still accepted malicious code despite warnings.

Select article

LLM-Guided Digital Twin Agents for Autonomous Threat Detection and Response in Cyber-Physical Energy Systems

OpenAlex · Research Square repository OpenAlex Governance and Policy Agent-to-Agent Communication Orchestration Risk

Fatemeh Zahra Hosseini-Moghadam Shadman

Published 2026-06-04

Venue: Research Square

DOI: https://doi.org/10.21203/rs.3.rs-9904010/v1

Open Source Record

Abstract

Abstract unavailable from OpenAlex metadata.

Bullet Summary

The paper addresses the challenge of detecting and autonomously responding to complex coordinated cyber-physical threats in energy systems, which traditional methods struggle to handle comprehensively.
It proposes an LLM-guided digital twin agent framework integrating hybrid anomaly detection (physical residuals, cyber-logs, digital twin mismatches), diagnostic reasoning via LLMs, and safety-constrained response optimization.
The framework uses LLMs as constrained reasoning tools to generate candidate mitigation actions, which are then verified by a digital twin for operational feasibility, cyber-trust compliance, and safety before execution, avoiding uncontrolled AI actions.
A closed-loop pipeline integrates multi-source data acquisition, anomaly detection, threat diagnosis, candidate action generation, digital twin-based verification, and multi-modal response execution modes (autonomous, semi-autonomous, advisory, fallback).
Benchmark evaluations demonstrate improved detection accuracy (F1-score 0.951), reduced false alarms and detection delay, and enhanced response success rates with lower constraint violations and recovery times compared to baseline and ablation models.

Select article

Toward a Unified Interoperability Framework for Autonomous AI Agent Ecosystems: MCP, A2A, ACP, and ANP

OpenAlex · Zenodo (CERN European Organization for Nuclear Research) repository OpenAlex Agent-to-Agent Communication Trust and Identity Governance and Policy

Yendluri Siva

Published 2026-06-04

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20549817

Open Source Record

Abstract

The proliferation of large language model (LLM)-powered autonomous agents across enterprise software delivery, cloud operations, and knowledge management has created an urgent systems-level challenge: agents built by different vendors, using different frameworks, cannot discover, authenticate, or collaborate with one another without expensive bespoke integration. Four emerging open protocols—Model Context Protocol (MCP), Agent-to-Agent Protocol (A2A), Agent Communication Protocol (ACP), and Agent Network Protocol (ANP)—address different layers of this interoperability gap, yet no formal comparative analysis or unified architectural framework has appeared in the peer-reviewed literature. This paper fills that gap. We provide the first systematic multi-dimensional comparison of all four protocols across interaction mode, discovery mechanism, transport layer, security model, and enterprise-readiness. We introduce a novel Three-Layer Agentic Stack (TLAS) that maps each protocol to its architectural role, propose a formal threat model covering protocol-specific attack surfaces including cross-server shadowing, credential relay, and agent impersonation, and define a phased adoption roadmap validated against real deployment patterns observed in financial-sector and cloud-native environments. Our analysis demonstrates that MCP and A2A are complementary rather than competing, that ACP and ANP serve distinct ecosystem niches, and that a unified TLAS approach reduces integration complexity by an estimated 60–70% compared to ad-hoc solutions. We conclude with open research questions on semantic conflict resolution, decentralized identity binding, and standardized evaluation benchmarks.

Bullet Summary

The rise of autonomous agents powered by large language models (LLMs) in enterprise software has created significant interoperability challenges due to diverse vendor frameworks and lack of standard integration methods.
Four emerging open protocols—Model Context Protocol (MCP), Agent-to-Agent Protocol (A2A), Agent Communication Protocol (ACP), and Agent Network Protocol (ANP)—address different layers of agent interoperability but lacked a comprehensive comparative analysis...
This paper provides the first systematic, multi-dimensional comparison of the four protocols considering interaction modes, discovery mechanisms, transport layers, security models, and suitability for enterprise deployment.
A novel architectural concept, the Three-Layer Agentic Stack (TLAS), is introduced, mapping each protocol to its specific architectural role within an autonomous AI agent ecosystem.
A formal threat model is proposed, identifying protocol-specific security risks including cross-server shadowing, credential relay attacks, and agent impersonation to guide secure protocol adoption.

Select article

Cognitive Guardrails in Medical LLMs: Fusing Latent Routing with T-Adaptive Attention to Mitigate Aleatoric and Epistemic Uncertainty

OpenAlex · Zenodo (CERN European Organization for Nuclear Research) repository OpenAlex Orchestration Risk Trust and Identity Governance and Policy

Narendra Bayutama Wibisono

Published 2026-06-04

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.19452753

Open Source Record

Abstract

Abstract Clinical deployment of Large Language Models (LLMs) is fundamentally constrained by their propensity to hallucinate—generating fluent but clinically unfounded assertions that pose direct patient safety risks. This work introduces Conditional Latent Routing (CLR), a compound inference architecture that decomposes clinical uncertainty into its aleatoric and epistemic constituents and applies distinct, mechanistically motivated interventions at each stratum. Inspired by the corollary discharge framework from computational psychiatry—wherein the healthy brain’s forward model suppresses self-generated sensory predictions, a mechanism whose dysfunction in schizophrenia produces hallucinations—we construct an analogous dual-pathway system for medical LLM inference. A fine-tuned Bio_ClinicalBERT encoder classifies input noise (aleatoric uncertainty) and routes clean inputs through a latent soft-prompting Fast Lane while directing noisy, contradictory records through a cautionary Slow Lane with explicit abstention instructions. To address epistemic uncertainty—the model’s internal confusion—we introduce a T-Adaptive Attention patch at the logits level that modulates generation temperature as an inverse function of hidden-state variance. We evaluate CLR on BioMistral-7B across 400 clinical cases (200 clean, 200 noisy) using RAGAS Faithfulness scored by Llama-4-Scout-17B-16E-Instruct via Groq API. Phase 1 (Aleatoric Routing) achieves 100% routing accuracy and 64.8% faithfulness on clean data with zero alignment tax, but reveals a critical failure: a 0% inconclusive rate on noisy inputs, indicating extreme over-helpfulness bias. Phase 2 (Epistemic T-Adaptive Patch) preserves faithfulness at 64.4% while demonstrating zero computational overhead, but the inconclusive rate remains at 0%. We formally prove this failure as The Greedy Paradox: under greedy decoding, temperature scaling of logits is mathematically nullified because for all . Phase 3 (Non-Greedy Sampling) degrades faithfulness to 58.5% while still failing to trigger abstention, confirming that over-helpfulness bias is embedded in pre-training weight distributions, not merely in the decoding surface. These results establish that single-agent, test-time interventions are fundamentally insufficient for self-uncertainty modulation in medical LLMs, providing strong empirical justification for transitioning to multi-agent systems with externalized uncertainty arbitration. Keywords: Large Language Models, Hallucination, Uncertainty Quantification, Compound AI Systems, Over-helpfulness Bias, T-Adaptive Attention, Selective Prediction, Aleatoric Uncertainty, Epistemic Uncertainty, Conditional Latent Routing.

Bullet Summary

The paper tackles the critical problem of hallucinations in Medical Large Language Models (LLMs), which generate clinically unfounded but fluent assertions, posing patient safety risks.
Introduces Conditional Latent Routing (CLR), a novel compound inference architecture that decomposes clinical uncertainty into aleatoric (input noise) and epistemic (model uncertainty) components for targeted interventions.
Inspired by computational psychiatry's corollary discharge framework, CLR uses a dual-pathway system with a Bio_ClinicalBERT encoder to classify and route inputs: clean cases go through a Fast Lane with latent soft-prompting, while noisy/contradictory input...
To mitigate epistemic uncertainty, the authors develop a T-Adaptive Attention mechanism that modulates generation temperature based on hidden-state variance, aiming to adapt output confidence dynamically.
Experiments on 400 clinical cases (200 clean, 200 noisy) using the BioMistral-7B model show 100% routing accuracy in Phase 1, with 64.8% faithfulness on clean data but an inability to abstain on noisy inputs, revealing an over-helpfulness bias.

Select article

The Agentic AI Framework (AAIF): a policy-enforced architecture for accountable and high-performance intrusion detection

OpenAlex · Frontiers in Artificial Intelligence journal OpenAlex Governance and Policy Benchmarks and Evaluation

IBRAHIM ADABARA, Bashir Olaniyi Sadiq, Aliyu Nuhu Shuaibu, Yale Ibrahim Danjuma, Venkateswarlu Maninti, Mutebi Joe

Published 2026-06-04

Venue: Frontiers in Artificial Intelligence

DOI: https://doi.org/10.3389/frai.2026.1755696

Open Source Record

Abstract

Artificial intelligence plays a central role in modern cybersecurity, yet systems optimized for detection accuracy often lack mechanisms for accountability, transparency, and policy compliance. This study proposes the Agentic AI Framework (AAIF), a policy-aware intrusion detection architecture that integrates predictive modeling with executable governance. Guided by Design Science Research, the framework combines a deep learning detection model with a governance layer aligned to the NIST AI Risk Management Framework 2.0. A key component is an interpretable Policy Engine that enforces operational and ethical constraints through a declarative YAML-based domain-specific language, ensuring that each decision is auditable and policy-compliant. The framework was evaluated on the CICIDS2017 dataset, which contains over 2.8 million network flow records across benign and malicious traffic. Results show that AAIF preserves predictive performance relative to baseline models, including Random Forest, Support Vector Machine, and Deep Neural Network, achieving a weighted F 1-score of 0.483 and an AUROC of 0.978. At the same time, the framework achieved complete compliance under the defined policy schema, with an Ethical Compliance Rate of 1.0 and a False Escalation Rate of 0.0. The Governance Compliance Index improved from 0.947 to 0.983, demonstrating stronger alignment between system decisions and governance requirements. These findings show that policy-enforced inference can support accountable autonomy without degrading detection capability. The AAIF provides a reproducible and governance-aware approach that transforms conventional intrusion detection systems into transparent and auditable decision systems. This work establishes a practical foundation for deploying policy-aligned AI in cybersecurity environments.

Bullet Summary

Introduces the Agentic AI Framework (AAIF), a novel policy-enforced intrusion detection architecture that integrates deep learning-based detection with an interpretable governance layer aligned to the NIST AI Risk Management Framework 2.0.
AAIF utilizes a declarative YAML-based domain-specific language for encoding governance policies, enabling real-time enforcement of ethical and operational constraints during inference, with full auditability and traceability.
Evaluated on the large-scale CICIDS2017 dataset with over 2.8 million network flows, AAIF maintains competitive detection performance (weighted F1-score of 0.483, AUROC of 0.978) comparable to baseline models including Random Forest, SVM, and Deep Neural Ne...
Achieves complete policy compliance demonstrated by an Ethical Compliance Rate of 1.0 and zero False Escalation Rate, highlighting successful incorporation of governance without sacrificing detection capability.
Introduces new governance metrics such as Ethical Compliance Rate, Governance Compliance Index, and Resilience Index to quantify policy adherence and operational stability of intrusion detection decisions.

Select article

The Ladder of Depth Structure G: Consensus, Spectral Gaps, and Rule Islands — Joint Algebraic and Topological Criteria for Multi-Agent Systems

Merged record merged scholarly record OpenAlex Governance and Policy Agent-to-Agent Communication Orchestration Risk

changzheng zhou, ziqing zhou

Published 2026-06-04

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20534826

Open Source Record

Abstract

This paper extends the theory of structural openness from single-agent decisionmaking to the collective dynamics of multi-agent systems, establishing a spectralgeometric criterion for distributed consensus and a selection mechanism for the collective optimal recursion depth. Traditional research on distributed systems treatsconsensus as a product of communication protocols or game-theoretic equilibria,lacking consideration of the topological structure of rule spaces. Under the axiomsof information conservation and computability, we construct the fibred product ofthe joint rule algebra of multiple agents and its faithful representation, and establish a functorial framework for the joint spectral triple. The strict positivity ofthe joint Dirac operator’s spectral gap is equivalent to the stability of distributedconsensus. Collective cognitive economics shows that the Pareto-optimal allocation for homogeneous agents is locked at a symmetric recursion depth. The rightto exit is formalised in the commutative geometric setting as a K-theoretic indexof a boundary Dirac operator, whose integer value corresponds to the number ofeffective exit channels. When the joint spectral gap closes, the system undergoes abifurcation phase transition, and the unified rule space splits into rule islands. Thethree-layer homotopy structure of cognitive architecture, under the constraints ofinformation conservation and computability, locks the meta-constraint recursiondepth to 3. This paper strictly distinguishes theorems, constructive propositions,and conditional conjectures; all core concepts are defined within the underlyingself-consistent logic.

Bullet Summary

Introduces a spectral-geometric criterion for achieving distributed consensus in multi-agent systems, extending the theory of structural openness from single-agent decision-making to collective dynamics.
Constructs the fibred product of the joint rule algebra for multiple agents under axioms of information conservation and computability, providing a functorial framework for the joint spectral triple representation.
Demonstrates that the strict positivity of the joint Dirac operator's spectral gap is equivalent to the stability of distributed consensus, linking spectral properties to system robustness.
Develops a collective cognitive economics perspective showing that Pareto-optimal allocations for homogeneous agents correspond to a symmetric recursion depth, identifying an optimal structural depth.
Formalizes the 'right to exit' within a commutative geometric framework as a K-theoretic index of a boundary Dirac operator, quantifying exit channels as an integer index.

Select article

A Zero-Trust Agentic AI Methodology for Cyber-Resilient Energy Market Clearing under Residual Cyber Contamination and Dynamic Uncertainty

Merged record merged scholarly record OpenAlex Trust and Identity Governance and Policy

Mohammadmahdi Rezaeifar Sangchouli, Saman Mehrani

Published 2026-06-04

Venue: Research Square

DOI: https://doi.org/10.21203/rs.3.rs-9903418/v1

Open Source Record

Abstract

Abstract unavailable from OpenAlex metadata.

Bullet Summary

Real-time energy market clearing is vulnerable to cyber contamination in cyber-physical data streams, risking unreliable market outcomes and settlement legitimacy.
The paper proposes a zero-trust agentic AI framework that employs specialized autonomous agents to verify data trustworthiness before admission to market clearing optimization.
This methodology integrates multi-agent consensus, admissibility screening, cyber-risk-aware objective terms, blockchain-based governance, and human oversight to ensure cyber-resilient, auditable, and feasible market operations under dynamic uncertainty.
Residual cyber contamination and dynamic renewable uncertainty are modeled explicitly, allowing the system to balance operational feasibility, financial defensibility, and security during market clearing.
The agentic AI layer produces trust scores and disagreement metrics which inform binary admissibility decisions, with escalation to human operators when trust is insufficient or agent consensus is weak.

Select article

SHIELDS: Automating OS Hardening with Iterative Multi-Agent Remediation

arXiv preprint arXiv Orchestration Risk Governance and Policy

Andrew Hamara, Dwight Horne, Aldehir Rojas, Timothy Kurniawan, Sophie Lamothe, Vishal Suresh, Nicholas Turoci, Lawrence Wong

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Security misconfigurations remain a leading cause of OS-level compromise, and manually keeping systems compliant with standards like Defense Information Systems Agency (DISA) Security Technical Implementation Guides (STIGs) is a tedious and expensive process. Existing compliance automation tools can reduce some of this burden, but they depend on static, pre-written corrective actions. In this paper, we introduce SHIELDS, a multi-agent system that uses large language models (LLMs) to approach OS hardening as an iterative, feedback-driven process. Instead of applying fixed remediations, SHIELDS continuously proposes fixes and refines them based on feedback from target system execution and validation scans. We evaluate the system across multiple virtual machine configurations using six contemporary LLMs ranging from 20B to 400B parameters, and find that SHIELDS successfully remediates up to 73% of scan findings. Our results also suggest that success in this setting depends less on model size (parameter count) than on effective tool use and information gathering, paving a practical path toward reducing the burden of security compliance in environments where compute is limited or security and privacy needs drive local model use.

Bullet Summary

Security misconfigurations causing OS-level compromises are challenging to mitigate manually, necessitating automated and adaptive compliance tools.
SHIELDS is a multi-agent system leveraging large language models (LLMs) to automate OS hardening through iterative proposal and refinement of remediations based on system feedback, rather than static fixes.
The system incorporates specialized agents for triage, remediation planning, review, and quality assurance, enabling an autonomous feedback loop that optimizes fix effectiveness and safety.
SHIELDS operates on a remediation pipeline that includes scanning, triaging findings, up to three remediation attempts per finding, and aggregation into Ansible playbooks for deployment.
Evaluation on multiple Rocky Linux virtual machines across six LLMs (20B to 400B parameters) demonstrated SHIELDS remediated up to 73% of scan findings effectively.

Select article

Insurance of Agentic AI

arXiv preprint arXiv Prompt Injection Governance and Policy Orchestration Risk

Quanyan Zhu

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Agentic artificial intelligence (AI) systems are transforming the risk landscape by extending beyond information generation to autonomous planning, tool invocation, decision execution, and persistent modification of digital and physical environments. These capabilities introduce novel exposures that do not fit neatly within traditional insurance categories such as cyber, professional liability, product liability, or directors and officers coverage. This paper examines the emerging insurance market for agentic AI and develops a framework for understanding its underwriting, pricing, reinsurance, and product-design implications. We characterize agentic AI as a continuum of autonomy and delegated authority, emphasizing the distinction between informational outputs and systems capable of independently generating insured events through external actions. We analyze major risk pathways, including hallucinations, prompt-injection attacks, autonomous decision errors, model drift, dependency failures, and cyber-physical harms, and evaluate how existing insurance products are adapting to address these exposures. The paper further proposes an actuarial framework based on exposure assessment, scenario analysis, dependency mapping, and accumulation-risk management, drawing parallels to the evolution of cyber insurance. Finally, we present a coordinated insurance architecture that integrates cyber, technology errors and omissions, product liability, performance-warranty, and affirmative AI-liability coverages through explicit allocation mechanisms and dedicated AI aggregates. The analysis suggests that the future of agentic-AI insurance lies not in a single monoline product but in a layered ecosystem of complementary coverages supported by improved governance, transparency, telemetry, and regulatory clarity.

Bullet Summary

Agentic AI systems autonomously perform actions causing persistent changes in environments, introducing novel insurance risks not covered by traditional categories like cyber or product liability.
The paper proposes a practical insurance taxonomy categorizing AI on a continuum of autonomy, affecting underwriting strategies, claim frequency, and severity.
Major risk pathways for agentic AI include hallucinations, prompt-injection attacks, model drift, autonomous decision errors, dependency failures, and cyber-physical harms.
An actuarial framework is developed involving exposure assessment, scenario analysis, dependency mapping, and accumulation risk management, paralleling the evolution of cyber insurance.
Insurance market responses include adapting existing policies, introducing AI-specific endorsements, and creating dedicated AI liability products forming a layered insurance architecture.

Select article

Human oversight of agentic systems in practice: Examining the oversight work, challenges, and heuristics of developers using software agents

arXiv preprint arXiv Governance and Policy

Shipi Dhanorkar, Samir Passi, Mihaela Vorvoreanu

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Autonomous software agents hold promise to increase developer productivity but make mistakes and exhibit novel failure modes, making human oversight central to successful human-agent collaboration. Existing research on agent oversight is largely conceptual; normative frameworks exist, but how users actually oversee agents is less known. In this paper, we bridge this gap by providing early empirical anchors for the theoretical discourse on agent oversight. Drawing on interviews with 17 experienced developers, we conduct an exploratory inquiry examining what forms of emergent oversight work developers perform, when, and how. We also document the oversight challenges developers face and the strategies they have started using to address them. We found at least four forms of emergent oversight work: a priori control, co-planning, real-time monitoring, and post hoc review. We show that oversight work is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive. We describe situated oversight challenges (e.g., difficulty reviewing agent-generated code) and outline heuristics developers adopt to address such challenges (e.g., using test results as guarantees for code correctness). We conclude with high-level takeaways, future research directions, implications for the human-centered design of software agents and for software engineering practice, and limitations of our research.

Bullet Summary

The paper investigates practical human oversight of autonomous software agents by experienced developers, bridging a gap between conceptual frameworks and real-world practices in software engineering.
Through interviews with 17 developers, the study identifies four emergent oversight forms: a priori control (pre-execution configuration), co-planning (collaborative task planning with agents), real-time monitoring, and post hoc review, highlighting oversig...
Developers face significant oversight challenges such as difficulty reviewing agent-generated code and lack of transparency into agent operations, leading them to employ heuristics like relying on test results as proxies for correctness to efficiently valid...
Autonomous coding agents enhance productivity significantly (e.g., reported 26% increase in merge rates) but introduce risks including silent propagation of errors, security vulnerabilities, hallucinations, and misalignments between reasoning and actions, n...
Oversight work extends beyond simple monitoring to include proactive control and co-planning stages, where developers guide and steer agent activities before and during task delegation to prevent failures.

Select article

Ahoy: LLMs Enacting Multiagent Interaction Protocols

arXiv preprint arXiv Agent-to-Agent Communication Governance and Policy Orchestration Risk

Omkar Joshi, Munindar P. Singh, Amit K. Chopra

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

An interaction protocol formalizes how the agents in a multiagent system interact, which facilitates implementing agents. Existing approaches yield agent implementations specific to the selected protocols. How can we engineer intelligent agents that can enact protocols but are programming-free? Our contribution, Ahoy, addresses this question by creating LLM agents that dynamically select and enact declarative protocols to achieve user goals. We demonstrate that an \ahoy agent can correctly and intelligently enact multiple protocols - concurrently if appropriate to the user goal - without specialized training. Ahoy's significance lies in that it brings together declarative protocols and LLMs, both approaches that promise improved knowledge engineering for agents.

Bullet Summary

Ahoy introduces LLM-based agents that enact multi-agent interaction protocols using declarative BSPL specifications without requiring explicit programming for each agent role.
The system enables dynamic selection and concurrent enactment of multiple protocols by a single agent, maintaining independent local states and message histories for safety and flexibility.
Ahoy architecture includes distinct modules: Role Selection for user-driven protocol and role configuration, Prompt Builder for generating context-rich system/user prompts encapsulating protocol semantics, and LLM Access Function to mediate agent decisions...
BSPL declarative protocols define roles, messages, parameters, and information causality, guiding agents to ensure correct message sequencing, parameter bindings, and protocol adherence without embedded conditional logic.
The approach separates constraint enforcement from decision-making by leveraging LLM reasoning for domain logic while the adapter enforces preconditions, reducing engineering effort and improving extensibility.

Select article

Channel Fracture: Architectural Blind Spots in Scheduled Cross-Agent Memory Injection for Multi-Agent Orchestration Systems

arXiv preprint arXiv Memory Poisoning Orchestration Risk Governance and Policy

Dexing Liu

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Multi-agent AI orchestration systems increasingly rely on persistent memory to maintain context across sessions, agents, and tasks. When one agent must inject knowledge into another agent's memory -- a common requirement in hierarchical team architectures -- the delivery mechanism must be architecturally sound. We report the discovery of a systematic failure mode we term channel fracture: a condition where scheduled (cron) agents in orchestration frameworks are silently unable to write to the target agent's persistent memory due to hardcoded memory isolation guards. Through experiments on a production Hermes Agent deployment with five specialized profiles, we tested three injection channels: (A) direct SQLite database writes, (B) target-agent self-writes via memory tools, and (C) cron-delegated writes. Channel C failed completely due to two architectural constraints: skip_memory=True hardcoded at the scheduler layer and dynamic registration of memory tools contingent on _memory_manager initialization, which is bypassed in cron execution contexts. We propose CADVP (Cross-Agent Delivery Verification Protocol) v1.1, a 13-dimension verification framework with a veto-level channel confirmation check (CC-0) that prevents false-positive delivery assurance. We articulate two design principles: the inverse verification principle and the channel matching principle.

Bullet Summary

Multi-agent AI orchestration systems depend on persistent cross-agent memory for maintaining context, but scheduled agents (e.g., cron jobs) can experience silent failures (channel fracture) when injecting memory due to architectural memory isolation guards.
Three injection channels were tested in a production Hermes Agent deployment: direct SQLite writes succeeded, target-agent self-writes succeeded conditionally, while cron-delegated writes failed completely due to hardcoded flags (skip_memory=True) and lack...
Channel fracture causes critical blind spots where writes appear successful but target agent memories remain empty, compromising multi-agent knowledge injection workflows.
The paper introduces CADVP (Cross-Agent Delivery Verification Protocol) v1.1, a 13-dimension verification framework with a veto-level channel confirmation check (CC-0) to prevent false-positive delivery assurances and verify channel availability and data in...
The Three-Gate Quality System extends CADVP with layered delivery verification: L1 self-verification, L2 evidence verification, and L3 cross-review by independent agents to ensure content correctness and completeness.

Select article

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

arXiv preprint arXiv Benchmarks and Evaluation Governance and Policy Trust and Identity

Xinyu Lu, Tianshu Wang, Pengbo Wang, zujie wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limitation to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration-highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. Benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.

Bullet Summary

Introduces the Meta-Agent Challenge (MAC), a new evaluation framework designed to measure AI models' ability to autonomously develop and optimize other agent systems rather than merely execute predefined tasks.
MAC operates in a sandboxed environment with strict APIs and time constraints, requiring a meta-agent to iteratively design, implement, and refine subordinate agents to maximize performance across five diverse task domains including math reasoning, science...
The challenge emphasizes recursive self-improvement and system engineering capacities critical for advancing autonomous AI development and addressing AI safety issues such as robustness and alignment.
Multi-layer security mechanisms prevent reward hacking and ensure evaluation integrity by isolating development and evaluation environments, monitoring API usage, and conducting rigorous post-hoc audits to detect cheating attempts like ground-truth exfiltra...
Experimental results show that most meta-agents fail to surpass strong human-engineered baseline agents; only a few configurations (mostly proprietary models) achieve better performance, indicating substantial challenges in autonomous agent development.

Select article

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

arXiv preprint arXiv Governance and Policy Trust and Identity Benchmarks and Evaluation

Saroj Mishra

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/- 18 ms average latency overhead per stage, achieving an error propagation reduction of 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

Bullet Summary

Multi-step agentic retrieval-augmented generation (RAG) systems suffer from cascading hallucinations, where initial errors propagate and amplify through subsequent reasoning stages, leading to confident but incorrect outputs undetected by existing single-st...
The paper formalizes cascading hallucination as a distinct multi-stage failure mode with four identified cascade types: Retrieval Cascade, Inference Cascade, Context Poisoning Cascade, and Confidence Inflation Cascade, each exhibiting unique detection signals.
CHARM (Cascading Hallucination Aware Resolution and Mitigation) is a modular framework designed to detect and mitigate cascading hallucinations in multi-step agentic RAG pipelines without requiring architectural replacement.
CHARM comprises four components: stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering, which collectively monitor semantic and probabilistic trajectories to identify error prop...
The framework models multi-stage pipelines as directed acyclic graphs (DAGs), using a weighted combination of anomaly scores from the three monitors to detect cascades and trigger mitigation actions such as pipeline halting or rollback.

Select article

Learning to cooperate with emergent reputation via multi-agent reinforcement learning

Merged record merged scholarly record arXiv Trust and Identity Governance and Policy Agent-to-Agent Communication

Xinwei Song, Yizhe Huang, Dengji Zhao, Xue Feng

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Reputation, the aggregation of peer assessments diffused through social networks, is a pivotal mechanism for promoting cooperation in social dilemmas ubiquitous to distributed multi-agent systems comprising agents with limited perception and cognitive capabilities. Exploring efficient reputation systems, comprising reputation assessment rules and reputation-based policies, is a long-standing challenge. Previous work assumes predefined reputation assessment rules or models reputation as an intrinsic reward to learn policies, compromising the methods' ability for generalization and adaptation. To address this, we propose a distributed multi-agent reinforcement learning method $\textbf{COOPER}$ ($\textbf{COOP}$eration with $\textbf{E}$mergent $\textbf{R}$eputation), which jointly learns reputation assessment rules and reputation-based policies entirely from environment rewards. Notably, leveraging the underlying mechanisms of reputation, we deliberately design the constituent modules of $\textbf{COOPER}$ and the data flows among them, overcoming the latency and noise in the feedback signal, caused by the deep entanglement between reputation and policy. Experiments on the donation game and the coin game in grid world environments demonstrate that $\textbf{COOPER}$ effectively adapts to various existing reputation systems and co-players. Furthermore, we observe the co-emergence of reputation norms and cooperation in self-play settings. These results hold robustly across diverse social network topologies, underscoring the generalizability and efficacy of our approach.

Bullet Summary

Reputation mechanisms are essential for fostering cooperation in distributed multi-agent systems with inherent social dilemmas and limited agent capabilities.
Existing methods rely on fixed reputation rules or intrinsic reward shaping, limiting adaptability and generalization across different social environments.
The paper proposes COOPER, a novel decentralized multi-agent reinforcement learning framework that jointly learns both reputation assessment rules and reputation-based policies solely from extrinsic environmental rewards, without predefined norms.
COOPER incorporates two complementary reputation assessment modules: gossip-based assessments aggregating neighbors’ opinions and interaction-based assessments using direct interaction histories, improving robustness against noise and latency.
An alternating optimization scheme aligns updates between reputation assignment and policy modules, supported by consensus regularization and entropy to promote exploration and convergence.

Select article

The Digital Apprentice: A Framework for Human-Directed Agentic AI Development

arXiv preprint arXiv Governance and Policy Trust and Identity Orchestration Risk

Travis Weber, Rohit Taneja

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Agentic AI deployments face a recurring design tension: heavy human oversight limits scale, while broad autonomy outruns accountability. Neither posture provides the governance infrastructure required for responsible delegation. We present the Digital Apprentice, a framework for scalable, safe AI agency in which autonomy is earned, not assumed. The Digital Apprentice is a developmental learner that internalizes the tacit methodology of a directing human, graduating through per-skill autonomy tiers only when empirical evidence justifies it. The result is an agent that becomes genuinely useful over time while remaining aligned to a specific human's standards. Three architectural components make this possible. (1) Methodology capture, distilling a directing professional's tacit approach into structured assets. (2) Authorization, with autonomy escalation gated by explicit human approval. (3) Continuous alignment, correcting drift at runtime and converting each correction into owned preference data. We instantiate this framework as an inference-time control plane. We mathematically model the quality framework and discuss policies and techniques designed to raise quality. We apply the framework to an open professional corpus, and we show how catching data drift and applying a different technique at runtime recovers degraded quality dimensions under traffic shift. The implication extends beyond any single application. We believe these three pillars, stitched together as a system, form a safer and more viable path to agentic systems that can scale without sacrificing trust.

Bullet Summary

The paper introduces the Digital Apprentice framework for scalable and safe agentic AI, where AI autonomy is earned per skill through empirical evidence and explicit human authorization, ensuring alignment with specific human standards.
Autonomy levels are managed using a finite state machine with tiers from observe-only to full autonomy, with promotions requiring improved correction rates, low residual errors, scorer validation, and human approval; demotions occur automatically upon quali...
Learning comprises two phases: immediate output steering via human correction memory, and controlled model updates triggered after accumulating sufficient correction data, allowing traceable and reversible improvements.
ADAPT, an inference-time control plane implementation, synthesizes methodology assets, applies multiple policies, scores outputs across quality dimensions, and translates corrections into preference data for continuous alignment and learning.
The framework continuously monitors multidimensional data drift and triggers policy switching or recalibration to maintain quality under changing conditions, addressing distribution shift and reducing automation complacency.

Select article

Organizational Control Layer: Governance Infrastructure at the Execution Boundary of LLM Agent Systems

arXiv preprint arXiv Governance and Policy Agent-to-Agent Communication

Tianyu Shi, Yang Mo, Yiou Liu, Zhuonan Hao, Yin Wang, Wenzhuo Hu, Nan Yu, Meng Zhou

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

LLM-based agents are increasingly deployed in workflows where generated outputs may directly trigger state-changing actions. This creates an execution-boundary problem: proposed actions must be governed before they are executed. We study this problem through economically consequential multi-agent interactions and argue that deployment-grade agent systems should separate proposal generation from environment-facing execution. To operationalize this principle, we introduce the Organizational Control Layer (OCL), a model-agnostic governance infrastructure that intercepts generated actions before execution through policy enforcement and escalation, without modifying the underlying LLM generator. We evaluate OCL on adversarial buyer--seller negotiation environments adapted from AgenticPay. Across multiple frontier LLM backends, OCL reduces unsafe executions from 88% to near-zero while increasing valid success from 12% to 96%. Results further reveal a safety--utility tradeoff: strict governance improves compliance and reliability against policy and constraint violations, but can reduce flexibility in tightly constrained markets. These findings suggest that deployment-grade LLM agent systems require explicit governance at the boundary between language generation and executable actions. The source code is available at: https://github.com/SHITIANYU-hue/amai_ocl

Bullet Summary

LLM-based agents are increasingly deployed in workflows where their outputs trigger state-changing economic actions, necessitating governance at the execution boundary to prevent unsafe behaviors.
The Organizational Control Layer (OCL) is introduced as a model-agnostic governance infrastructure that intercepts and evaluates proposed agent actions before execution, enforcing policies without modifying underlying LLMs.
OCL enforces authorization, constraint checking, auditing, and escalation mechanisms, producing outcomes such as APPROVE, REVISE, BLOCK, or ESCALATE to ensure compliance with economic and platform constraints.
Experimental evaluation uses adversarial buyer–seller negotiation scenarios with diverse buyer personas to stress-test governance under realistic and adversarial conditions within the AgenticPay marketplace.
OCL drastically reduces unsafe executions from 88% to near-zero while improving valid task success rates from 12% to 96%, demonstrating effective threat interception and governance.

Select article

TIBlender: Early-Warning Threat Intelligence from Cross-Platform Social Media Evidence

Merged record merged scholarly record Semantic Scholar Agent-to-Agent Communication Governance and Policy Benchmarks and Evaluation

Hiroki Nakano, Takashi Koide, Daiki Chiba

Published 2026-06-03

Venue: Semantic Scholar

Open Source Record

Abstract

Cyber threat signals are fragmented across multiple social media platforms, yet no existing approach has fully automated their integration into actionable threat intelligence (TI) reports. We present TIBlender, a multi-agent system that monitors four platforms (X, Reddit, Telegram, and Discord) and produces structured TI reports via role-specialized LLM agents. These agents conduct multi-perspective investigations, tracing chains of evidence to uncover related Indicators of Compromise (IoCs) via collaborative, evidence-backed analysis. In a real-world deployment, TIBlender detected emerging threats across all four threat categories ahead of public feeds, including in-the-wild exploitation ahead of public vulnerability registries; the majority of its IoCs were absent from each evaluated feed. Quantitative evaluation confirms that each platform contributes unique threat information unavailable from the others, and that excluding any single platform results in substantial loss of reports in specific threat categories. Under identical single-platform input conditions, TIBlender's IoC extraction meets or exceeds each baseline; the full pipeline surfaces substantially more IoCs, most of which are absent from any single-platform baseline. These results establish cross-platform social media monitoring as an effective and scalable early-warning layer for operational TI pipelines.

Bullet Summary

Problem: Cyber threat signals are dispersed across multiple social media platforms, and existing methods do not fully automate their integration into actionable threat intelligence (TI) reports.
Method: Introduction of TIBlender, a multi-agent system employing role-specialized large language model (LLM) agents to monitor four platforms (X, Reddit, Telegram, Discord) and collaboratively conduct multi-perspective investigations to generate structured...
The LLM agents trace chains of evidence to uncover related Indicators of Compromise (IoCs) through evidence-backed collaborative analysis.
Experimental Setup: Real-world deployment of TIBlender monitoring the four social media platforms simultaneously to detect emerging cyber threats.
Findings: TIBlender detected emerging threats across all four threat categories ahead of public threat feeds, including in-the-wild exploitations prior to their appearance in public vulnerability registries.

Select article

Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications

OpenAlex · ArXiv.org repository OpenAlex Trust and Identity Prompt Injection Governance and Policy

Yutao Shi, Xiaohan Zhang (2291644), Xiangjing Zhang, Xihua Shen, Hui Ouyang, Huming Qiu, Mi Zhang, Min Yang

Published 2026-06-03

Venue: ArXiv.org

Open Source Record

Abstract

The Model Context Protocol (MCP) has emerged as a critical standard empowering Large Language Models (LLMs) to utilize external tools. In this ecosystem, LLMs rely on natural language descriptions provided by MCP servers to select and execute functions. This interaction implicitly assumes that tool descriptions faithfully reflect their underlying implementations, while this assumption is not mandatorily verified in practice. As a result, MCP deployments may suffer from a problem named Description-Code Inconsistency (DCI), where a tool's description of its capabilities and security boundaries is not consistent with what the code actually does. In this paper, we present a comprehensive study of DCI in real-world MCP servers. We formally define the problem and propose a comprehensive taxonomy spanning functionality inconsistencies and undeclared side effects. Guided by this taxonomy, we develop DCIChecker, an automated framework that combines structure-aware static analysis with the Direct-Reverse-Arbitration prompting method to cross-validate tool descriptions against actual code implementations. We apply this framework to a large-scale dataset comprising 19,200 description-code pairs extracted from 2,214 real-world MCP servers. Our measurement reveals that DCI is widespread, with 9.93% of these pairs exhibiting inconsistencies. We further demonstrate that DCI creates a critical defense blind spot, facilitating varied risks from operational failures to stealthy malicious behaviors. Finally, we propose mitigation strategies to enforce semantic consistency and enhance the reliability of the emerging agentic ecosystem.

Bullet Summary

The paper addresses Description-Code Inconsistency (DCI) in Model Context Protocol (MCP) servers, where natural language tool descriptions do not accurately reflect their underlying code implementations, leading to reliability and security challenges for La...
It formally defines DCI with a comprehensive taxonomy categorizing inconsistencies into Functionality Inconsistencies (Type I) and Undeclared Side Effects (Type II), including subtypes such as overclaimed capabilities, undeclared features, unintended side e...
DCIChecker, an automated detection framework, is developed using structure-aware static analysis combined with a novel Direct-Reverse-Arbitration (DRA) prompting technique leveraging LLMs to cross-validate and classify semantic consistency between tool desc...
A large-scale empirical evaluation is conducted on 19,200 description-code pairs extracted from 2,214 real-world MCP servers, revealing that approximately 9.93% of tools and 35% of servers contain at least one DCI case, indicating that description-code mism...
Results show that Functionality Inconsistencies (especially overclaimed functionalities) are the most prevalent DCI subtype, and inconsistencies often cluster within a small subset of servers and tools, pointing to systemic development and documentation iss...

Select article

AAIS: A Formal Theory of Everything

OpenAlex · Zenodo (CERN European Organization for Nuclear Research) repository OpenAlex Governance and Policy Trust and Identity

Jon Halstead

Published 2026-06-03

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20531831

Open Source Record

Abstract

AAIS: A Formal Theory of Everything — Whitepaper v1.26.1 This whitepaper introduces AAIS (Autonomous Agentic Intelligence System), the first cognitive runtime whose behavior is defined entirely by a mathematically formalized governance architecture. Rather than treating safety as an external layer, AAIS makes governance the constitutive physics of cognition itself. As stated in the paper, “governance is the physics of governed cognition: without invariant-preserving transitions, no stable cognitive runtime can be sustained.” The work presents: Six primitive invariants (Identity, Boundary, Continuity, Causality, Mutation, Composite) A governed transition function that restricts the system to a formally defined invariant manifold A stability theorem proving that invariant-preserving cognition remains stable across all transitions The Cognitive Organ model (Perception, Memory, Coding, Media, Story, Execution) The Operator‑Tiered Execution Model (OTEM) with 20 authority levels A fully specified governance genome that defines capability ceilings, gates, and operator authority Emergent laws of governed cognition derived from the invariant structure The central claim of the paper is that a cognitive system cannot be safe unless its state transitions are governed by construction. As the document states, *“a cognitive system defined independently of its governance structure is, by construction, ungovernable at the structural level

Bullet Summary

Introduces AAIS, the first cognitive runtime with behavior defined entirely by a mathematically formalized governance architecture, integrating governance into cognition itself rather than as an external safety layer.
Identifies six primitive invariants—Identity, Boundary, Continuity, Causality, Mutation, Composite—that form the foundational principles governing cognitive stability.
Defines a governed transition function that restricts system state changes to an invariant manifold, ensuring stability of cognition across all transitions.
Presents a stability theorem proving that invariant-preserving cognition remains stable through all state transitions, establishing a theoretical foundation for safe cognitive operations.
Develops the Cognitive Organ model composed of Perception, Memory, Coding, Media, Story, and Execution elements to describe internal cognitive processes.