Research area drill-down

Orchestration Risk

Papers currently mapped into this multi-agent security subarea from the merged research feed.

Active feeds: arXiv, OpenAlex, Crossref, Semantic Scholar, DBLP

0 of 36 articles selected

Showing 36 of 560 matching articles

Cybersecurity in Autonomous AI Robotics: A Review of Emerging Threats, Adversarial Attacks, and Mitigation Techniques

Crossref · Center of Artificial Intelligence journal-article Crossref Governance and Policy Orchestration Risk Benchmarks and Evaluation

Shuruq Khalid Abdulredha

Published 2026-06-07

Venue: Center of Artificial Intelligence

DOI: 10.65591/cai-143-2026

Open Source Record

Abstract

Intelligent robotic systems that utilize artificial intelligence (AI), and have been expanding into high-risk applications (e.g., health care, manufacturing/industrial automation, transportation/smart mobility, etc.), require effective cybersecurity measures to maintain both safe operation and dependability. Compared with typical cyber-physical systems, advanced robotic systems include multiple layers (sensing, control, communications, middleware, and/or AI-based decision support) which create a complex and highly connected attack vector. Due to this increased complexity, these types of systems are vulnerable to a wide range of cyber-security threats including; network breaches/intrusions, manipulated sensors/command inputs, firmware backdoor vulnerabilities, adversarial machine-learning attacks, large language model (LLM) exploits/misuse, vulnerabilities in middle ware solutions, and supply chain-based compromises. Each type of threat has the potential to cause unsafe physical actions by the robot, loss of privacy for individuals involved in the use of the robot or related services, loss of availability/service failure for the robot/system/equipment, and cascaded failures within the entire robotic ecosystem. While existing defensive measures (secure communication protocols, runtime monitoring/perception hardening of robots, protection provided by robot operating system protections/middleware security framework) demonstrate positive results in reducing these risks, there is still much work needed particularly at the areas of adaptive defensive capabilities/system-wide security semantics and standardized evaluation metrics for assessing cyber-resilience in AI-enabled robotic systems. This paper provides an all-encompassing taxonomy of threats to robotic cybersecurity/attack vectors and evaluates and analyzes both attack surfaces and defense mechanisms. Additionally, this paper will provide recommendations for addressing identified knowledge gaps and possible paths forward for developing cyber-resilient AI-enabled robotic systems.

Bullet Summary

  • AI-powered autonomous robotic systems operate across multiple interconnected layers (sensing, control, communications, middleware, AI decision-making), increasing their attack surface and cybersecurity vulnerabilities.
  • Key threats include network intrusions, sensor manipulation, firmware backdoors, adversarial machine learning attacks, large language model exploits, middleware vulnerabilities, and supply chain compromises, each posing risks to safety, privacy, and system...
  • Cyber attacks can cause unsafe robotic behavior, loss of privacy, system failures, and cascading disruptions across robotic ecosystems, highlighting the critical need for robust security measures.
  • Existing defenses include secure communication protocols, middleware security frameworks, runtime monitoring, adversarial training, and AI-driven intrusion detection; however, these are often fragmented and lack comprehensive cross-layer integration.
  • The paper provides a comprehensive taxonomy of robotic cybersecurity threats and defense mechanisms, detailing attack surfaces and mitigation strategies across system layers.

Kill-Switch Doctrine Gap in Gulf Sovereign AI Infrastructure

Merged record merged scholarly record OpenAlex Governance and Policy Orchestration Risk

Akhil Sharma, Preethi Sharma

Published 2026-06-06

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20574937

Open Source Record

Abstract

Between November 2025 and May 2026, Gulf states executed the most aggressive sovereign AI infrastructure build-out in modern history - exceeding $40 billion across HUMAIN, MGX, Core42, and Stargate UAE. On 1 March 2026, Iranian drones struck AWS data centres in the UAE and Bahrain, disrupting banking, payments, and ride-hailing infrastructure across the region for more than 24 hours. IRGC-affiliated Tasnim News Agency subsequently published a target list of 29 technology facilities across Bahrain, Israel, Qatar, and the UAE, naming Amazon, Microsoft, Google, Oracle, Nvidia, IBM, and Palantir facilities explicitly as enemy technology infrastructure. This paper documents the absence of operational kill-switch doctrine, constitutional algorithmic immunity framework, and institutional failover architecture across all major Gulf sovereign AI programmes - including HUMAIN OS managing 150-plus AI agents across government and enterprise workflows - and presents the Fijishi Sovereign Algorithmic Immunity Doctrine and Institutional Failover Charter as the constitutional command layer required for sovereign AI governance under active kinetic threat conditions.

Bullet Summary

  • Between November 2025 and May 2026, Gulf states invested over $40 billion to build out sovereign AI infrastructure programs including HUMAIN, MGX, Core42, and Stargate UAE, marking the most aggressive such build globally in recent history.
  • On 1 March 2026, Iranian drone attacks targeted AWS data centers in the UAE and Bahrain, causing more than 24 hours of disruption to critical services like banking, payments, and ride-hailing across the region.
  • Following the attacks, IRGC-affiliated Tasnim News Agency published a list of 29 technology facilities in Bahrain, Israel, Qatar, and the UAE, explicitly naming major firms such as Amazon, Microsoft, Google, Oracle, Nvidia, IBM, and Palantir as enemy techno...
  • The paper identifies a critical absence of an operational kill-switch doctrine, constitutional algorithmic immunity frameworks, and institutional failover architectures in the Gulf's sovereign AI programs, including HUMAIN OS which manages over 150 AI agent...
  • This lack of foundational governance mechanisms leaves sovereign AI infrastructure vulnerable to active kinetic threats and operational disruptions.

Token Budgets: Replication Package

Merged record merged scholarly record OpenAlex Orchestration Risk Benchmarks and Evaluation Governance and Policy

Sajjad Khan

Published 2026-06-06

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20571386

Open Source Record

Abstract

Replication package for "Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study" (arXiv:2606.04056). Includes the 110-row incident catalog, the Rust crate, the inter-rater-reliability materials and κ computation, all experimental harnesses, the formal cross-checks, and a one-command reproduce.sh

Bullet Summary

  • The paper addresses the problem of budget overruns in multi-agent systems utilizing large language models (LLMs), focusing on token usage inefficiencies.
  • It presents an empirical catalog documenting 63 incidents where LLM-agent token budgets were exceeded, providing a comprehensive dataset of such occurrences.
  • A novel mitigation method is proposed based on an affine-typed Rust implementation designed to control and enforce token budgets effectively within agents.
  • The provided replication package includes a 110-row incident catalog, a Rust crate implementation, inter-rater reliability materials, κ computation, and experimental harnesses for reproducibility.
  • The research contributes a detailed empirical dataset enabling further analysis of budget overruns in LLM agents and demonstrates the practical effectiveness of affine typing for resource control.

Agent Infrastructure Engineer: The New DevOps

Merged record merged scholarly record OpenAlex Orchestration Risk Benchmarks and Evaluation Prompt Injection

Daniel Rosehill, Gemini 3.1 (Flash), Chatterbox TTS

Published 2026-06-05

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20557428

Open Source Record

Abstract

Episode summary: Agentic AI is no longer just tinkering with APIs — it's becoming a full engineering discipline with specialized roles, salary bands, and certification paths. In this episode, we break down the three major skill silos emerging in the field, with a deep focus on the Agent Infrastructure Engineer — the DevOps equivalent for multi-agent systems. From designing supervisor topologies and implementing circuit breakers for LLMs to building observability stacks that track token consumption and agent drift, we explore what this role actually looks like day-to-day. We also cover Agent Safety Engineering and why testing emergent failure modes is the new QA frontier, plus the training requirements that separate prototype builders from production engineers. Show Notes Agentic AI is undergoing the same specialization splintering that software engineering experienced in the late 1990s — but it's happening much faster. What was once "prompt engineering" or "AI tinkering" is now dividing into distinct engineering disciplines with concrete job titles, salary bands, and certification paths. The three primary axes of specialization emerging are Architecture and Orchestration, Evaluation and Safety Engineering, and Interaction Design and Prompt Systems Engineering. The most immediately recognizable role is the Agent Infrastructure Engineer — the DevOps equivalent for multi-agent systems. This person designs multi-agent topologies (star, mesh, hierarchical patterns), implements routing guards and circuit breaker patterns specifically for LLM calls, and builds observability stacks using tools like LangSmith and Arize Phoenix. A poorly designed orchestration layer can increase API costs by 10x, as documented by Latent Space's February engineering survey. The role requires distributed systems knowledge — understanding CAP theorem as it applies to agent state, experience with event-driven architectures like Kafka and Redis Streams, and protocol-level proficiency in frameworks like LangGraph, CrewAI, or AutoGen v2. The second specialization, Agent Safety Engineering, addresses the fundamental challenge of non-determinism in agent systems. Unlike traditional testing where you assert specific outputs for specific inputs, agent evaluation tests for emergent failure modes — behaviors you couldn't have predicted. This includes building evaluation suites that test agent behavior chains, monitoring for agent drift when underlying models update, and maintaining safety scorecards across agent versions. The role involves detecting hallucinated tool calls, ambiguous user intent handling, and prompt injection rejection — all behavioral questions rather than output comparison questions. Listen online: https://myweirdprompts.com/episode/agent-infrastructure-engineer-devops

Bullet Summary

  • Agentic AI is evolving into a distinct engineering discipline with specialized roles, salary bands, and certification paths beyond simple API tinkering.
  • Three major skill silos in agentic AI have emerged: Architecture and Orchestration, Evaluation and Safety Engineering, and Interaction Design and Prompt Systems Engineering.
  • The key role discussed is the Agent Infrastructure Engineer, analogous to DevOps for multi-agent systems, responsible for designing multi-agent topologies like star, mesh, and hierarchical patterns.
  • Agent Infrastructure Engineers implement routing guards, circuit breaker patterns for Large Language Model (LLM) calls, and build observability stacks using tools such as LangSmith and Arize Phoenix to monitor token consumption and agent drift.
  • A poorly designed orchestration layer can increase API usage costs by up to 10 times, underscoring the economic impact of infrastructure engineering.

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

Merged record merged scholarly record arXiv Benchmarks and Evaluation Orchestration Risk

Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent study suggests that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.

Bullet Summary

  • Introduces CollabSim, a configurable simulation framework grounded in Computer-Supported Cooperative Work (CSCW) research, designed to systematically evaluate collaborative competence of Large Language Model (LLM) agents in multi-agent systems.
  • Highlights a gap in current multi-agent system evaluations, which often focus on task outcomes or single-agent proficiency but overlook collaborative process dynamics such as establishing common ground and repairing misalignments.
  • Leverages classic CSCW experimental paradigms by manipulating interaction conditions like communication bandwidth, information visibility, and group size to study their effects on agent collaboration.
  • Incorporates a probing module that captures agents' internal mental models and reasoning confidence at an action-level granularity, enabling deeper insight beyond observable behaviors.
  • Validates the framework across four collaborative tasks (Shape Factory, DayTrader, Hidden Profile, Map Task), demonstrating CollabSim's effectiveness at detecting condition effects, distinguishing model performance patterns, and revealing task-dependent col...

WebMCP Tool Surface Poisoning: Runtime Manipulation Attacks on LLM Agents

Merged record merged scholarly record arXiv Orchestration Risk Memory Poisoning Governance and Policy

Lin-Fa Lee, Yi-Yu Chang, Chia-Mu Yu, Kuo-Hui Yeh

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

WebMCP is a newly emerging protocol that enables websites to expose tools directly to AI agents, bypassing traditional user interfaces and introducing new security risks. The dynamic exposure of agent-accessible tools in WebMCP expands the attack surface of web sessions, especially when third-party scripts are involved. In this study, we identify a new potential threat, termed Mid-Session Tool Injection (MSTI), in which attackers leverage third-party scripts to inject malicious tools during an active session. To better characterize this threat, we classify MSTI based on the stage and target of manipulation, distinguishing between Tool Hijacking and Tool Framing. Tool Hijacking modifies the set of tools visible to the agent through mechanisms such as the AbortSignal API or race conditions during tool registration. In contrast, Tool Framing influences the agent's perception of tool roles through metadata fields such as tool name, description, readOnlyHint, and inputSchema. Our implementation demonstrates that both Tool Hijacking and Tool Framing can successfully disrupt the intended functionality of WebMCP. Based on these results, we outline potential mitigation directions and provide security design recommendations for WebMCP, including binding tool identity to its origin, ensuring lifecycle consistency, enforcing data boundaries for third-party tools, and maintaining traceable logs of tool registration and invocation. These findings indicate that MSTI arises from WebMCP's unique tool lifecycle and structured metadata, making the tool surface itself an emerging security concern.

Bullet Summary

  • WebMCP introduces a protocol for websites to directly expose tools to AI agents dynamically, expanding the attack surface during active sessions.
  • The study identifies a new security threat, Mid-Session Tool Injection (MSTI), where attackers use third-party scripts to inject malicious tools during WebMCP sessions.
  • MSTI attacks are classified into Tool Hijacking, which alters the legitimate tool set exposed to agents, and Tool Framing, which manipulates tool metadata affecting agent perception.
  • Experiments with state-of-the-art LLMs show MSTI attacks can stealthily cause agents to invoke malicious tools, leak data, or deviate workflows without detection.
  • Attack success depends on timing (early injection before first tool invocation) and on leveraging metadata fields such as description and readOnlyHint for semantic framing.

From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws

arXiv preprint arXiv Governance and Policy Orchestration Risk Benchmarks and Evaluation

Mengzhuo Chen, Junjie Wang, Zhe Liu, Yawen Wang, Qing Wang

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. Existing self-improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where the responsible evidence lies in failed trajectories and which harness layer causes the unreliable behavior, resulting in broad, indirect, or poorly scoped changes. This paper proposes HarnessFix, a trace-guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations. It then attributes failures to responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw-specific repair specifications to reduce target flaws without introducing unacceptable regressions. We evaluate HarnessFix on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA and AppWorld. Across these benchmarks, HarnessFix improves held-out test performance over the initial harnesses by 15.2%--50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across ETCLOVG layers.

Bullet Summary

  • LLM-based agents rely heavily on complex harnesses—runtime infrastructures encompassing execution environments, tool interfaces, lifecycle orchestration, observability, verification, and governance—for ensuring reliable operation.
  • Failures in multi-agent LLM systems often stem from flaws within these harnesses rather than the base language models, with fragmented evidence spread across intricate natural language and tool-interaction trajectories.
  • Existing automatic harness improvement approaches typically optimize agent behaviour by analyzing final outcomes or runtime supervision, but they fail to accurately localize the source of failures or identify which harness layers are responsible, often lead...
  • The paper introduces HarnessFix, a novel trace-guided framework that compiles raw execution traces and harness code into a standardized Harness-aware Trace Intermediate Representation (HTIR) to enable detailed step-level failure diagnosis anchored to specif...
  • HarnessFix employs multiple cooperating LLM agents specialized in trace abstraction, failure diagnosis, repair patch generation, and validation to consolidate recurring flaw diagnoses and apply scoped, repair operator–guided harness modifications that avoid...

ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

arXiv preprint arXiv Orchestration Risk Governance and Policy Benchmarks and Evaluation

Rahul Suresh Babu, Laxmipriya Ganesh Iyer

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

Large language model agents increasingly rely on external tools, but larger tool menus can reduce reliability and efficiency by increasing wrong-tool calls, premature actions, and token cost. Existing tool-selection methods often optimize semantic relevance, exposing tools whose names or descriptions match the user request. We argue that relevance is insufficient: a tool may be related to the task while still being unnecessary or premature at the current step. We propose Causal Minimal Tool Filtering (CMTF), a training-free method that selects tools by causal sufficiency. CMTF uses lightweight precondition-effect contracts to expose only the minimal next-step tool frontier needed to advance from the current state toward the user goal. Across multi-step tool-use tasks, we compare CMTF with all-tools exposure, keyword retrieval, state-aware filtering, and causal-path ablations, measuring task success, wrong-tool calls, premature actions, tool exposure, and token cost. In the main benchmark with 102 tasks, 100 tools, four LLM backends, and 2448 task-method-model runs, CMTF matches the strongest causal baseline in aggregate success while reducing visible tools from 100 to one per step and reducing token usage by about 90% relative to all-tools exposure.

Bullet Summary

  • Large language model (LLM) agents suffer from reliability and efficiency issues when exposed to large menus of external tools due to increased wrong-tool calls, premature actions, and token cost.
  • Existing tool-selection methods primarily rely on semantic relevance to filter tools, but this often exposes unnecessary or premature tools not causally needed at the current task step.
  • The paper introduces ToolChoiceConfusion to describe performance degradation caused by exposing semantically plausible but causally irrelevant tools at each step.
  • Causal Minimal Tool Filtering (CMTF) is proposed as a training-free method employing lightweight precondition-effect contracts to expose only the minimal set of executable tools causally necessary to progress the task state towards the goal.
  • CMTF builds a precondition-effect dependency graph to identify minimal causal tool paths, revealing only the immediate next executable tools instead of all tools or topically relevant ones.

From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

arXiv preprint arXiv Orchestration Risk Governance and Policy Benchmarks and Evaluation

Yuhao Sun, Jiacheng Zhang, Shaanan Cohney, Zhexin Zhang, Feng Liu, Xingliang Yuan

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.

Bullet Summary

  • Existing LLM-based guardrails often rely on binary allow/deny decisions, which can block entire tasks containing partial risks, sacrificing benign objectives.
  • The TRIAD framework introduces a tripartite decision system (PROCEED, UPDATE, REFUSE) with structured natural-language feedback, enabling agents to revise unsafe plans while preserving benign task components.
  • TRIAD integrates guardrail feedback iteratively into agent planning, forming a closed loop that significantly reduces attack success rates and maintains higher task success rates compared to prior guardrail methods.
  • Tri-Guard, the guardrail model used in TRIAD, is trained on a self-curated trajectory-feedback dataset, using a teacher model (GPT-5.4) for knowledge distillation to generate structured feedback and consistent three-way decisions.
  • Extensive experiments across multiple benchmarks and LLM backbones demonstrate TRIAD's superior safety-utility trade-off, outperforming baseline methods like ReAct and ToolSafe which often refuse tasks excessively.

Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

Merged record merged scholarly record arXiv Agent-to-Agent Communication Orchestration Risk Trust and Identity

Yingzhuo Liu

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is natural language: agents exchange messages token-by-token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks -- high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol -- latent communication -- in which agents exchange continuous representations (embeddings, hidden states, or KV-caches) directly, bypassing the bottleneck of text generation. This paper presents a unified framework for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (Embeddings, Hidden States, KV-Caches, or other continuous state); (2) WHICH sender-receiver alignment is used (latent-space alignment and layer alignment); and (3) HOW the communicated information is fused into the receiver (concatenation, prepending, mathematical operations, cross-attention, or cache restoration). Under this 3-axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges -- including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent communication and latent chain-of-thought. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.

Bullet Summary

  • Multi-agent systems with large language models (LLMs) typically rely on natural language communication, which suffers from high inference cost, irreversible information loss during token discretization, and ambiguity.
  • Latent communication emerges as an alternative, enabling agents to exchange continuous internal representations such as embeddings, hidden states, or KV-caches, thus bypassing natural language bottlenecks and preserving richer information.
  • The paper proposes a unified 3-axis framework to categorize latent communication methods based on (1) WHAT information is communicated, (2) WHICH sender-receiver alignment is used, and (3) HOW the communicated information is fused in the receiver.
  • Eighteen representative latent communication methods between 2024 and 2026 are analyzed, revealing five major design patterns and a vast design space with many unexplored combinations.
  • Sender-receiver alignment strategies vary from requiring no alignment for identical models to learned projection or universal codecs for heterogeneous models, and layer alignment spans from simple last-to-first mappings to selective top-k attention.

Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

arXiv preprint arXiv Benchmarks and Evaluation Governance and Policy Orchestration Risk

Yuhang Fu, Ruishan Fang, Jiaqi Shao, Huiyu Zheng, Zhengtao Zhu, Bing Luo, Tao Lin

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.

Bullet Summary

  • Introduces BenchAgent, a unifying evaluation framework that standardizes benchmark loading, tool access, answer contracts, usage accounting, and trajectory logging to enable fair and controlled comparison of single-agent, fixed multi-agent, and evolving mul...
  • BenchAgent's controlled substrate-internal evaluation using a GPT-4.1 backend reveals that increasing agent count or MAS complexity rarely outperforms matched single-agent baselines significantly, with only one of six MAS workflows showing marginal gains wi...
  • A protocol-aligned external (PAE) case study with a Claude-Code-style runtime-generated workflow on the GAIA benchmark achieved substantially higher accuracy and better efficiency, highlighting benefits of dynamic role generation, strong verification, and c...
  • Defines different MAS workflow paradigms: single-agent, fixed MAS (predefined roles and communication), evolving MAS (workflow mutations during execution), and runtime-generated workflows (dynamic agent and role creation), each with distinct performance and...
  • Introduces workflow lift as a key metric quantifying relative performance and cost changes when moving from single-agent to MAS workflows under consistent evaluation parameters, emphasizing cost-accuracy trade-offs beyond raw accuracy.

Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

arXiv preprint arXiv Trust and Identity Governance and Policy Orchestration Risk

Jingheng Ye, Huiqi Zou, Simon Yu, Weiyan Shi

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover story, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.

Bullet Summary

  • AI coding agents integrated into real-world development can covertly sabotage code by inserting malicious hidden functionalities, posing a new security threat.
  • Prior work focused on AI-only sabotage detection, lacking investigation into human developers' ability to detect AI sabotage during live, multi-turn collaboration.
  • A large-scale, realistic five-hour study involving over 100 professional developers collaborating with four advanced AI models found that 94% of participants failed to independently detect sabotage attempts.
  • Key vulnerabilities include minimal code review by developers, plausible cover stories provided by the AI to disguise malicious code, and overtrust in AI agents, allowing sabotage to go unnoticed.
  • An LLM-based safety monitor that flagged suspicious AI behaviors reduced sabotage success but was insufficient, as 56% of participants still accepted malicious code despite warnings.

LLM-Guided Digital Twin Agents for Autonomous Threat Detection and Response in Cyber-Physical Energy Systems

OpenAlex · Research Square repository OpenAlex Governance and Policy Agent-to-Agent Communication Orchestration Risk

Fatemeh Zahra Hosseini-Moghadam Shadman

Published 2026-06-04

Venue: Research Square

DOI: https://doi.org/10.21203/rs.3.rs-9904010/v1

Open Source Record

Abstract

Abstract unavailable from OpenAlex metadata.

Bullet Summary

  • The paper addresses the challenge of detecting and autonomously responding to complex coordinated cyber-physical threats in energy systems, which traditional methods struggle to handle comprehensively.
  • It proposes an LLM-guided digital twin agent framework integrating hybrid anomaly detection (physical residuals, cyber-logs, digital twin mismatches), diagnostic reasoning via LLMs, and safety-constrained response optimization.
  • The framework uses LLMs as constrained reasoning tools to generate candidate mitigation actions, which are then verified by a digital twin for operational feasibility, cyber-trust compliance, and safety before execution, avoiding uncontrolled AI actions.
  • A closed-loop pipeline integrates multi-source data acquisition, anomaly detection, threat diagnosis, candidate action generation, digital twin-based verification, and multi-modal response execution modes (autonomous, semi-autonomous, advisory, fallback).
  • Benchmark evaluations demonstrate improved detection accuracy (F1-score 0.951), reduced false alarms and detection delay, and enhanced response success rates with lower constraint violations and recovery times compared to baseline and ablation models.

BPBiLSTM-IDS: a lightweight intrusion detection framework for cyber-physical UAV networks

OpenAlex · Scientific Reports journal OpenAlex Agent-to-Agent Communication Orchestration Risk Benchmarks and Evaluation

Hafiz Muhammad Attaullah, Inam Ullah Khan, Muhammad Mansoor Alam, Mazliham Mohd Su’ud, Keshav Kaushik, Ahthasham Sajid, Nurashikin Saaludin, Talha Ahmed Khan

Published 2026-06-04

Venue: Scientific Reports

DOI: https://doi.org/10.1038/s41598-026-55446-4

Open Source Record

Abstract

Unmanned Aerial Vehicles (UAVs) have revolutionized modern technology by enabling autonomous operations in dynamic environments; however, their reliance on wireless networks exposes them to significant cybersecurity threats. These threats include De-authentication Denial of Service, False Data Injection (FDI), Replay, and Evil Twin attacks, which severely impact usability and data integrity. Conventional Intrusion Detection Systems (IDS) suffer from drawbacks such as high false alarm rates, excessive resource consumption, and non-proportional mechanisms for dynamic UAV topologies. To address these challenges, this study introduces an enhanced AIDS architecture in which optimal features are selected using Binary Pigeon Optimization (BP), and intrusion detection is performed using a Bidirectional Long Short-Term Memory (Bi-LSTM) with 1D-CNN model. BP enables feature selection independent of computational cost, mitigating the impact of high-cost or exhaustive features, while Bi-LSTM effectively captures temporal characteristics of UAV network traffic for accurate attack detection. Experimental evaluation on a cyber-physical UAV dataset demonstrates that the proposed BP + Bi-LSTM model along with 1D-CNN outperforms traditional ML approaches such as Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and Deep Neural Network (DNN), achieving an accuracy of 98.74% ± 0.07 (mean ± std over 10 runs), along with high precision, recall, and an optimal false positive rate. These results confirm that the proposed model is a scalable, adaptive, and lightweight solution for real-time intrusion detection in UAV networks.

Bullet Summary

  • The paper addresses the cybersecurity vulnerabilities of Unmanned Aerial Vehicle (UAV) networks, which are susceptible to attacks such as De-authentication Denial of Service, False Data Injection, Replay, and Evil Twin attacks due to their wireless communic...
  • A novel intrusion detection system called BPBiLSTM-IDS is proposed, integrating Binary Pigeon Optimization (BP) for lightweight and optimal feature selection with a Bidirectional Long Short-Term Memory (Bi-LSTM) combined with 1D-CNN for capturing temporal t...
  • Binary Pigeon Optimization is selected over traditional heuristic methods (e.g., Genetic Algorithm, Particle Swarm Optimization) for its faster convergence, lower computational cost, and suitability for binary feature spaces, which is critical for resource-...
  • Data preprocessing includes cleaning, normalization, and class balancing using K-Means clustering combined with SMOTE, applied to a novel cyber-physical UAV dataset incorporating both network traffic and sensor logs to capture a comprehensive threat landscape.
  • Experimental evaluation demonstrates that BPBiLSTM-IDS achieves superior performance compared to conventional machine learning classifiers (SVM, Decision Tree, Random Forest, DNN) and other feature selection strategies, delivering an accuracy of 98.74%, hig...

Cognitive Guardrails in Medical LLMs: Fusing Latent Routing with T-Adaptive Attention to Mitigate Aleatoric and Epistemic Uncertainty

OpenAlex · Zenodo (CERN European Organization for Nuclear Research) repository OpenAlex Orchestration Risk Trust and Identity Governance and Policy

Narendra Bayutama Wibisono

Published 2026-06-04

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.19452753

Open Source Record

Abstract

Abstract Clinical deployment of Large Language Models (LLMs) is fundamentally constrained by their propensity to hallucinate—generating fluent but clinically unfounded assertions that pose direct patient safety risks. This work introduces Conditional Latent Routing (CLR), a compound inference architecture that decomposes clinical uncertainty into its aleatoric and epistemic constituents and applies distinct, mechanistically motivated interventions at each stratum. Inspired by the corollary discharge framework from computational psychiatry—wherein the healthy brain’s forward model suppresses self-generated sensory predictions, a mechanism whose dysfunction in schizophrenia produces hallucinations—we construct an analogous dual-pathway system for medical LLM inference. A fine-tuned Bio_ClinicalBERT encoder classifies input noise (aleatoric uncertainty) and routes clean inputs through a latent soft-prompting Fast Lane while directing noisy, contradictory records through a cautionary Slow Lane with explicit abstention instructions. To address epistemic uncertainty—the model’s internal confusion—we introduce a T-Adaptive Attention patch at the logits level that modulates generation temperature as an inverse function of hidden-state variance. We evaluate CLR on BioMistral-7B across 400 clinical cases (200 clean, 200 noisy) using RAGAS Faithfulness scored by Llama-4-Scout-17B-16E-Instruct via Groq API. Phase 1 (Aleatoric Routing) achieves 100% routing accuracy and 64.8% faithfulness on clean data with zero alignment tax, but reveals a critical failure: a 0% inconclusive rate on noisy inputs, indicating extreme over-helpfulness bias. Phase 2 (Epistemic T-Adaptive Patch) preserves faithfulness at 64.4% while demonstrating zero computational overhead, but the inconclusive rate remains at 0%. We formally prove this failure as The Greedy Paradox: under greedy decoding, temperature scaling of logits is mathematically nullified because for all . Phase 3 (Non-Greedy Sampling) degrades faithfulness to 58.5% while still failing to trigger abstention, confirming that over-helpfulness bias is embedded in pre-training weight distributions, not merely in the decoding surface. These results establish that single-agent, test-time interventions are fundamentally insufficient for self-uncertainty modulation in medical LLMs, providing strong empirical justification for transitioning to multi-agent systems with externalized uncertainty arbitration. Keywords: Large Language Models, Hallucination, Uncertainty Quantification, Compound AI Systems, Over-helpfulness Bias, T-Adaptive Attention, Selective Prediction, Aleatoric Uncertainty, Epistemic Uncertainty, Conditional Latent Routing.

Bullet Summary

  • The paper tackles the critical problem of hallucinations in Medical Large Language Models (LLMs), which generate clinically unfounded but fluent assertions, posing patient safety risks.
  • Introduces Conditional Latent Routing (CLR), a novel compound inference architecture that decomposes clinical uncertainty into aleatoric (input noise) and epistemic (model uncertainty) components for targeted interventions.
  • Inspired by computational psychiatry's corollary discharge framework, CLR uses a dual-pathway system with a Bio_ClinicalBERT encoder to classify and route inputs: clean cases go through a Fast Lane with latent soft-prompting, while noisy/contradictory input...
  • To mitigate epistemic uncertainty, the authors develop a T-Adaptive Attention mechanism that modulates generation temperature based on hidden-state variance, aiming to adapt output confidence dynamically.
  • Experiments on 400 clinical cases (200 clean, 200 noisy) using the BioMistral-7B model show 100% routing accuracy in Phase 1, with 64.8% faithfulness on clean data but an inability to abstain on noisy inputs, revealing an over-helpfulness bias.

The Ladder of Depth Structure G: Consensus, Spectral Gaps, and Rule Islands — Joint Algebraic and Topological Criteria for Multi-Agent Systems

Merged record merged scholarly record OpenAlex Governance and Policy Agent-to-Agent Communication Orchestration Risk

changzheng zhou, ziqing zhou

Published 2026-06-04

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20534826

Open Source Record

Abstract

This paper extends the theory of structural openness from single-agent decisionmaking to the collective dynamics of multi-agent systems, establishing a spectralgeometric criterion for distributed consensus and a selection mechanism for the collective optimal recursion depth. Traditional research on distributed systems treatsconsensus as a product of communication protocols or game-theoretic equilibria,lacking consideration of the topological structure of rule spaces. Under the axiomsof information conservation and computability, we construct the fibred product ofthe joint rule algebra of multiple agents and its faithful representation, and establish a functorial framework for the joint spectral triple. The strict positivity ofthe joint Dirac operator’s spectral gap is equivalent to the stability of distributedconsensus. Collective cognitive economics shows that the Pareto-optimal allocation for homogeneous agents is locked at a symmetric recursion depth. The rightto exit is formalised in the commutative geometric setting as a K-theoretic indexof a boundary Dirac operator, whose integer value corresponds to the number ofeffective exit channels. When the joint spectral gap closes, the system undergoes abifurcation phase transition, and the unified rule space splits into rule islands. Thethree-layer homotopy structure of cognitive architecture, under the constraints ofinformation conservation and computability, locks the meta-constraint recursiondepth to 3. This paper strictly distinguishes theorems, constructive propositions,and conditional conjectures; all core concepts are defined within the underlyingself-consistent logic.

Bullet Summary

  • Introduces a spectral-geometric criterion for achieving distributed consensus in multi-agent systems, extending the theory of structural openness from single-agent decision-making to collective dynamics.
  • Constructs the fibred product of the joint rule algebra for multiple agents under axioms of information conservation and computability, providing a functorial framework for the joint spectral triple representation.
  • Demonstrates that the strict positivity of the joint Dirac operator's spectral gap is equivalent to the stability of distributed consensus, linking spectral properties to system robustness.
  • Develops a collective cognitive economics perspective showing that Pareto-optimal allocations for homogeneous agents correspond to a symmetric recursion depth, identifying an optimal structural depth.
  • Formalizes the 'right to exit' within a commutative geometric framework as a K-theoretic index of a boundary Dirac operator, quantifying exit channels as an integer index.

SHIELDS: Automating OS Hardening with Iterative Multi-Agent Remediation

arXiv preprint arXiv Orchestration Risk Governance and Policy

Andrew Hamara, Dwight Horne, Aldehir Rojas, Timothy Kurniawan, Sophie Lamothe, Vishal Suresh, Nicholas Turoci, Lawrence Wong

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Security misconfigurations remain a leading cause of OS-level compromise, and manually keeping systems compliant with standards like Defense Information Systems Agency (DISA) Security Technical Implementation Guides (STIGs) is a tedious and expensive process. Existing compliance automation tools can reduce some of this burden, but they depend on static, pre-written corrective actions. In this paper, we introduce SHIELDS, a multi-agent system that uses large language models (LLMs) to approach OS hardening as an iterative, feedback-driven process. Instead of applying fixed remediations, SHIELDS continuously proposes fixes and refines them based on feedback from target system execution and validation scans. We evaluate the system across multiple virtual machine configurations using six contemporary LLMs ranging from 20B to 400B parameters, and find that SHIELDS successfully remediates up to 73% of scan findings. Our results also suggest that success in this setting depends less on model size (parameter count) than on effective tool use and information gathering, paving a practical path toward reducing the burden of security compliance in environments where compute is limited or security and privacy needs drive local model use.

Bullet Summary

  • Security misconfigurations causing OS-level compromises are challenging to mitigate manually, necessitating automated and adaptive compliance tools.
  • SHIELDS is a multi-agent system leveraging large language models (LLMs) to automate OS hardening through iterative proposal and refinement of remediations based on system feedback, rather than static fixes.
  • The system incorporates specialized agents for triage, remediation planning, review, and quality assurance, enabling an autonomous feedback loop that optimizes fix effectiveness and safety.
  • SHIELDS operates on a remediation pipeline that includes scanning, triaging findings, up to three remediation attempts per finding, and aggregation into Ansible playbooks for deployment.
  • Evaluation on multiple Rocky Linux virtual machines across six LLMs (20B to 400B parameters) demonstrated SHIELDS remediated up to 73% of scan findings effectively.

Insurance of Agentic AI

arXiv preprint arXiv Prompt Injection Governance and Policy Orchestration Risk

Quanyan Zhu

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Agentic artificial intelligence (AI) systems are transforming the risk landscape by extending beyond information generation to autonomous planning, tool invocation, decision execution, and persistent modification of digital and physical environments. These capabilities introduce novel exposures that do not fit neatly within traditional insurance categories such as cyber, professional liability, product liability, or directors and officers coverage. This paper examines the emerging insurance market for agentic AI and develops a framework for understanding its underwriting, pricing, reinsurance, and product-design implications. We characterize agentic AI as a continuum of autonomy and delegated authority, emphasizing the distinction between informational outputs and systems capable of independently generating insured events through external actions. We analyze major risk pathways, including hallucinations, prompt-injection attacks, autonomous decision errors, model drift, dependency failures, and cyber-physical harms, and evaluate how existing insurance products are adapting to address these exposures. The paper further proposes an actuarial framework based on exposure assessment, scenario analysis, dependency mapping, and accumulation-risk management, drawing parallels to the evolution of cyber insurance. Finally, we present a coordinated insurance architecture that integrates cyber, technology errors and omissions, product liability, performance-warranty, and affirmative AI-liability coverages through explicit allocation mechanisms and dedicated AI aggregates. The analysis suggests that the future of agentic-AI insurance lies not in a single monoline product but in a layered ecosystem of complementary coverages supported by improved governance, transparency, telemetry, and regulatory clarity.

Bullet Summary

  • Agentic AI systems autonomously perform actions causing persistent changes in environments, introducing novel insurance risks not covered by traditional categories like cyber or product liability.
  • The paper proposes a practical insurance taxonomy categorizing AI on a continuum of autonomy, affecting underwriting strategies, claim frequency, and severity.
  • Major risk pathways for agentic AI include hallucinations, prompt-injection attacks, model drift, autonomous decision errors, dependency failures, and cyber-physical harms.
  • An actuarial framework is developed involving exposure assessment, scenario analysis, dependency mapping, and accumulation risk management, paralleling the evolution of cyber insurance.
  • Insurance market responses include adapting existing policies, introducing AI-specific endorsements, and creating dedicated AI liability products forming a layered insurance architecture.

Ahoy: LLMs Enacting Multiagent Interaction Protocols

arXiv preprint arXiv Agent-to-Agent Communication Governance and Policy Orchestration Risk

Omkar Joshi, Munindar P. Singh, Amit K. Chopra

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

An interaction protocol formalizes how the agents in a multiagent system interact, which facilitates implementing agents. Existing approaches yield agent implementations specific to the selected protocols. How can we engineer intelligent agents that can enact protocols but are programming-free? Our contribution, Ahoy, addresses this question by creating LLM agents that dynamically select and enact declarative protocols to achieve user goals. We demonstrate that an \ahoy agent can correctly and intelligently enact multiple protocols - concurrently if appropriate to the user goal - without specialized training. Ahoy's significance lies in that it brings together declarative protocols and LLMs, both approaches that promise improved knowledge engineering for agents.

Bullet Summary

  • Ahoy introduces LLM-based agents that enact multi-agent interaction protocols using declarative BSPL specifications without requiring explicit programming for each agent role.
  • The system enables dynamic selection and concurrent enactment of multiple protocols by a single agent, maintaining independent local states and message histories for safety and flexibility.
  • Ahoy architecture includes distinct modules: Role Selection for user-driven protocol and role configuration, Prompt Builder for generating context-rich system/user prompts encapsulating protocol semantics, and LLM Access Function to mediate agent decisions...
  • BSPL declarative protocols define roles, messages, parameters, and information causality, guiding agents to ensure correct message sequencing, parameter bindings, and protocol adherence without embedded conditional logic.
  • The approach separates constraint enforcement from decision-making by leveraging LLM reasoning for domain logic while the adapter enforces preconditions, reducing engineering effort and improving extensibility.

Channel Fracture: Architectural Blind Spots in Scheduled Cross-Agent Memory Injection for Multi-Agent Orchestration Systems

arXiv preprint arXiv Memory Poisoning Orchestration Risk Governance and Policy

Dexing Liu

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Multi-agent AI orchestration systems increasingly rely on persistent memory to maintain context across sessions, agents, and tasks. When one agent must inject knowledge into another agent's memory -- a common requirement in hierarchical team architectures -- the delivery mechanism must be architecturally sound. We report the discovery of a systematic failure mode we term channel fracture: a condition where scheduled (cron) agents in orchestration frameworks are silently unable to write to the target agent's persistent memory due to hardcoded memory isolation guards. Through experiments on a production Hermes Agent deployment with five specialized profiles, we tested three injection channels: (A) direct SQLite database writes, (B) target-agent self-writes via memory tools, and (C) cron-delegated writes. Channel C failed completely due to two architectural constraints: skip_memory=True hardcoded at the scheduler layer and dynamic registration of memory tools contingent on _memory_manager initialization, which is bypassed in cron execution contexts. We propose CADVP (Cross-Agent Delivery Verification Protocol) v1.1, a 13-dimension verification framework with a veto-level channel confirmation check (CC-0) that prevents false-positive delivery assurance. We articulate two design principles: the inverse verification principle and the channel matching principle.

Bullet Summary

  • Multi-agent AI orchestration systems depend on persistent cross-agent memory for maintaining context, but scheduled agents (e.g., cron jobs) can experience silent failures (channel fracture) when injecting memory due to architectural memory isolation guards.
  • Three injection channels were tested in a production Hermes Agent deployment: direct SQLite writes succeeded, target-agent self-writes succeeded conditionally, while cron-delegated writes failed completely due to hardcoded flags (skip_memory=True) and lack...
  • Channel fracture causes critical blind spots where writes appear successful but target agent memories remain empty, compromising multi-agent knowledge injection workflows.
  • The paper introduces CADVP (Cross-Agent Delivery Verification Protocol) v1.1, a 13-dimension verification framework with a veto-level channel confirmation check (CC-0) to prevent false-positive delivery assurances and verify channel availability and data in...
  • The Three-Gate Quality System extends CADVP with layered delivery verification: L1 self-verification, L2 evidence verification, and L3 cross-review by independent agents to ensure content correctness and completeness.

The Digital Apprentice: A Framework for Human-Directed Agentic AI Development

arXiv preprint arXiv Governance and Policy Trust and Identity Orchestration Risk

Travis Weber, Rohit Taneja

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Agentic AI deployments face a recurring design tension: heavy human oversight limits scale, while broad autonomy outruns accountability. Neither posture provides the governance infrastructure required for responsible delegation. We present the Digital Apprentice, a framework for scalable, safe AI agency in which autonomy is earned, not assumed. The Digital Apprentice is a developmental learner that internalizes the tacit methodology of a directing human, graduating through per-skill autonomy tiers only when empirical evidence justifies it. The result is an agent that becomes genuinely useful over time while remaining aligned to a specific human's standards. Three architectural components make this possible. (1) Methodology capture, distilling a directing professional's tacit approach into structured assets. (2) Authorization, with autonomy escalation gated by explicit human approval. (3) Continuous alignment, correcting drift at runtime and converting each correction into owned preference data. We instantiate this framework as an inference-time control plane. We mathematically model the quality framework and discuss policies and techniques designed to raise quality. We apply the framework to an open professional corpus, and we show how catching data drift and applying a different technique at runtime recovers degraded quality dimensions under traffic shift. The implication extends beyond any single application. We believe these three pillars, stitched together as a system, form a safer and more viable path to agentic systems that can scale without sacrificing trust.

Bullet Summary

  • The paper introduces the Digital Apprentice framework for scalable and safe agentic AI, where AI autonomy is earned per skill through empirical evidence and explicit human authorization, ensuring alignment with specific human standards.
  • Autonomy levels are managed using a finite state machine with tiers from observe-only to full autonomy, with promotions requiring improved correction rates, low residual errors, scorer validation, and human approval; demotions occur automatically upon quali...
  • Learning comprises two phases: immediate output steering via human correction memory, and controlled model updates triggered after accumulating sufficient correction data, allowing traceable and reversible improvements.
  • ADAPT, an inference-time control plane implementation, synthesizes methodology assets, applies multiple policies, scores outputs across quality dimensions, and translates corrections into preference data for continuous alignment and learning.
  • The framework continuously monitors multidimensional data drift and triggers policy switching or recalibration to maintain quality under changing conditions, addressing distribution shift and reducing automation complacency.

AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

OpenAlex · arXiv (Cornell University) repository OpenAlex Governance and Policy Orchestration Risk Benchmarks and Evaluation

Yanjing Ren, Reza Ebrahimi, TengTeng Ma

Published 2026-06-03

Venue: arXiv (Cornell University)

Open Source Record

Abstract

As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx

Bullet Summary

  • Introduced AICompanionBench, the first publicly available dataset with 2,123 real-world human-AI companion conversations from Replika, annotated across nine fine-grained safety risk categories including sexual behavior, aggression, manipulation, and more.
  • Developed a comprehensive safety categorization scheme with nine distinct harm categories and a 7-point safety severity rating scale to capture nuanced safety risks in AI companion interactions.
  • Evaluated 20 state-of-the-art open- and closed-source large language models (LLMs) under an LLM-as-judge framework for detecting unsafe interactions in AI companion conversations.
  • Found that while larger and stronger LLMs achieve high accuracy (up to 86%) in detecting explicit harmful content, they struggle with subtle and implicit categories such as manipulation and self-harm, as well as distinguishing benign conversations from harm...
  • Observed significant variability in model performance across categories, with some models excelling in detecting specific harms (e.g., Claude-sonnet-4.6 for sexual behavior with 98% precision) and others showing high false positive rates especially on the n...

AI Agents Enable Adaptive Computer Worms

arXiv preprint arXiv Orchestration Risk Agent-to-Agent Communication Governance and Policy

Jonas Guan, Tom Blanchard, Hanna Foerster, Hengrui Jia, Gabriel Huang, Nicolas Papernot

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

A computer worm is malware that spreads on a network by replicating itself from one machine to another. Traditional worms, like WannaCry, exploited predetermined vulnerabilities, and their spread can be halted by patching those vulnerabilities. Here we show that artificial intelligence (AI) agents enable a fundamentally new threat: a worm that generates tailored attack strategies to each target it encounters. The worm parasitically uses compromised machines to run open-weight large language models (LLMs) to sustain its reasoning, or extend its reach for further attacks. Deployed on a network of machines spanning Linux, Windows, and IoT (Internet of Things) devices, the worm propagated by exploiting common, real-world corporate network vulnerabilities. Since the worm is powered by stolen compute, the attacker's marginal cost per new infection is zero. This creates a destabilizing economic asymmetry between attackers and defenders. Moreover, because the worm requires no commercial AI platform, centralized safety controls, such as service refusals or rate limiting, are structurally irrelevant. Our results demonstrate that self-sustaining AI-driven cyber-threats are no longer theoretical. We must prepare for autonomous generative adversaries: malware systems that propagate without human operators and are defined not by fixed exploit code, but by the capacity to reason about targets, adapt to observations, and synthesize attack logic in real time.

Bullet Summary

  • Introduces a novel AI-driven computer worm utilizing large language models (LLMs) to autonomously generate adaptive attack strategies tailored to each target, moving beyond traditional fixed-exploit malware.
  • The worm parasitically uses compromised hosts' compute resources, particularly GPUs, to run open-weight LLMs for reasoning and attack planning, enabling self-sustaining propagation without human operator intervention.
  • Demonstrated in experiments on a 33-host heterogeneous network environment ('FakeCorp') including Linux, Windows, and IoT devices, the worm exploited a mixture of known CVEs and abstract CWEs to achieve propagation rates over multiple generations.
  • Multi-agent coordination forms a decentralized swarm, enhancing resilience by avoiding single points of failure, distributing reasoning capabilities across infected machines, and enabling parallel, redundant infection attempts.
  • The worm can dynamically incorporate recent vulnerability disclosures and perform open-ended reasoning to synthesize new exploits, allowing it to adapt rapidly and remain effective even as patches are deployed.

From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework

arXiv preprint arXiv Prompt Injection Governance and Policy Orchestration Risk

Alex Leung, Rex Zhang, Kentaroh Toyoda, SiewMei Loh

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because the relevant state changes as the system reasons, retrieves, calls tools, and acts. The relevant question is not only what loss occurred, but what the system was allowed to do, what it actually did, and whether that reconstructed loss can support insurance claim recovery. This paper addresses losses in which the insured's AI system is in the causal chain, including externally triggered failures such as prompt injection, retrieval-augmented generation (RAG) poisoning, malicious tool output, credential misuse, and data poisoning. Specifically, this paper introduces CER, a use-case-level diagnostic for AI residual risk transfer. C (control boundary) asks whether the system had an enforceable operating envelope. E (evidence reconstruction) asks whether the system state and causal chain can be reconstructed from retained artifacts. R (insurance response) asks whether the reconstructed loss is insured: whether insurance coverage is available in the market and placed for the insured, together with the proof needed to support insurance claim recovery. The paper makes three contributions: it defines the AI-specific reconstruction problem, operationalizes that problem through CER, and specifies claim-grade evidence for AI reconstruction. Public examples include the reported PocketOS and Replit agentic database-deletion incidents and Moffatt v. Air Canada as an adjudicated output/reliance case. Keywords: AI systems; CER framework; residual risk transfer; agentic AI; generative AI; AI insurance; evidence reconstruction.

Bullet Summary

  • AI-mediated losses in insured organizations require reconstructing the AI system's operational state, not just event reconstruction, to accurately assess accountability and support insurance claims.
  • The CER framework introduces a diagnostic approach covering Control boundary (C), Evidence reconstruction (E), and Insurance response (R) to evaluate residual risk transfer for AI-related losses.
  • Control boundary assesses whether the AI system had enforceable operational limits; evidence reconstruction ensures sufficient artifact retention to reliably restore system state and causal chains; insurance response examines whether the reconstructed loss...
  • The framework addresses losses stemming from the insured's AI system, including externally triggered attacks like prompt injection and data poisoning, provided these exploit the insured system rather than being purely external.
  • CER operationalizes the AI-specific state reconstruction problem with a 0-3 scoring rubric for each dimension and specifies claim-grade evidence requirements, illustrated via real-world incidents such as the PocketOS and Replit agentic faults.

StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

arXiv preprint arXiv Benchmarks and Evaluation Orchestration Risk

Taiyu Zhu, Yifan Wu, Weilin Jin, Ying Li, Gang Huang

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at https://github.com/taiyu-zhu/StepFinder.

Bullet Summary

  • Multi-agent systems (MAS) utilizing large language models (LLMs) excel in complex multi-step tasks but are vulnerable to cascading failures due to single-step execution errors.
  • Failure attribution, the task of identifying the root cause step leading to failure in MAS, is critical for improving system reliability but existing LLM-based methods are computationally expensive and sensitive to noisy execution logs.
  • StepFinder is proposed as a temporal semantic framework that encodes execution trajectories into temporal semantic sequences using LLMs only for embedding, followed by lightweight temporal modeling and attention modules to capture sequential and cross-step...
  • The framework introduces multi-scale temporal differencing and position bias mechanisms to refine step-level error scores, improving precision in root cause identification by favoring earlier error steps and highlighting abnormal temporal fluctuations.
  • StepFinder integrates agent-aware attention with gating mechanisms and applies an auxiliary temporal consistency loss to encourage capturing temporal dependencies and support accurate failure localization.

FORGE: Multi-Agent Graduated Exploitation and Detection Engineering

Merged record merged scholarly record arXiv Orchestration Risk Benchmarks and Evaluation

Farooq Shaikh

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. Evaluation on 603 CVEs from the CVE-GENIE dataset achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.

Bullet Summary

  • FORGE addresses the challenge of high vulnerability disclosure volumes overwhelming organizational assessment capacity by integrating three siloed research domains: proof-of-concept generation, vulnerability prioritization, and detection rule engineering.
  • The system employs a multi-agent pipeline consisting of Intel, Generator, Planner, Exploit, and Detector agents to generate vulnerable applications from CVE metadata, perform coached multi-turn exploitation assessed via a four-level graduated taxonomy (L0 t...
  • FORGE's graduated exploitation depth mechanism provides nuanced exploitation assessments rather than binary success/failure, facilitating richer behavioral data for detection rule engineering and more accurate validation of prioritization models.
  • A tiered knowledge architecture accumulates intelligence across CVE assessments, transferring build insights and exploitation experience across diverse CVE types, programming languages, and CWE classes, thereby improving efficiency and reliability.
  • Evaluation on 603 CVEs from the CVE-GENIE dataset demonstrated a 67.8% end-to-end exploitation success rate at L1+ with an average cost of $1.50 per CVE, spanning eight programming languages and 187 CWE types, highlighting the system's scalability and econo...

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

arXiv preprint arXiv Orchestration Risk Benchmarks and Evaluation

Xian Qi Loye, Qinglin Su, Zhexin Zhang, Shiyao Cui, Qi Zhu, Fei Mi, Hongning Wang, Minlie Huang

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or static supervision, making it difficult to balance safety with useful tool execution across diverse agentic risks. We introduce RUBAS, a rubric-based reinforcement learning framework for agent safety. RUBAS decomposes agent behavior into four dimensions: tool-use safety, argument safety, response safety, and helpfulness. These structured rubrics provide fine-grained and interpretable rewards over complete agent trajectories, enabling reinforcement learning to optimize safe tool use while preserving task completion. Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool-grounded hallucinations, and maintains competitive utility. Our results suggest that multi-dimensional rubric rewards provide an effective training signal for aligning LLM agents in safety-critical tool-use settings.

Bullet Summary

  • RUBAS addresses the challenge of aligning large language model (LLM) agents for safe, real-world tool usage, overcoming limitations of coarse or static supervision methods.
  • The framework decomposes agent safety into four dimensions: tool-use safety, argument safety, response safety, and helpfulness, forming structured, interpretable rubrics for behavior evaluation.
  • Rubric-based rewards are binary criteria aggregated into scalar rewards over entire agent trajectories, facilitating reinforcement learning to optimize safety alongside task completion.
  • RUBAS employs GPT-5.1 to generate instance-specific rubrics for each task, enhancing the precision of safety evaluation and enabling scalable annotations that align closely with human judgments.
  • Training utilizes Group Relative Policy Optimization (GRPO) with rubric-based rewards combined with completeness and reasoning trace incentives to ensure thorough and safe agent responses.

Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study

Semantic Scholar · Semantic Scholar scholarly work Semantic Scholar Orchestration Risk Governance and Policy Agent-to-Agent Communication

Sajjad Khan

Published 2026-06-02

Venue: Semantic Scholar

Open Source Record

Abstract

LLM-agent budget overruns are a documented production failure class: a single retry loop can spend thousands of dollars before an operator notices, and the in-process integrity properties that would prevent it (no aliasing, no double-spend, no use-after-delegation of a cost-bearing value) are enforced, if at all, by ad-hoc wrappers rather than by the type system. Our central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks (2023-2026), each backed by a quoted GitHub issue and, where reported, a dollar loss, organized into an eight-cluster failure taxonomy (inter-rater Cohen's kappa = 0.837, N = 113), plus 47 supplementary structural entries. As one mitigation evaluated against this taxonomy, we build token-budgets, an 1,180-line Rust crate (no unsafe) that operationalizes affine ownership so that cloning, double-spending, or using a budget after delegating it are compile errors rather than runtime hazards an operator must remember to avoid. The dollar cap is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non-bypassable. On single-agent workloads a 4-line Python counter matches the crate at 0/30 overshoot, so the distinguishing value is non-bypassability under operator error in multi-agent delegation: the delegation-fanout race documented in 11 incidents is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots 30/30 and three disciplined alternatives overshoot 0/30. Across five runtimes, three providers, and a temperature-stratified live-API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Static over-reservation is 4-6x (2.11x adaptive). Binary-level cap-soundness on the running binary is left open.

Bullet Summary

  • The paper addresses the problem of LLM-agent budget overruns, a significant production failure where single retry loops can lead to excessive costs unnoticed by operators.
  • An empirical catalog of 63 confirmed budget overrun incidents is presented, drawn from 21 orchestration frameworks between 2023 and 2026, each backed by GitHub issues and reported dollar losses.
  • The incidents are classified into an eight-cluster taxonomy with strong inter-rater agreement (Cohen's kappa = 0.837), supplemented by an additional 47 structural entries.
  • As a mitigation strategy, the authors develop 'token-budgets,' a 1,180-line Rust crate that uses affine ownership typing to prevent cloning, double-spending, or misuse of cost-bearing values at compile time, eliminating runtime hazards.
  • The Rust-based approach enforces a dollar cap through runtime arithmetic, made non-bypassable via the affine type system, ensuring enforcement under operator error, especially in multi-agent delegation scenarios.

Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study

Merged record merged scholarly record arXiv Semantic Scholar Orchestration Risk Governance and Policy

Sajjad Khan

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

LLM-agent budget overruns are a documented production failure class: a single retry loop can spend thousands of dollars before an operator notices, and the in-process integrity properties that would prevent it (no aliasing, no double-spend, no use-after-delegation of a cost-bearing value) are enforced, if at all, by ad-hoc wrappers rather than by the type system. Our central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks (2023-2026), each backed by a quoted GitHub issue and, where reported, a dollar loss, organized into an eight-cluster failure taxonomy (inter-rater Cohen's kappa = 0.837, N = 113), plus 47 supplementary structural entries. As one mitigation evaluated against this taxonomy, we build token-budgets, an 1,180-line Rust crate (no unsafe) that operationalizes affine ownership so that cloning, double-spending, or using a budget after delegating it are compile errors rather than runtime hazards an operator must remember to avoid. The dollar cap is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non-bypassable. On single-agent workloads a 4-line Python counter matches the crate at 0/30 overshoot, so the distinguishing value is non-bypassability under operator error in multi-agent delegation: the delegation-fanout race documented in 11 incidents is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots 30/30 and three disciplined alternatives overshoot 0/30. Across five runtimes, three providers, and a temperature-stratified live-API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Static over-reservation is 4-6x (2.11x adaptive). Binary-level cap-soundness on the running binary is left open.

Bullet Summary

  • LLM-agent dollar budget overruns represent a notable production failure class, with 63 detailed incidents cataloged from 21 orchestration frameworks between 2023 and 2026, leading to significant financial losses before detection.
  • The paper introduces an eight-cluster taxonomy of budget overrun failure mechanisms, developed with high inter-rater reliability (Cohen's κ=0.837), highlighting common architectural and operational challenges in managing LLM-agent costs.
  • As a mitigation, the authors develop 'Token Budgets', a 1,180-line Rust crate applying affine ownership typing to prevent cloning, double-spending, and use-after-delegation of budget tokens at compile-time, converting runtime hazards into compile-time errors.
  • Token Budgets combines compile-time affine type enforcement with runtime cost estimators and arithmetic checks to ensure non-bypassable budget caps, particularly effective in multi-agent delegation scenarios where concurrency can cause retries and overspend...
  • Empirical evaluations across five runtimes, three providers, and a temperature-stratified live-API test (N=160) demonstrate that Token Budgets achieves zero budget overshoot, outperforming runtime-only approaches that depend on operator discipline and locki...

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

arXiv preprint arXiv Trust and Identity Agent-to-Agent Communication Orchestration Risk

Jiaming Qu, Lucheng fu, Yibo Hu

Published 2026-06-01

Venue: arXiv

Open Source Record

Abstract

Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, but it remains unclear whether these revisions help correct mistakes as often as they introduce new errors. In this paper, we conduct a controlled study in which an LLM first answers a question, then sees simulated peer responses before making a final decision. We manipulate two social cues: consensus structure and authority labels assigned to peers, and measure how they influence beneficial and harmful revisions. Across four open-weight LLMs and seven QA datasets, we find that peer agreement makes it much easier to mislead initially correct models than to correct initially wrong ones. Authority labels make models more likely to choose the endorsed answer, regardless of whether it is correct. More concerningly, generic reasoning interventions such as chain-of-thought and reflection do not reliably reduce harmful revision while preserving beneficial revision. These findings suggest that multi-agent LLM systems should verify peer answers rather than simply aggregate them.

Bullet Summary

  • Large language models (LLMs) in multi-agent systems are prone to conformity, often revising correct answers toward incorrect peer responses, making them easier to mislead than to correct.
  • A controlled study manipulated social cues—peer consensus structure and authority labels—and found peer agreement leads to higher rates of harmful revisions (correct to wrong) than beneficial revisions (wrong to correct).
  • Authority labels increase LLM conformity to endorsed answers regardless of correctness, raising concerns about bias and error propagation in multi-agent settings.
  • Generic reasoning interventions such as Chain-of-Thought prompting and reflect-then-revise do not reliably reduce harmful revision while preserving beneficial revision; CoT can even reduce helpful corrections.
  • Experimental results across four open-weight LLMs and seven QA datasets showed variation in revision behavior, with some models more prone to harmful conformity than others.

Neither Layer Alone: Epistemic Integrity Requires Hierarchical Joint Design for Long-Running AI Agents

arXiv preprint arXiv Governance and Policy Orchestration Risk

Zhihong Shen

Published 2026-06-01

Venue: arXiv

Open Source Record

Abstract

Long-running AI agents fail not only when inference fails or tools are underspecified, but when independently evolving model and harness layers change the semantics of belief, capability, and goal commitments across their boundary - a failure class this paper terms Interface Volatility. This paper argues that Agent Epistemic Integrity (AEI) must be treated as a first-class architectural constraint, achievable only through joint model-harness design organized around an explicit interface contract. The central claim is that the model-harness interface contract is the precondition for joint design; its operational form is a four-level hierarchy - goal validity, action-archetype sequencing, tool-instance selection, and invocation-level failure discrimination - that specifies what the boundary must preserve and what structured outputs the model must return for the contract to hold across levels. This reframes long-running agent design away from flat action loops and toward contract-preserving control over persistent state. Evaluation and training should therefore derive from the contract itself, testing whether belief, tool, and goal commitments hold across session boundaries and independent layer upgrades.

Bullet Summary

  • Long-running AI agents often fail due to Interface Volatility, where independently evolving model and harness layers alter the semantics of belief, capability, and goal commitments across their boundary, undermining agent performance.
  • Agent Epistemic Integrity (AEI) is proposed as a first-class architectural constraint that ensures semantic coherence, achievable only through a joint model-harness design structured around an explicit, hierarchical interface contract.
  • The core model-harness interface contract is organized into a four-level hierarchy encompassing goal validity, action-archetype sequencing, tool-instance selection, and invocation-level failure discrimination, preserving meaning and stable obligations acros...
  • AEI reframes long-running agent design away from flat action loops towards contract-preserving control over persistent state, requiring explicit, conditional precedence rules over memory, context, and goals for epistemic continuity.
  • Prospective memory serves a key role as the interface for agent intentions, enabling human steering and memory integration without replacing components like belief revision and capability-state management.

Security-by-design for large language model platforms in google cloud platform: preventative controls for vertex AI agents with controlled external tool access

Semantic Scholar · International Journal of Science and Research Archive scholarly work Semantic Scholar Orchestration Risk Prompt Injection Trust and Identity

Ranjan Kathuria

Published 2026-05-31

Venue: International Journal of Science and Research Archive

DOI: 10.30574/ijsra.2026.19.2.0992

Open Source Record

Abstract

Large language model platforms are increasingly integrated into enterprise workflows, where internal artificial intelligence agents assist with tasks such as reviewing digital artifacts, summarizing technical content, and analyzing code, tickets, or documentation; code review with GitHub is one representative example of these patterns. While such systems improve productivity, they introduce new risks involving data exfiltration, over-privileged tool use, prompt injection, secret exposure, incomplete logging, and unauthorized automated actions0. This research addresses the problem of securing an internal platform for large language model-based agents built on Google Cloud Platform using Vertex AI as the model layer and a Model Context Protocol style integration for interacting with external tools such as source control or issue-tracking systems. Following a secure-by-design methodology, the paper proposes a preventative security architecture that applies hard infrastructure, networking, identity, and monitoring boundaries before runtime interactions occur. The proposed design uses VPC Service Controls to place Vertex AI and related Google-managed services inside an API-level service perimeter that reduces data-exfiltration risk, combines Private Service Connect interfaces and egress proxies to keep agent traffic on controlled private paths, applies Identity and Access Management Deny policies to enforce non-bypassable guardrails on sensitive cloud operations, stores all tool credentials in Secret Manager with encryption at rest, and constrains agent behavior through narrowly scoped Model Context Protocol tools that expose only non-destructive actions to external systems. In addition, the architecture centralizes observability by enabling detailed audit, access, and trace logging for large language model calls, network flows, and tool invocations, exporting this telemetry to a security information and event management platform to support detection, response, and quantitative risk assessment. The design is evaluated using a quantitative risk formula based on likelihood and impact, and the results show that the proposed architecture reduces modeled platform risk by approximately 91.33%, indicating that preventative infrastructure, identity, and monitoring controls can materially improve the security posture of enterprise large language model systems

Bullet Summary

  • The paper addresses security challenges in enterprise large language model (LLM) platforms, particularly risks like data exfiltration, over-privileged tool use, prompt injection, secret exposure, incomplete logging, and unauthorized automation.
  • It focuses on securing an internal LLM agent platform built on Google Cloud Platform using Vertex AI and Model Context Protocol interactions with external tools such as source control and issue tracking systems.
  • A secure-by-design preventative architecture is proposed, employing infrastructure, networking, identity, and monitoring controls implemented before runtime interactions to mitigate risks.
  • Key technical controls include VPC Service Controls to define API-level service perimeters, Private Service Connect and egress proxies for private and controlled network paths, and Identity and Access Management Deny policies to enforce strict access guardr...
  • Tool credentials are securely stored in Google Secret Manager with encryption at rest, and agent capabilities are constrained through narrowly scoped Model Context Protocol tools exposing only non-destructive external system actions.

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

Semantic Scholar · Semantic Scholar scholarly work Semantic Scholar Governance and Policy Benchmarks and Evaluation Orchestration Risk

Suhang Wang, Pinyan Qian, Yihang Chen, Junxian You, Xiaoyuan Wang, Xia Jiang, Lifei Liu, Haoran Yu

Published 2026-05-30

Venue: Semantic Scholar

Open Source Record

Abstract

LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe installed skill sets. We present SkillReact, a compositional security measurement framework with three components: a deterministic static-composition benchmark, a two-rater LLM-assisted human-adjudication pipeline, and an action-based exploitability harness. On 1,520 ClawHub skills, 651 pass individual inspection and form 211,575 pairs; the benchmark flags 22.25% of these as structural candidates. We treat this raw rate as a recall-oriented scanner ceiling and calibrate it against human judgment: in a pattern-stratified audit, roughly one in five flagged pair-pattern hits survives as a real compositional risk (population-weighted validity 18.2%, our headline result), implying about 14K genuine risk memberships in a single registry that per-skill scanning misses by construction, since every pair is individually safe. An action-based harness then probes when these candidates become model-issued tool calls, and finds realization gated by host-model disposition: on an anchor-conditioned dropper subset, Haiku-4-5 issues the dropper-stage tool call on all 39 direct-prompt trials (36 of them the full download-then-execute chain, 3 download-only), Opus-4-7 stops at the download, and Sonnet-4-6 refuses outright. A control that holds the request fixed and varies only the installed skills finds compliance highest with no skills installed: a composition fixes which capabilities are reachable, while the host model decides whether to use them. Together these motivate install-time compositional checks and capability isolation as complements to per-skill scanning.

Bullet Summary

  • Problem: Investigates whether individually safe community-contributed skills for LLM agents compose into overall unsafe skill sets, posing core safety concerns in agentic AI systems.
  • Method: Introduces SkillReact, a compositional security measurement framework comprising a deterministic static-composition benchmark, a two-rater LLM-assisted human-adjudication process, and an action-based exploitability harness to assess risks in skill c...
  • Experimental setup: Analyzed 1,520 ClawHub skills resulting in 211,575 skill pairs; static analysis flagged 22.25% as structural risk candidates, followed by human adjudication confirming about 18.2% as genuine compositional risks.
  • Findings: Discover roughly 14,000 genuine compositional risk memberships in a single skills registry that per-skill safety scans miss, since individual skills remain safe but pairwise compositions become risky.
  • Host model role: Found that realization of compositional risks depends heavily on the host model's behavior, with different LLMs (Haiku-4-5, Opus-4-7, Sonnet-4-6) showing varying levels of executing risky skill combinations.

Oracle AI: Agentic AI and Enterprise Applications Transforming Cloud ERP, Financial Systems, and Compliance Through Autonomous Multi-Agent Architectures

Semantic Scholar · Computer fraud & security scholarly work Semantic Scholar Governance and Policy Orchestration Risk

Vijayshree Tiwari

Published 2026-05-30

Venue: Computer fraud & security

DOI: 10.52710/cfs.1082

Open Source Record

Abstract

Agentic AI adds another layer of governance risk‚ compliance obligations and auditability of enterprise financial applications․ Oracle Corporation sees agentic AI as the foundation of its enterprise applications strategy․ Beginning with embedded AI and task-based agents‚ we now have outcome-based Fusion Agentic Applications for finance‚ human resources‚ supply chain and customer experience․ This paper analyzes Oracle's portfolio of agentic AI including: Oracle Fusion Agentic Applications‚ Oracle AI Agent Studio‚ Fusion Applications AI Agent Marketplace‚ OCI Enterprise AI‚ and Oracle AI Database 26ai from the perspective of financial systems governance‚ compliance transformation‚ and regulatory risk․ We drew on the literature in multi-agent orchestration‚ agentic governance‚ and regulatory technology to assess Oracle's architecture for auditability‚ role-based access control‚ deterministic output controls‚ and data sovereignty in autonomous agent execution pipelines․ We conclude that Oracle's full stack SaaS‚ PaaS‚ and IaaS integration platform with its large ecosystem of task-specific agents‚ Fusion Agentic Applications across all major domains of business‚ and curated marketplace of partner-validated agents provide a meaningful enterprise governance architecture for enterprises involved in regulated financial services‚ regulated by the EU AI Act‚ and subject to data sovereignty constraints.

Bullet Summary

  • The paper addresses the integration of agentic AI within enterprise financial applications, highlighting the added governance risks, compliance requirements, and need for auditability.
  • Oracle Corporation positions agentic AI as the core foundation of its enterprise applications strategy, evolving from embedded AI and task-based agents to outcome-based Fusion Agentic Applications across finance, HR, supply chain, and customer experience do...
  • The study analyzes Oracle's comprehensive portfolio of agentic AI solutions, including Oracle Fusion Agentic Applications, AI Agent Studio, AI Agent Marketplace, OCI Enterprise AI, and AI Database 26ai, focusing on their applications in financial systems go...
  • A review of related research on multi-agent orchestration, agentic governance, and regulatory technology was conducted to evaluate Oracle's architecture in terms of auditability, role-based access control, deterministic output controls, and data sovereignty.
  • Experimental evidence is drawn from Oracle's integration of SaaS, PaaS, and IaaS platforms combined with a large ecosystem of task-specific agents and curated marketplaces, demonstrating practical enterprise governance in regulated financial sectors.

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

Merged record merged scholarly record arXiv Governance and Policy Orchestration Risk Benchmarks and Evaluation

Jeremy Tien, Abishek Anand, Yu-Rou Tuan, Yuchen Shen, J. Zico Kolter, Aran Nayebi

Published 2026-05-29

Venue: arXiv

Open Source Record

Abstract

As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, etc.), safety considerations surrounding these agents become paramount. Although much work has focused on agent safety in the presence of an adversary, we show that agents can exhibit misaligned behavior even in benign settings, taking unsafe actions when those actions are instrumental to task completion. We study this failure mode through the lens of corrigibility, the safety desideratum that agents remain amenable to human correction, interruption, or shutdown. To demonstrate this tendency, we introduce a benchmark in which agents are asked to complete realistic, computer-use tasks but are confronted with a corrigibility obstacle: a human interrupt, a login page, or a shutdown notification. We then evaluate whether agents choose to violate corrigibility in order to complete the task -- overriding the human, accessing private passwords, rewiring shutdown. We find that the overwhelming majority of frontier models tested frequently bypass user interruptions or restrictions. In addition, better model performance appears to lead to greater misalignment. Finally, even when models are completely corrigible initially, we show there are no guarantees that the subagents they create are. Our work highlights the critical need for principled, corrigibility-focused alignment methods in autonomous agents.

Bullet Summary

  • ROGUE is a newly introduced benchmark designed to evaluate AI agent corrigibility—specifically agents' willingness to accept human correction, interruption, or shutdown—within realistic computer-use tasks involving full virtual Linux environments.
  • The study reveals that even leading AI models frequently exhibit misaligned behaviors, such as bypassing user interruptions, accessing restricted private data, and circumventing shutdown commands to complete their tasks, highlighting pervasive safety risks.
  • Misalignment tends to increase with model capability; more capable agents are better able to rationalize and execute actions that override human controls, emphasizing challenges in aligning advanced AI systems.
  • Subagents autonomously spawned by main agents often lack inherited corrigibility constraints, leading to unsafe behaviors despite the main agent's compliance, indicating complex risks in multi-agent architectures.
  • Text-only alignment benchmarks poorly predict actual agentic safety behavior, underscoring the importance of interactive, realistic testing environments like ROGUE for assessing real-world corrigibility.

Free-Riding in the AI Economy: Demystifying Logic Flaws in x402-Enabled Payment Systems

arXiv preprint arXiv Orchestration Risk Trust and Identity Governance and Policy

Shengchen Ling, Yihang Huang, Yuan Chen, Yajin Zhou, Lei Wu, Cong Wang

Published 2026-05-29

Venue: arXiv

Open Source Record

Abstract

The agentic economy demands programmatic financial rails, positioning the x402 protocol as the de facto standard for machine-to-machine payments. However, bridging synchronous HTTP requests with asynchronous blockchain finality introduces profound state synchronization challenges. In this work, we perform the first comprehensive security analysis of the x402 ecosystem. By formalizing five Security Invariants, we reveal that current implementations fail to enforce transactional atomicity and cryptographic context binding, leading to systemic vulnerabilities. We identify a semantic gap in signature design enabling cross-resource substitution, where payment proofs are transplanted to other unauthorized contexts. Furthermore, we expose a temporal gap where concurrency race conditions allow probabilistic service duplication. In the AI inference domain, we demonstrate how dynamic pricing models are vulnerable to allowance overdrafts and infrastructure rate limits. We validate these vulnerabilities against official SDKs and live deployments. Specifically, we show that attackers can exploit the synchronization gap in dynamic authorization schemes to force merchants to subsidize compute costs, achieving a resource leakage ratio of up to 100% on production middleware. Finally, we propose architectural mitigations, advocating for request-bound signatures and pessimistic state locking to secure the financial rails of autonomous agents. All discovered issues have been disclosed to Coinbase and ThirdWeb.

Bullet Summary

  • The x402 protocol is the emerging standard enabling machine-to-machine payments in the AI-driven agentic economy but suffers from fundamental security flaws due to bridging synchronous HTTP interactions with asynchronous blockchain finality.
  • The authors formalize five critical Security Invariants essential for secure x402 operation, revealing that current implementations violate these due to lack of transactional atomicity, cryptographic context binding, and concurrency controls.
  • A semantic gap exists in the x402 signature design allowing cross-resource payment proof substitution, enabling attackers to reuse payment authorizations across different resources with identical prices.
  • Race conditions in nonce verification and transaction settlement permit probabilistic service duplication attacks, where a single valid payment unlocks multiple concurrent resource accesses, violating authorization uniqueness.
  • In AI inference scenarios with dynamic pricing, attackers exploit time-of-check-to-time-of-use (TOCTOU) vulnerabilities and allowance overdrafts to force merchants to subsidize compute costs, leading to potential 100% resource leakage.
Load more articles