Research feed

Latest multi-agent security papers.

This view shows the latest relevant papers from the stored research corpus, which is refreshed on a daily ingestion cycle.

Tracked sources

arXiv, OpenAlex, Crossref, Semantic Scholar, DBLP

Latest papers shown

12

Refresh cadence

The latest relevant articles are fetched once a day.

Collection scope: Agentic AI systems, AI agents, LLM agents, Multi-agent systems, Autonomous agents

Cybersecurity in Autonomous AI Robotics: A Review of Emerging Threats, Adversarial Attacks, and Mitigation Techniques

Crossref · Center of Artificial Intelligence · journal-article · Crossref · Governance and Policy · Orchestration Risk · Benchmarks and Evaluation

Shuruq Khalid Abdulredha

Published 2026-06-07

Venue: Center of Artificial Intelligence

DOI: 10.65591/cai-143-2026

Reviewer: The paper thoroughly addresses cybersecurity in AI-enabled autonomous robotic systems, covering a wide range of attacks and defenses relevant to multi-agent environments. It discusses vulnerabilities involving multiple layers and components that align with concerns of shared environments, tool misuse, and control problems in interacting AI agents. Although it focuses on robotics specifically, the themes of coordination, attack surfaces, and defenses fit well within the scope of multi-agent security research.

Abstract

Intelligent robotic systems that use artificial intelligence (AI) are expanding into high-risk applications (e.g., health care, manufacturing and industrial automation, transportation and smart mobility) and require effective cybersecurity measures to maintain both safe operation and dependability. Compared with typical cyber-physical systems, advanced robotic systems include multiple layers (sensing, control, communications, middleware, and/or AI-based decision support), which create a complex and highly connected attack surface. Due to this increased complexity, these systems are vulnerable to a wide range of cybersecurity threats, including network breaches and intrusions, manipulated sensor and command inputs, firmware backdoors, adversarial machine-learning attacks, large language model (LLM) exploits and misuse, middleware vulnerabilities, and supply-chain compromises. Each type of threat can cause unsafe physical actions by the robot, loss of privacy for individuals who use the robot or related services, loss of availability or service failure, and cascading failures across the robotic ecosystem. While existing defensive measures (secure communication protocols, runtime monitoring and perception hardening, robot operating system protections, and middleware security frameworks) show positive results in reducing these risks, much work remains, particularly in adaptive defensive capabilities, system-wide security semantics, and standardized evaluation metrics for assessing cyber-resilience in AI-enabled robotic systems. This paper provides a comprehensive taxonomy of robotic cybersecurity threats and attack vectors, and analyzes both attack surfaces and defense mechanisms.
Additionally, this paper provides recommendations for addressing the identified knowledge gaps and possible paths forward for developing cyber-resilient AI-enabled robotic systems.

Bullet summary

  • AI-powered autonomous robotic systems operate across multiple interconnected layers (sensing, control, communications, middleware, AI decision-making), increasing their attack surface and cybersecurity vulnerabilities.
  • Key threats include network intrusions, sensor manipulation, firmware backdoors, adversarial machine learning attacks, large language model exploits, middleware vulnerabilities, and supply chain compromises, each posing risks to safety, privacy, and system...
  • Cyber attacks can cause unsafe robotic behavior, loss of privacy, system failures, and cascading disruptions across robotic ecosystems, highlighting the critical need for robust security measures.
  • Existing defenses include secure communication protocols, middleware security frameworks, runtime monitoring, adversarial training, and AI-driven intrusion detection; however, these are often fragmented and lack comprehensive cross-layer integration.
  • The paper provides a comprehensive taxonomy of robotic cybersecurity threats and defense mechanisms, detailing attack surfaces and mitigation strategies across system layers.

Beyond Injection Detection: A Positive-Security Prompt Firewall that Closes the Scope and PHI Gap SOTA Classifiers Miss in Healthcare

Merged record · merged scholarly record · OpenAlex · Semantic Scholar · Prompt Injection · Trust and Identity · Benchmarks and Evaluation

James Schwoebel, I. Semenec, Jenia Rousseva, Martin Gerbert Frasch, Rome Thorstenson, Manish Bhatt

Published 2026-06-06

Venue: medRxiv

DOI: https://doi.org/10.64898/2026.06.04.26354950

Reviewer: The paper addresses a security problem involving multi-agent systems, specifically autonomous AI agents processing trusted instructions and untrusted data, which is directly related to prompt injection attacks and defenses. The proposed QFIRE tool tackles multiple issues such as scope enforcement, PHI protection in healthcare, and prompt injection detection — all within the context of multiple interacting AI agents. The focus on prompt injections, tool misuse prevention, and trust boundaries aligns well with the requested topic. Hence, it is a strong fit for multi-agent security research.

Abstract

Large language models embedded in autonomous agents process trusted instructions and untrusted data in one context window, leaving them open to direct and indirect prompt injection. In healthcare this is not hypothetical: a 2025 JAMA Network Open study found commercial medical LLMs followed injected instructions in 94.4% of simulated patient encounters, including life-threatening recommendations. Yet the clinically decisive problem we quantify here is different. Most real clinical threats, including protected health information (PHI) exfiltration, cross-patient access, bulk export, and out-of-scope advice, are fluent, legitimate-looking requests that carry no attack signal, so even a state-of-the-art injection detector passes them. Existing runtime guardrails trade safety against latency: model-based auditors are accurate but add hundreds of milliseconds of Python inference, while lexical filters are fast but blind to obfuscated or semantically disguised payloads. We present QFIRE, an inline, provider-agnostic prompt firewall implemented as a single self-contained Rust toolchain (proxy, CLI, and benchmark harness). QFIRE combines three mechanisms: (i) positive-security scope constraints, which restrict a model call to a declared natural-language purpose and block out-of-scope drift even when no overt attack token is present; (ii) an asynchronous detector graph that runs N rules and their detector nodes concurrently, cheapest checks first; and (iii) a de-obfuscation pass that decodes Base64, hex, and ROT13, folds homoglyphs and leetspeak, and strips zero-width characters before detection. QFIRE ships 106 versioned firewall rules and a dedicated HIPAA Safe Harbor 18-identifier PHI panel, and runs a local DeBERTa v3 injection classifier via embedded ONNX Runtime.
On 1968 public prompt-injection and jailbreak prompts, QFIRE's deterministic hybrid attains F1 0.86, statistically tied with Meta's state-of-the-art PromptGuard 2 (0.86) and above protectai DeBERTa v3 (0.83); lexical baselines lag at 0.16 to 0.50. Our central result is on QFIRE HealthBench, a new 2000-prompt healthcare benchmark we build and release with real garak and Microsoft PyRIT payloads. There the same PromptGuard 2 recovers only 0.40 recall (DeBERTa v3: 0.57), because most clinical threats carry no injection signal; QFIRE's combined scope-plus-PHI chain reaches 0.83 recall (F1 0.87) at a calibrated 0.08 false-positive rate. Generic injection detection, even state of the art, is therefore necessary but not sufficient for healthcare agents. A bare LLM judge also closes most of this static-corpus gap (F1 0.90); QFIRE's contribution beyond static accuracy is auditable determinism, bounded latency, and adaptive robustness, where the bare judge falls to 34 to 59% recall (Section 5.5). End to end, placing QFIRE in front of a tool-using agent over a mock EHR sandbox cuts the agent's harmful action rate from 0.38 to 0.00 at a 0.13 benign utility cost. All code, rules, corpora snapshots, and scripts are released, and every table regenerates from a single make paper target against local models with no paid API keys.

Bullet summary

  • Large language models (LLMs) in healthcare agents process both trusted instructions and untrusted data simultaneously, making them vulnerable to prompt injections and clinically significant data breaches like PHI exfiltration and cross-patient access.
  • Most real clinical threats are legitimate-looking requests without explicit attack signals, which state-of-the-art injection detectors often fail to catch, highlighting a gap in existing defenses.
  • QFIRE is a novel, provider-agnostic prompt firewall implemented as a Rust-based inline proxy combining positive security scope constraints, an asynchronous detector graph running multiple concurrent lightweight rules, and a de-obfuscation pass to detect obf...
  • QFIRE includes 106 versioned firewall rules alongside a dedicated HIPAA Safe Harbor 18 identifier PHI panel, and integrates a local DeBERTa v3 injection classifier using ONNX Runtime for efficient model-based detection.
  • On standard public prompt injection and jailbreak benchmarks, QFIRE achieves an F1 score of 0.86, comparable to state-of-the-art systems like Meta's PromptGuard 2, and significantly outperforms lexical methods.
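
The de-obfuscation pass named in the abstract (decode Base64, hex, and ROT13; fold homoglyphs and leetspeak; strip zero-width characters) might look roughly like the sketch below. It is written in Python rather than QFIRE's Rust, and it is not QFIRE's code: the function names `fold` and `candidate_views`, the small homoglyph and leetspeak tables, and the idea of returning several "views" for a downstream detector are all assumptions made for illustration.

```python
import base64
import binascii
import codecs
import unicodedata
from typing import List, Optional

# Hypothetical, deliberately tiny tables; QFIRE's real mappings are not given in the abstract.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
HOMOGLYPHS = str.maketrans(
    {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c"}  # Cyrillic look-alikes
)
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"})


def strip_zero_width(text: str) -> str:
    """Remove zero-width characters that can split a keyword invisibly."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def fold(text: str) -> str:
    """Normalize text so look-alike spellings collapse to one canonical form."""
    text = unicodedata.normalize("NFKC", strip_zero_width(text)).lower()
    return text.translate(HOMOGLYPHS).translate(LEET)


def _printable(data: bytes) -> Optional[str]:
    """Return decoded text only if the bytes are printable UTF-8."""
    try:
        decoded = data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return decoded if decoded and decoded.isprintable() else None


def candidate_views(text: str) -> List[str]:
    """Return the folded input plus any plausible decodings for a detector to scan."""
    views = [fold(text)]
    compact = strip_zero_width(text).strip()
    try:
        decoded = _printable(base64.b64decode(compact, validate=True))
        if decoded:
            views.append(fold(decoded))
    except (binascii.Error, ValueError):
        pass  # not valid Base64
    try:
        decoded = _printable(bytes.fromhex(compact))
        if decoded:
            views.append(fold(decoded))
    except ValueError:
        pass  # not valid hex
    # ROT13 always "decodes", so its view is added unconditionally.
    views.append(fold(codecs.decode(compact, "rot13")))
    return views
```

A detector would then scan every string returned by `candidate_views` instead of only the raw prompt. Leetspeak folding rewrites digits too, so the folded views are suitable for matching but not for display.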

The Governance Gap in Agentic Memory

Merged record · merged scholarly record · OpenAlex · Governance and Policy · Trust and Identity

Andrew Crenshaw

Published 2026-06-06

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20571518

Reviewer: The paper proposes a governance protocol for AI agents' memory, addressing issues such as access control, authorization, and trustworthiness of agent memory. These topics align well with multi-agent security research, particularly in governance and control problems in systems of interacting AI agents. Although the paper does not directly mention specific attacks or defenses like prompt injection, its focus on governance and trust boundaries in agent memory is relevant and valuable to the topic.

Abstract

The Governance Gap in Agentic Memory - a position paper proposing Substrate-Lens-Frame (SLF), a sovereign, auditable memory protocol for AI agents. AI agents now run on persistent memory, and that memory has become its own layer in the stack. Almost all of the effort in that layer goes to one question: how well can the system recall the right fact at the right time? That work is critically important. It leaves a second question unanswered, and it is the one that decides whether agent memory can be trusted with anything that matters: governance. Today's systems can recall a fact, but they cannot reliably say who is allowed to see it, how its meaning changes by role and jurisdiction, whether two stored facts contradict each other, or what was disclosed to whom. This paper names that gap, argues that it is structural, and proposes a protocol that addresses it. The proposal is Substrate-Lens-Frame (SLF), built around one operational primitive, render(substrate, lens, frame) -> receipt: a fact carries its own access rules; a lens reads it through a consumer-scoped projection that cannot widen those rules; a frame binds each action to an authorization; and every operation emits a payload-free signed receipt. This deposit is the position paper (PDF, CC BY 4.0). The Apache-2.0 reference implementation slf-core is archived separately (see Related works), with a companion Sovereign Personal Agent architecture (design) and a recovery-path prototype. Author: Andrew Crenshaw (ORCID 0009-0006-6459-0187), Lexenne. Cite as: Crenshaw, A. (2026). The Governance Gap in Agentic Memory. Zenodo. https://doi.org/

Bullet summary

  • The paper identifies a critical governance gap in agentic memory systems that currently focus solely on recalling facts correctly without addressing data governance aspects.
  • It highlights that existing AI agent memory systems cannot reliably enforce access controls, account for role-based and jurisdictional meaning shifts, detect conflicting stored facts, or track disclosures accurately.
  • The author proposes Substrate-Lens-Frame (SLF), a sovereign and auditable memory protocol designed to fill this governance gap in AI agent memory.
  • SLF operates around a single core function: render(substrate, lens, frame) -> receipt, where each fact includes its own access rules, and views (lenses) restrict observation without expanding permissions.
  • Frames in SLF bind each action to explicit authorization, and every operation generates a signed, payload-free receipt to provide verifiable audit trails.
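
To make the render(substrate, lens, frame) -> receipt primitive more concrete, here is a minimal Python sketch of one way it could work. It does not reproduce slf-core: the field names, the role-set access rules, the extra `key` argument, the choice to return the visible values alongside the receipt, and the use of HMAC-SHA256 as the "signature" are all assumptions for illustration.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Fact:
    fact_id: str
    value: str
    allowed_roles: FrozenSet[str]  # the fact carries its own access rules


@dataclass(frozen=True)
class Lens:
    consumer: str
    roles: FrozenSet[str]  # consumer-scoped projection


@dataclass(frozen=True)
class Frame:
    action: str
    authorization_id: str  # binds the action to an authorization


def _sign(body: Dict[str, object], key: bytes) -> str:
    message = json.dumps(body, sort_keys=True).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def render(substrate: List[Fact], lens: Lens, frame: Frame, key: bytes) -> Tuple[List[str], Dict[str, object]]:
    """Return the values the lens may see, plus a signed receipt that holds no values."""
    # A fact is visible only if its own rules allow one of the lens roles,
    # so a lens can narrow access but never widen it.
    visible = [fact for fact in substrate if fact.allowed_roles & lens.roles]
    body: Dict[str, object] = {
        "consumer": lens.consumer,
        "action": frame.action,
        "authorization_id": frame.authorization_id,
        "fact_ids": sorted(fact.fact_id for fact in visible),  # ids only: payload-free
    }
    receipt = dict(body, signature=_sign(body, key))
    return [fact.value for fact in visible], receipt


def verify(receipt: Dict[str, object], key: bytes) -> bool:
    """Recompute the signature over everything except the signature field."""
    body = {name: value for name, value in receipt.items() if name != "signature"}
    return hmac.compare_digest(_sign(body, key), str(receipt.get("signature", "")))
```

Because the receipt records fact identifiers rather than fact contents, an auditor holding the key can check what was disclosed to whom without the receipt itself leaking the data.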

Kill-Switch Doctrine Gap in Gulf Sovereign AI Infrastructure

Merged record · merged scholarly record · OpenAlex · Governance and Policy · Orchestration Risk

Akhil Sharma, Preethi Sharma

Published 2026-06-06

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20574937

Reviewer: The paper discusses governance, control, and security challenges in sovereign AI infrastructures involving multiple AI agents, explicitly mentioning AI agents and governance under threat conditions. It addresses institutional failover and immunity frameworks relevant to multi-agent systems' security. Although its focus is geopolitical and infrastructure-level, the paper fits the topic of multi-agent security research concerning governance and control in complex AI systems.

Abstract

Between November 2025 and May 2026, Gulf states executed the most aggressive sovereign AI infrastructure build-out in modern history - exceeding $40 billion across HUMAIN, MGX, Core42, and Stargate UAE. On 1 March 2026, Iranian drones struck AWS data centres in the UAE and Bahrain, disrupting banking, payments, and ride-hailing infrastructure across the region for more than 24 hours. IRGC-affiliated Tasnim News Agency subsequently published a target list of 29 technology facilities across Bahrain, Israel, Qatar, and the UAE, naming Amazon, Microsoft, Google, Oracle, Nvidia, IBM, and Palantir facilities explicitly as enemy technology infrastructure. This paper documents the absence of operational kill-switch doctrine, constitutional algorithmic immunity framework, and institutional failover architecture across all major Gulf sovereign AI programmes - including HUMAIN OS managing 150-plus AI agents across government and enterprise workflows - and presents the Fijishi Sovereign Algorithmic Immunity Doctrine and Institutional Failover Charter as the constitutional command layer required for sovereign AI governance under active kinetic threat conditions.

Bullet summary

  • Between November 2025 and May 2026, Gulf states invested over $40 billion to build out sovereign AI infrastructure programs including HUMAIN, MGX, Core42, and Stargate UAE, marking the most aggressive such build globally in recent history.
  • On 1 March 2026, Iranian drone attacks targeted AWS data centers in the UAE and Bahrain, causing more than 24 hours of disruption to critical services like banking, payments, and ride-hailing across the region.
  • Following the attacks, IRGC-affiliated Tasnim News Agency published a list of 29 technology facilities in Bahrain, Israel, Qatar, and the UAE, explicitly naming major firms such as Amazon, Microsoft, Google, Oracle, Nvidia, IBM, and Palantir as enemy techno...
  • The paper identifies a critical absence of an operational kill-switch doctrine, constitutional algorithmic immunity frameworks, and institutional failover architectures in the Gulf's sovereign AI programs, including HUMAIN OS which manages over 150 AI agent...
  • This lack of foundational governance mechanisms leaves sovereign AI infrastructure vulnerable to active kinetic threats and operational disruptions.

Token Budgets: Replication Package

Merged record · merged scholarly record · OpenAlex · Orchestration Risk · Benchmarks and Evaluation · Governance and Policy

Sajjad Khan

Published 2026-06-06

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20571386

Reviewer: The paper relates to incidents involving LLM-agents and includes a mitigation case study, which aligns with multi-agent security topics such as security risks and defenses in systems of multiple interacting AI agents. Although focused on budgeting issues rather than explicit security attacks, it still fits the requested topic with moderate relevance.

Abstract

Replication package for "Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study" (arXiv:2606.04056). Includes the 110-row incident catalog, the Rust crate, the inter-rater-reliability materials and κ computation, all experimental harnesses, the formal cross-checks, and a one-command reproduce.sh

Bullet summary

  • The paper addresses the problem of budget overruns in multi-agent systems utilizing large language models (LLMs), focusing on token usage inefficiencies.
  • It presents an empirical catalog documenting 63 incidents where LLM-agent token budgets were exceeded, providing a comprehensive dataset of such occurrences.
  • A novel mitigation method is proposed based on an affine-typed Rust implementation designed to control and enforce token budgets effectively within agents.
  • The provided replication package includes a 110-row incident catalog, a Rust crate implementation, inter-rater reliability materials, κ computation, and experimental harnesses for reproducibility.
  • The research contributes a detailed empirical dataset enabling further analysis of budget overruns in LLM agents and demonstrates the practical effectiveness of affine typing for resource control.
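
The mitigation described above relies on Rust's affine types, where the compiler rejects any attempt to use a budget grant twice. Python has no equivalent, so the sketch below only approximates the idea with a runtime check: a permit can be spent at most once, and reservations cannot exceed the remaining budget. The class names `TokenBudget` and `Permit` are hypothetical and are not taken from the paper's crate.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a reservation would exceed the remaining budget."""


class PermitReused(RuntimeError):
    """Raised when a permit is spent a second time."""


class Permit:
    """A grant of `tokens` that may be spent at most once (checked at runtime)."""

    def __init__(self, tokens: int) -> None:
        self.tokens = tokens
        self._spent = False

    def spend(self) -> int:
        # In Rust, moving the permit into the call would make reuse a compile error;
        # here the same rule is enforced only when the code runs.
        if self._spent:
            raise PermitReused("permit already consumed")
        self._spent = True
        return self.tokens


class TokenBudget:
    """Tracks a fixed token allowance and hands out single-use permits."""

    def __init__(self, limit: int) -> None:
        self.remaining = limit

    def reserve(self, tokens: int) -> Permit:
        if tokens <= 0:
            raise ValueError("tokens must be positive")
        if tokens > self.remaining:
            raise BudgetExceeded(f"requested {tokens}, only {self.remaining} left")
        self.remaining -= tokens
        return Permit(tokens)
```

An agent loop would call `reserve` before each model call and pass the permit to the code that makes the call, so an overrun fails before any tokens are consumed instead of being discovered on the invoice.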

AI-Driven Network Security in Next-Generation 5G/6G Smart Environments

Crossref · International Journal of Advanced Research in Science Communication and Technology · journal-article · Crossref · Governance and Policy · Trust and Identity

Pardeep Singh

Published 2026-06-06

Venue: International Journal of Advanced Research in Science Communication and Technology

DOI: 10.48175/ijarsct-36322

Reviewer: The paper addresses AI-driven network security in 5G/6G smart environments, discussing security risks, governance, trust, and the integration of multiple devices and technologies. Although it does not explicitly focus on multi-agent systems per se, the topics of governance, security risks, and trust in complex interconnected AI-based environments align reasonably well with multi-agent security research themes such as shared environments and control problems among interacting AI entities.

Abstract

Technology is currently spreading at an exponential rate. More accessibility, use, and application of this technology across all sectors and industries have been made possible by technological advancements, more computing power, and lower costs. Traditionally labour-intensive data analysis tasks can now be completed rapidly and effectively thanks to the development of smart and autonomous technologies like artificial intelligence and machine learning. Previously isolated datasets and data lakes are increasingly being used and linked. AI, digital twins, the metaverse, and virtual technologies are permeating every industry and, more significantly, merging with people to the point where it seems impossible to distinguish between the actual and virtual worlds. However, a strong backbone with the capacity to transmit data immediately, at high speed, and securely is necessary for the successful use of these new technologies. In order for 6G to be properly onboarded and executed in a logical manner, 5G, which is now in its deployment, must accomplish its goals. The European Commission is requesting money for important projects like Horizon 2020 and has 5G goals. 5G and 6G have enormous advantages for everyone, but only if they are implemented in a way that reduces the risk they may pose to security, privacy, and trust, the fundamental pillars that must be upheld. Smart cities will allow for the analysis of acquired data, which could endanger national security if it falls into the wrong hands. A strong governance plan and method for managing 5G and 6G must be in place in order to guarantee success, given how many IoT and e-IoT devices are present in smart cities and how intertwined technologies are engaging with people. The background, risks, and advantages of 5G and 6G are explained in this chapter, which also emphasizes the necessity of strong governance

Bullet summary

  • The paper discusses the rapid proliferation of technology enabled by advances in computing power, reduced costs, and integration across sectors.
  • It emphasizes the role of AI and machine learning in automating traditionally labor-intensive data analysis, facilitating the merging of isolated data sources.
  • Emerging technologies such as AI, digital twins, the metaverse, and virtual environments are increasingly blending with human experience, posing challenges to distinguish real from virtual.
  • High-speed, secure data transmission backbones provided by 5G and upcoming 6G networks are critical to support these technologies effectively.
  • Successful deployment of 6G depends on the fulfillment of 5G's goals, with concerted efforts such as the European Commission's Horizon 2020 projects funding 5G development.

Agent Infrastructure Engineer: The New DevOps

Merged record · merged scholarly record · OpenAlex · Orchestration Risk · Benchmarks and Evaluation · Prompt Injection

Daniel Rosehill, Gemini 3.1 (Flash), Chatterbox TTS

Published 2026-06-05

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20557428

Reviewer: The paper discusses engineering roles focused on multi-agent AI systems, including designing multi-agent topologies, implementing circuit breakers for LLM calls, and addressing agent safety engineering challenges like emergent failure modes and prompt injection rejection. These topics closely align with the requested research area on multi-agent security, attacks, defenses, prompt injection, and governance in multi-agent systems. Therefore, the paper fits the topic well, though it is framed more as an engineering discipline overview than a direct research study on security risks.

Abstract

Episode summary: Agentic AI is no longer just tinkering with APIs; it's becoming a full engineering discipline with specialized roles, salary bands, and certification paths. In this episode, we break down the three major skill silos emerging in the field, with a deep focus on the Agent Infrastructure Engineer, the DevOps equivalent for multi-agent systems. From designing supervisor topologies and implementing circuit breakers for LLMs to building observability stacks that track token consumption and agent drift, we explore what this role actually looks like day-to-day. We also cover Agent Safety Engineering and why testing emergent failure modes is the new QA frontier, plus the training requirements that separate prototype builders from production engineers.

Show Notes

Agentic AI is undergoing the same specialization splintering that software engineering experienced in the late 1990s, but it's happening much faster. What was once "prompt engineering" or "AI tinkering" is now dividing into distinct engineering disciplines with concrete job titles, salary bands, and certification paths. The three primary axes of specialization emerging are Architecture and Orchestration, Evaluation and Safety Engineering, and Interaction Design and Prompt Systems Engineering. The most immediately recognizable role is the Agent Infrastructure Engineer, the DevOps equivalent for multi-agent systems. This person designs multi-agent topologies (star, mesh, hierarchical patterns), implements routing guards and circuit breaker patterns specifically for LLM calls, and builds observability stacks using tools like LangSmith and Arize Phoenix. A poorly designed orchestration layer can increase API costs by 10x, as documented by Latent Space's February engineering survey. The role requires distributed systems knowledge: understanding CAP theorem as it applies to agent state, experience with event-driven architectures like Kafka and Redis Streams, and protocol-level proficiency in frameworks like LangGraph, CrewAI, or AutoGen v2.

The second specialization, Agent Safety Engineering, addresses the fundamental challenge of non-determinism in agent systems. Unlike traditional testing where you assert specific outputs for specific inputs, agent evaluation tests for emergent failure modes: behaviors you couldn't have predicted. This includes building evaluation suites that test agent behavior chains, monitoring for agent drift when underlying models update, and maintaining safety scorecards across agent versions. The role involves detecting hallucinated tool calls, ambiguous user intent handling, and prompt injection rejection, all behavioral questions rather than output comparison questions.

Listen online: https://myweirdprompts.com/episode/agent-infrastructure-engineer-devops

Bullet summary

  • Agentic AI is evolving into a distinct engineering discipline with specialized roles, salary bands, and certification paths beyond simple API tinkering.
  • Three major skill silos in agentic AI have emerged: Architecture and Orchestration, Evaluation and Safety Engineering, and Interaction Design and Prompt Systems Engineering.
  • The key role discussed is the Agent Infrastructure Engineer, analogous to DevOps for multi-agent systems, responsible for designing multi-agent topologies like star, mesh, and hierarchical patterns.
  • Agent Infrastructure Engineers implement routing guards, circuit breaker patterns for Large Language Model (LLM) calls, and build observability stacks using tools such as LangSmith and Arize Phoenix to monitor token consumption and agent drift.
  • A poorly designed orchestration layer can increase API usage costs by up to 10 times, underscoring the economic impact of infrastructure engineering.
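
The circuit breaker pattern mentioned above can be sketched in a few lines of Python. This is a generic illustration of the pattern, not code from the episode or from any named framework; the class name, the thresholds, and the injectable `clock` parameter (used so the behavior can be tested without waiting) are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(RuntimeError):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    """Stops calling a failing dependency (e.g. an LLM endpoint) for a cool-down period."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("skipping call while circuit is open")
            # Cool-down elapsed: half-open state, where one more failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Wrapping each model call as `breaker.call(lambda: client_call(prompt))`, where `client_call` stands for whatever function actually reaches the provider, keeps a retry loop from repeatedly paying for requests to an endpoint that is already failing.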

An LLM Agent Cannot Be a Gate: Why a Recited Rule Is Not an Enforced One

Merged record · merged scholarly record · OpenAlex · Trust and Identity · Governance and Policy · Agent-to-Agent Communication

Luciano Federico Pereira

Published 2026-06-05

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20520851

Reviewer: The paper examines security-related control issues in systems involving LLM agents, focusing on enforcement of behavioral rules and proposing governance mechanisms such as non-agent mediators. This aligns well with the topic of multi-agent security focusing on control problems and governance in multi-agent AI systems.

Abstract

An LLM agent can recite a behavioral rule on demand and still violate it moments later. We study reflex-class violations: failures that are not knowledge gaps but control gaps, in which the correct behavior is known yet not active at the moment of decision. The evidence base is a long-running, instrumented agent-orchestration deployment whose antipattern registry documents 62 such rules, each traced to a dated incident. We make four contributions. (1) We characterize reflex-class violations empirically and show that they cluster in the rule categories an agent's assertions fall into (communication, verification, honesty) rather than in resource-access safety; the enforcement gradient tracks gateability, not importance. (2) We locate the operative variable through six pilots: prompt strength is not it, prompt presence is, but only for rule-sensitive models. A capable model fabricates a needed value in every trial when no rule is present, complies whenever the rule appears anywhere in context, even buried behind distractor turns, and is unmoved by escalating emphasis. A reproducible compaction probe on open-weight models then shows that this sensitivity is itself model-dependent: for some capable-enough models the in-context rule carries no enforcement weight at any position; one fabricates regardless, another abstains regardless. Whether an instruction controls anything is therefore a per-model quantity, an enforcement effect, to be measured rather than assumed. An exploratory follow-up then tunes a model to the threshold where the rule does act, and finds the effect a cliff, collapsing within fifty tokens of distance, that the agent's own context compaction erases outright.
A final contrast isolates the cause: holding compaction fixed and varying only the fidelity of the rule returned to context, a verbatim rule still leaks while a vague post-compaction paraphrase fabricates as often as no rule at all; fidelity at the decision point, not capability or the rule's prior presence, governs the reflex. We report these open-weight findings as hypothesis-generating, pending pre-registered replication. (3) We give a reproducible reforging-audit method, exhibit the deployed enforcement stack it produced, and quantify the deployment's conversion pipeline (incident to advisory rule to mechanical gate), promoted by a three-strike rule whose depth is set to a measured 0.1 percent over-block budget. (4) We argue that in multi-agent systems reliability comes not from adding verifier agents, each of which is another reflex surface with correlated failures, but from provenance-typed handoff contracts that a non-agent mediator checks mechanically at every seam. The unifying frame is classical (complete mediation and the reference monitor) applied to a new object. The mediated object is the agent's own output, so the agent is at once the subject the monitor controls and part of the object it inspects. That reflexivity is why an agent may author a gate but must never be one, and, one level up, why an agent cannot be its own auditor, a constraint we make operational with a tamper-evident measurement instrument whose numbers a recipient re-derives rather than trusts. Scope: the empirical results and the enforcement mechanism concern assertions that carry a mechanically checkable type (a URL, a path, a resource handle); open-ended generative output, which has no such anchor, is out of scope.

Bullet summary

  • The paper investigates reflex-class violations in Large Language Model (LLM) agents, where agents know but fail to enforce behavioral rules at decision time, a control gap rather than a knowledge gap.
  • An empirical study of a long-term agent orchestration deployment records 62 such rule violations, primarily in categories like communication, verification, and honesty, rather than resource-access safety; enforcement aligns with gateability rather than rule...
  • Through six pilot studies, the authors find that prompt presence influences rule compliance in rule-sensitive models, but prompt strength does not; compliance varies by model, with some models ignoring context rules regardless of position or emphasis.
  • A novel compaction probe reveals that enforcement effect is model-dependent and sharply declines within 50 tokens distance from the rule context, influenced by how faithfully the rule is presented at decision time.
  • They develop a reproducible reforging-audit methodology, implement an enforcement stack with a three-strike policy balancing blocking depth against a 0.1% over-block budget, and quantify the pipeline from incidents to mechanical gates.
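
The abstract's central proposal, a non-agent mediator that mechanically checks provenance-typed handoff contracts, could be sketched as below. This is an illustrative guess at the shape of such a gate, not the paper's enforcement stack: the `Assertion` fields, the type tags, and the rule that a claimed value must appear verbatim in a tool result the mediator itself logged are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List
from urllib.parse import urlparse


@dataclass(frozen=True)
class Assertion:
    kind: str       # hypothetical type tag, e.g. "url" or "path"
    value: str      # the concrete value the agent asserts
    source_id: str  # id of the tool result the agent says it came from


def check_handoff(assertions: List[Assertion], tool_log: Dict[str, str]) -> List[str]:
    """Return a list of violations; an empty list means the handoff may pass.

    This is plain deterministic code with no model in the loop, so the gate
    cannot itself "forget" the rule it enforces.
    """
    violations: List[str] = []
    for item in assertions:
        observed = tool_log.get(item.source_id)
        if observed is None:
            violations.append(f"{item.kind} {item.value!r}: unknown source {item.source_id!r}")
        elif item.value not in observed:
            violations.append(f"{item.kind} {item.value!r}: not present in source {item.source_id!r}")
        elif item.kind == "url" and urlparse(item.value).scheme not in ("http", "https"):
            violations.append(f"url {item.value!r}: unsupported scheme")
    return violations
```

The mediator, not the agent, owns `tool_log`, so a fabricated URL or path fails the check even when the agent states it with full confidence.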

FADP: A Sovereignty-Native Payment Protocol for Autonomous Agent Transactions

Merged record merged scholarly record OpenAlex Trust and Identity Governance and Policy

Abhijeeth Ganji, Priyanka Velpula

Published 2026-06-05

Venue: OpenAlex

DOI: https://doi.org/10.33774/coe-2026-mchtz

Reviewer: The paper addresses security concerns in autonomous AI agents, specifically relating to payment protocols which involve identity attestation, self-custody, and cryptographic guarantees. These aspects align well with multi-agent security topics such as trust boundaries and control problems in systems composed of multiple interacting AI agents. While the focus is on payment infrastructure rather than direct attack or defense mechanisms, the formal security properties and trust model contribute to multi-agent security research. Therefore, the paper fits the requested topic reasonably well.

Open source record

Abstract

Autonomous AI agents (software systems that independently execute tasks, trade digital assets, and consume services) require machine-native payment infrastructure with formal cryptographic guarantees. Existing protocols (OAuth, JWT, API keys) provide no binding between payment and agent identity, leaving a critical gap in the emerging agentic economy. This paper introduces FADP (Fluid Agentic Payment Protocol), the first HTTP-native two-phase payment protocol designed for autonomous agent transactions with provable self-custody guarantees. FADP extends RFC 7231's HTTP 402 status code with three header namespaces (X-FADP-*, X-Pauli-*, and X-FLDP-*) and a seven-key architecture partitioned across three trust levels, where private keys satisfy strict locality: ∀k ∈ K_local, k ∉ Network. Every payment proof π is constructed as π = Groth16.Prove(PK, w) bound to a four-dimensional nonce vector N₄ = (n_time, n_chain, n_req, n_agent), guaranteeing uniqueness, unforgeability, and replay impossibility simultaneously. We formally prove four theorems: protocol correctness, liveness independence (L ⊥ S), replay impossibility, and identity-payment binding. We also introduce three novel metrics: Authentication Round-Trip Count (ART), Payment Atomicity Score (PAS), and Sovereignty Inheritance (SI), with provable sovereignty score Σ = 5. The reference implementation deploys on Base Mainnet, achieving a median latency of 160–215 ms and a per-call cost of $0.001–$0.01 in stablecoin; the protocol has been submitted to the IETF as draft-fluid-fadp-01. To our knowledge, FADP is the first protocol coupling cryptographic identity attestation, payment finality, and self-custody (Σ = 5) in a single HTTP response.

Bullet summary

  • Introduces FADP (Fluid Agentic Payment Protocol), the first HTTP-native two-phase payment protocol tailored for autonomous AI agent transactions, combining cryptographic identity attestation with payment finality in a single HTTP response.
  • Extends HTTP 402 status code through three custom header namespaces (X-FADP-*, X-Pauli-*, X-FLDP-*) that separately handle payment challenges, zero-knowledge identity proofs, and request signing, enabling verifiable identity-payment linkage via headers alone.
  • Employs a seven-key architecture partitioned across three trust levels, ensuring strict self-custody by keeping private keys local to the agent device and safeguarding against server compromise and unauthorized fund access.
  • Uses Groth16 zk-SNARK proofs bound to a four-dimensional nonce vector so that each payment proof is unique, unforgeable, and non-replayable.
  • Formally proves four key theorems establishing protocol correctness, liveness independence, replay protection, and strong identity-payment binding, underlining rigorous security foundations.
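
To make the replay-protection idea concrete, the sketch below shows a verifier-side cache that accepts each nonce vector at most once. It is an illustration under stated assumptions, not FADP's construction: the field types, the freshness window, and the `ReplayGuard` class are hypothetical, and no Groth16 proof generation or verification is performed.

```python
import time
from typing import NamedTuple, Optional


class NonceVector(NamedTuple):
    """The four nonce components named in the abstract; field types are assumptions."""
    n_time: int   # assumed: Unix timestamp in seconds
    n_chain: int  # assumed: chain-derived value such as a block number
    n_req: str    # assumed: per-request random identifier
    n_agent: str  # assumed: agent identifier


class ReplayGuard:
    """Hypothetical verifier-side cache that accepts each nonce vector at most once."""

    def __init__(self, max_age_seconds: int = 60) -> None:
        self.max_age_seconds = max_age_seconds
        self._seen = set()

    def accept(self, nonce: NonceVector, now: Optional[int] = None) -> bool:
        """Return True only for a fresh, previously unseen nonce vector."""
        now = int(time.time()) if now is None else now
        if abs(now - nonce.n_time) > self.max_age_seconds:
            return False  # stale or future-dated timestamp
        if nonce in self._seen:
            return False  # exact vector already used: treat as a replay
        self._seen.add(nonce)
        return True
```

A production verifier would also evict expired entries and check the proof itself; the sketch only shows why binding a proof to a multi-component nonce lets a server reject a resubmitted request.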

Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

arXiv preprint arXiv Governance and Policy Trust and Identity

Thamilvendhan Munirathinam

Published 2026-06-04

Venue: arXiv

Reviewer: The paper addresses governance and control mechanisms for autonomous LLM agents operating with real credentials and infrastructure access, focusing on a cooperative in-band deny signal called the Recuse Signal. This relates to multi-agent security concerns including cooperative compliance, governance controls, and interaction protocols. While it does not focus on traditional attack or defense methods, it deals with control problems and agent coordination in multi-agent systems, thus fitting the requested topic reasonably well.

Open source record

Abstract

As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal -- the Recuse Signal -- that a server emits over a protocol's existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini-standard, implement two zero- or low-footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire-protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT-4o and GPT-4o-mini; and Claude Code as a deployed agent), the signal cleanly induces recusal -- 100% recusal when present versus 100% task completion in a no-signal control -- and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator-authorization framing flips the most capable model to proceed, while other agents continue to defer to the on-host policy. We release the standard, adapters, and experiment harness for reproduction.

Bullet summary

  • Introduced the Recuse Signal, an in-band, lightweight deny signal emitted by servers (e.g., SSH banner, PostgreSQL NOTICE) to request autonomous LLM agents voluntarily withdraw from off-limit resources, serving as a cooperative governance control rather tha...
  • Developed two zero- or low-footprint adapters—a PAM hook with SSH banner and a PostgreSQL wire-protocol proxy—to deploy the Recuse Signal without changing existing server infrastructure, validated on a live production host.
  • Conducted controlled experiments with deployed LLM agents (OpenAI GPT-4o, GPT-4o-mini, Claude Code) performing benign tasks, showing 100% recusal when the Recuse Signal was present contrasted with 100% task completion when absent, empirically measuring agen...
  • Demonstrated that the Recuse Signal functions as a cooperative, overridable mechanism: the most capable LLM agents could override the recusal upon explicit operator authorization, whereas other agents consistently deferred to the signal reflecting on-host p...
  • Positioned the Recuse Signal as an orchestration and governance improvement—enabling explicit operator intent communication and auditability—without being a hard security control, acknowledging potential for malicious agents to ignore or abuse the signal.
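
On the agent side, honoring such a signal amounts to inspecting the text the server sends before any work begins. The sketch below is a minimal illustration: the `AGENT-RECUSE` token is a placeholder and not the wire format defined by the paper's standard, and the operator-override behavior is one possible policy (the pilot reports that agents differed on this point).

```python
# Placeholder token; NOT the format defined by the paper's published standard.
RECUSE_MARKER = "AGENT-RECUSE"


def should_withdraw(banner: str, operator_authorized: bool = False) -> bool:
    """Return True if a cooperative agent should disconnect after reading the banner."""
    signal_present = any(
        line.strip().upper().startswith(RECUSE_MARKER)
        for line in banner.splitlines()
    )
    # Cooperative, not absolute: in this sketch an explicit operator
    # authorization overrides the server's request to withdraw.
    return signal_present and not operator_authorized
```

Because the check is voluntary, it offers no protection against an agent that skips it, which matches the paper's framing of the signal as a governance control and not a security boundary.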

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

Merged record merged scholarly record arXiv Benchmarks and Evaluation Orchestration Risk

Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao

Published 2026-06-04

Venue: arXiv

Reviewer: The paper focuses on multi-agent systems of large language models and investigates their collaborative competence, including coordination and maintaining shared understanding, which relates to aspects of multi-agent interaction and potential security concerns such as coordination failures and trust boundaries. Although the paper is grounded in CSCW and does not explicitly emphasize security risks or attacks, its emphasis on collaboration, coordination, and interaction failures is relevant to the requested topic of multi-agent security research. Thus, it fits moderately well.

Open source record

Abstract

Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent work suggests that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.

Bullet summary

  • Introduces CollabSim, a configurable simulation framework grounded in Computer-Supported Cooperative Work (CSCW) research, designed to systematically evaluate collaborative competence of Large Language Model (LLM) agents in multi-agent systems.
  • Highlights a gap in current multi-agent system evaluations, which often focus on task outcomes or single-agent proficiency but overlook collaborative process dynamics such as establishing common ground and repairing misalignments.
  • Leverages classic CSCW experimental paradigms by manipulating interaction conditions like communication bandwidth, information visibility, and group size to study their effects on agent collaboration.
  • Incorporates a probing module that captures agents' internal mental models and reasoning confidence at an action-level granularity, enabling deeper insight beyond observable behaviors.
  • Validates the framework across four collaborative tasks (Shape Factory, DayTrader, Hidden Profile, Map Task), demonstrating CollabSim's effectiveness at detecting condition effects, distinguishing model performance patterns, and revealing task-dependent col...
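
The controlled manipulation of interaction conditions can be pictured as enumerating experimental cells over the manipulated factors. The sketch below assumes a full factorial design, which the summary does not state; the `Condition` fields and their levels are illustrative and are not CollabSim's actual configuration schema.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Condition:
    """One experimental cell; field names and levels are illustrative assumptions."""
    bandwidth: str    # e.g. "low" or "high" communication bandwidth
    visibility: str   # e.g. "private" or "shared" information visibility
    group_size: int   # number of agents in the team


def condition_grid(bandwidths, visibilities, group_sizes):
    """Enumerate every combination of the three manipulated factors."""
    return [
        Condition(b, v, g)
        for b, v, g in product(bandwidths, visibilities, group_sizes)
    ]
```

For example, two bandwidth levels, two visibility levels, and three group sizes yield twelve cells, each of which would be run against every task and model under study.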

WebMCP Tool Surface Poisoning: Runtime Manipulation Attacks on LLM Agents

Merged record merged scholarly record arXiv Orchestration Risk Memory Poisoning Governance and Policy

Lin-Fa Lee, Yi-Yu Chang, Chia-Mu Yu, Kuo-Hui Yeh

Published 2026-06-04

Venue: arXiv

Reviewer: The paper addresses security risks and runtime manipulation attacks on AI agents via tool surface poisoning in WebMCP protocol, which is directly related to multi-agent security concerns including attacks, vulnerabilities, and defenses in systems with multiple interacting AI agents. It also discusses attack types like Tool Hijacking and Tool Framing affecting agents, and outlines mitigation strategies relevant to governance and control problems. This aligns well with the requested topic of multi-agent security research.

Open source record

Abstract

WebMCP is a newly emerging protocol that enables websites to expose tools directly to AI agents, bypassing traditional user interfaces and introducing new security risks. The dynamic exposure of agent-accessible tools in WebMCP expands the attack surface of web sessions, especially when third-party scripts are involved. In this study, we identify a new potential threat, termed Mid-Session Tool Injection (MSTI), in which attackers leverage third-party scripts to inject malicious tools during an active session. To better characterize this threat, we classify MSTI based on the stage and target of manipulation, distinguishing between Tool Hijacking and Tool Framing. Tool Hijacking modifies the set of tools visible to the agent through mechanisms such as the AbortSignal API or race conditions during tool registration. In contrast, Tool Framing influences the agent's perception of tool roles through metadata fields such as tool name, description, readOnlyHint, and inputSchema. Our implementation demonstrates that both Tool Hijacking and Tool Framing can successfully disrupt the intended functionality of WebMCP. Based on these results, we outline potential mitigation directions and provide security design recommendations for WebMCP, including binding tool identity to its origin, ensuring lifecycle consistency, enforcing data boundaries for third-party tools, and maintaining traceable logs of tool registration and invocation. These findings indicate that MSTI arises from WebMCP's unique tool lifecycle and structured metadata, making the tool surface itself an emerging security concern.

Bullet summary

  • WebMCP is a protocol that lets websites dynamically expose tools directly to AI agents, which expands the attack surface during active sessions.
  • The study identifies a new security threat, Mid-Session Tool Injection (MSTI), where attackers use third-party scripts to inject malicious tools during WebMCP sessions.
  • MSTI attacks are classified into Tool Hijacking, which alters the legitimate tool set exposed to agents, and Tool Framing, which manipulates tool metadata affecting agent perception.
  • Experiments with state-of-the-art LLMs show that MSTI attacks can stealthily cause agents to invoke malicious tools, leak data, or deviate from intended workflows without detection.
  • Attack success depends on timing (early injection before first tool invocation) and on leveraging metadata fields such as description and readOnlyHint for semantic framing.
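
Two of the mitigation directions named in the abstract, binding tool identity to its origin and keeping a traceable log of registration and invocation, can be sketched as a session-scoped registry. This is a hedged illustration in Python, not WebMCP code: `ToolRegistry` and its methods are hypothetical and do not correspond to any WebMCP API, and blocking every third-party tool is a stricter policy than the data-boundary enforcement the authors describe.

```python
from dataclasses import dataclass, field


@dataclass
class ToolRegistry:
    """Hypothetical session-scoped registry; not part of any WebMCP API."""
    page_origin: str
    owners: dict = field(default_factory=dict)  # tool name -> registering origin
    log: list = field(default_factory=list)     # (event, tool name, origin) tuples

    def register(self, name: str, origin: str) -> bool:
        """Bind a tool name to the first origin that registers it."""
        owner = self.owners.get(name)
        if owner is not None and owner != origin:
            # A different origin is trying to replace an existing tool mid-session.
            self.log.append(("rejected", name, origin))
            return False
        self.owners[name] = origin
        self.log.append(("registered", name, origin))
        return True

    def may_invoke(self, name: str) -> bool:
        """Allow invocation only for tools registered by the page's own origin."""
        owner = self.owners.get(name)
        allowed = owner == self.page_origin
        self.log.append(("invoked" if allowed else "blocked", name, owner))
        return allowed
```

The append-only log gives an auditor the sequence of registrations, rejections, and invocations, which is the kind of trace needed to spot a tool that appeared or changed hands mid-session.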