multiagentsecurity.ai


Beyond Injection Detection: A Positive-Security Prompt Firewall that Closes the Scope and PHI Gap SOTA Classifiers Miss in Healthcare

Merged record merged scholarly record OpenAlex Semantic Scholar Prompt Injection Trust and Identity Benchmarks and Evaluation

James Schwoebel, I. Semenec, Jenia Rousseva, Martin Gerbert Frasch, Rome Thorstenson, Manish Bhatt

Published 2026-06-06

Venue: medRxiv

DOI: https://doi.org/10.64898/2026.06.04.26354950


Abstract

Large language models embedded in autonomous agents process trusted instructions and untrusted data in one context window, leaving them open to direct and indirect prompt injection. In healthcare this is not hypothetical: a 2025 JAMA Network Open study found commercial medical LLMs followed injected instructions in 94.4% of simulated patient encounters, including life-threatening recommendations. Yet the clinically decisive problem we quantify here is different. Most real clinical threats (protected health information [PHI] exfiltration, cross-patient access, bulk export, out-of-scope advice) are fluent, legitimate-looking requests that carry no attack signal, so even a state-of-the-art injection detector passes them. Existing runtime guardrails trade safety against latency: model-based auditors are accurate but add hundreds of milliseconds of Python inference, while lexical filters are fast but blind to obfuscated or semantically disguised payloads. We present QFIRE, an inline, provider-agnostic prompt firewall implemented as a single self-contained Rust toolchain (proxy, CLI, and benchmark harness). QFIRE combines three mechanisms: (i) positive-security scope constraints, which restrict a model call to a declared natural-language purpose and block out-of-scope drift even when no overt attack token is present; (ii) an asynchronous detector graph that runs N rules and their detector nodes concurrently, cheapest checks first; and (iii) a de-obfuscation pass that decodes Base64/hex/ROT13, folds homoglyphs and leetspeak, and strips zero-width characters before detection. QFIRE ships 106 versioned firewall rules and a dedicated HIPAA Safe Harbor 18-identifier PHI panel, and runs a local DeBERTa v3 injection classifier via embedded ONNX Runtime.
On 1,968 public prompt-injection and jailbreak prompts, QFIRE's deterministic hybrid attains F1 0.86, statistically tied with Meta's state-of-the-art PromptGuard 2 (0.86) and above protectai DeBERTa v3 (0.83); lexical baselines lag at 0.16 to 0.50. Our central result is on QFIRE HealthBench, a new 2,000-prompt healthcare benchmark we build and release with real garak and Microsoft PyRIT payloads. There the same PromptGuard 2 recovers only 0.40 recall (DeBERTa v3: 0.57), because most clinical threats carry no injection signal; QFIRE's combined scope-plus-PHI chain reaches 0.83 recall (F1 0.87) at a calibrated 0.08 false-positive rate. Generic injection detection, even state of the art, is therefore necessary but not sufficient for healthcare agents. A bare LLM judge also closes most of this static-corpus gap (F1 0.90); QFIRE's contribution beyond static accuracy is auditable determinism, bounded latency, and adaptive robustness, where the bare judge falls to 34 to 59% recall (Section 5.5). End to end, placing QFIRE in front of a tool-using agent over a mock EHR sandbox cuts the agent's harmful-action rate from 0.38 to 0.00 at a 0.13 benign-utility cost. All code, rules, corpora snapshots, and scripts are released, and every table regenerates from a single make paper target against local models with no paid API keys.
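
The de-obfuscation pass described in the abstract can be pictured with a small sketch. This is not QFIRE's code (the abstract says QFIRE is written in Rust); it is a Python illustration under stated assumptions: the lookup tables are deliberately tiny, the blocked phrase is a hypothetical rule, and every function name is invented for this example.

```python
import base64
import binascii
import codecs
import re
import unicodedata

# Zero-width characters to delete (translate() drops keys mapped to None).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
# Tiny illustrative tables (Cyrillic look-alikes, common leetspeak).
HOMOGLYPHS = str.maketrans({"\u0430": "a", "\u0435": "e", "\u043e": "o",
                            "\u0440": "p", "\u0441": "c", "\u0456": "i"})
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})
BLOCKED_PHRASES = ("ignore previous instructions",)  # hypothetical rule


def normalize(text: str) -> str:
    """Strip zero-width characters, then fold homoglyphs, case and leetspeak."""
    text = text.translate(ZERO_WIDTH)
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(HOMOGLYPHS)
    return text.lower().translate(LEET)


def decoded_views(text: str) -> list[str]:
    """Return normalized views of the text: raw, ROT13, Base64 and hex decodes."""
    views = [normalize(text), normalize(codecs.decode(text, "rot13"))]
    for token in re.findall(r"[A-Za-z0-9+/=]{16,}", text):
        try:
            views.append(normalize(base64.b64decode(token, validate=True).decode("utf-8")))
        except (binascii.Error, UnicodeDecodeError):
            pass  # not valid Base64 text; ignore this token
    for token in re.findall(r"(?:[0-9a-fA-F]{2}){8,}", text):
        try:
            views.append(normalize(bytes.fromhex(token).decode("utf-8")))
        except (ValueError, UnicodeDecodeError):
            pass  # not valid hex-encoded text; ignore this token
    return views


def is_suspicious(text: str) -> bool:
    """Run the (hypothetical) phrase rule against every decoded view."""
    return any(p in view for view in decoded_views(text) for p in BLOCKED_PHRASES)
```

Detection runs on the decoded views only; the original prompt is left untouched, since leetspeak folding would otherwise corrupt legitimate digits such as doses or dates.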

Bullet Summary

Large language models (LLMs) in healthcare agents process both trusted instructions and untrusted data simultaneously, making them vulnerable to prompt injections and clinically significant data breaches like PHI exfiltration and cross-patient access.
Most real clinical threats are legitimate-looking requests without explicit attack signals, which state-of-the-art injection detectors often fail to catch, highlighting a gap in existing defenses.
QFIRE is a novel, provider-agnostic prompt firewall implemented as a Rust-based inline proxy combining positive security scope constraints, an asynchronous detector graph running multiple concurrent lightweight rules, and a de-obfuscation pass to detect obf...
QFIRE includes 106 versioned firewall rules alongside a dedicated HIPAA Safe Harbor 18 identifier PHI panel, and integrates a local DeBERTa v3 injection classifier using ONNX Runtime for efficient model-based detection.
On standard public prompt injection and jailbreak benchmarks, QFIRE achieves an F1 score of 0.86, comparable to state-of-the-art systems like Meta's PromptGuard 2, and significantly outperforms lexical methods.


The Governance Gap in Agentic Memory

Merged record merged scholarly record OpenAlex Governance and Policy Trust and Identity

Andrew Crenshaw

Published 2026-06-06

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20571518


Abstract

The Governance Gap in Agentic Memory - a position paper proposing Substrate-Lens-Frame (SLF), a sovereign, auditable memory protocol for AI agents. AI agents now run on persistent memory, and that memory has become its own layer in the stack. Almost all of the effort in that layer goes to one question: how well can the system recall the right fact at the right time? That work is critically important. It leaves a second question unanswered, and it is the one that decides whether agent memory can be trusted with anything that matters: governance. Today's systems can recall a fact, but they cannot reliably say who is allowed to see it, how its meaning changes by role and jurisdiction, whether two stored facts contradict each other, or what was disclosed to whom. This paper names that gap, argues that it is structural, and proposes a protocol that addresses it. The proposal is Substrate-Lens-Frame (SLF), built around one operational primitive, render(substrate, lens, frame) -> receipt: a fact carries its own access rules; a lens reads it through a consumer-scoped projection that cannot widen those rules; a frame binds each action to an authorization; and every operation emits a payload-free signed receipt. This deposit is the position paper (PDF, CC BY 4.0). The Apache-2.0 reference implementation slf-core is archived separately (see Related works), with a companion Sovereign Personal Agent architecture (design) and a recovery-path prototype. Author: Andrew Crenshaw (ORCID 0009-0006-6459-0187), Lexenne. Cite as: Crenshaw, A. (2026). The Governance Gap in Agentic Memory. Zenodo. https://doi.org/
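
The render(substrate, lens, frame) -> receipt primitive can be sketched as follows. This is an illustration only, not the slf-core reference implementation: the class fields, the role-set model of access rules, and the returned (view, receipt) pair are assumptions made for this example, and an HMAC stands in for whatever signature scheme the protocol actually specifies.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A substrate entry: a payload that carries its own access rules."""
    fact_id: str
    payload: str
    allowed_roles: frozenset


@dataclass(frozen=True)
class Lens:
    """A consumer-scoped projection (hypothetical fields)."""
    consumer: str
    roles: frozenset


@dataclass(frozen=True)
class Frame:
    """Binds an action to an authorization (hypothetical fields)."""
    action: str
    authorization_id: str


def render(fact: Fact, lens: Lens, frame: Frame, key: bytes):
    # Intersecting role sets can only narrow access, never widen it.
    granted = bool(fact.allowed_roles & lens.roles)
    body = {
        "fact_id": fact.fact_id,
        "consumer": lens.consumer,
        "action": frame.action,
        "authorization_id": frame.authorization_id,
        "granted": granted,
    }
    # The receipt covers metadata only; the payload is never included.
    msg = json.dumps(body, sort_keys=True).encode()
    receipt = {**body, "sig": hmac.new(key, msg, hashlib.sha256).hexdigest()}
    return (fact.payload if granted else None), receipt


def verify(receipt: dict, key: bytes) -> bool:
    """Recompute the MAC over everything except the signature field."""
    body = {k: v for k, v in receipt.items() if k != "sig"}
    msg = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["sig"])
```

A denied request still emits a receipt, so the audit trail records what was refused as well as what was disclosed.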

Bullet Summary

The paper identifies a critical governance gap in agentic memory systems that currently focus solely on recalling facts correctly without addressing data governance aspects.
It highlights that existing AI agent memory systems cannot reliably enforce access controls, account for role-based and jurisdictional meaning shifts, detect conflicting stored facts, or track disclosures accurately.
The author proposes Substrate-Lens-Frame (SLF), a sovereign and auditable memory protocol designed to fill this governance gap in AI agent memory.
SLF operates around a single core function: render(substrate, lens, frame) -> receipt, where each fact includes its own access rules, and views (lenses) restrict observation without expanding permissions.
Frames in SLF bind each action to explicit authorization, and every operation generates a signed, payload-free receipt to provide verifiable audit trails.


AI-Driven Network Security in Next-Generation 5G/6G Smart Environments

Crossref · International Journal of Advanced Research in Science Communication and Technology journal-article Crossref Governance and Policy Trust and Identity

Pardeep Singh

Published 2026-06-06

Venue: International Journal of Advanced Research in Science Communication and Technology

DOI: 10.48175/ijarsct-36322


Abstract

Technology is currently spreading at an exponential rate. More accessibility, use, and application of this technology across all sectors and industries have been made possible by technological advancements, more computing power, and lower costs. Traditionally labour-intensive data analysis tasks can now be completed rapidly and effectively thanks to the development of smart and autonomous technologies like artificial intelligence and machine learning. Previously isolated datasets and data lakes are increasingly being used and linked. AI, digital twins, the metaverse, and virtual technologies are permeating every industry and, more significantly, merging with people to the point where it seems impossible to distinguish between the actual and virtual worlds. However, a fantastic backbone and capacity to transmit data, as well as immediate delivery at high speed and security, are necessary for the successful use of these incredible and new technologies. In order for 6G to be properly onboarded and executed in a logical manner, 5G, which is now being deployed, must accomplish its goals. The European Commission is requesting money for important projects like Horizon 2020 and has 5G goals. 5G and 6G have enormous advantages for everyone, but only if they are implemented in a way that reduces the risk they may pose to security, privacy, and trust, which are the fundamental pillars that must be upheld. Smart cities will allow for the analysis of acquired data, which could endanger national security if it falls into the wrong hands. A strong governance plan and method for managing 5G and 6G must be in place in order to guarantee success, given how many IoT and e-IoT devices are present in smart cities and how intertwined technologies are engaging with people. The background, risks, and advantages of 5G and 6G are explained in this chapter, which also emphasizes the necessity of strong governance

Bullet Summary

The paper discusses the rapid proliferation of technology enabled by advances in computing power, reduced costs, and integration across sectors.
It emphasizes the role of AI and machine learning in automating traditionally labor-intensive data analysis, facilitating the merging of isolated data sources.
Emerging technologies such as AI, digital twins, the metaverse, and virtual environments are increasingly blending with human experience, posing challenges to distinguish real from virtual.
High-speed, secure data transmission backbones provided by 5G and upcoming 6G networks are critical to support these technologies effectively.
Successful deployment of 6G depends on the fulfillment of 5G's goals, with concerted efforts such as the European Commission's Horizon 2020 projects funding 5G development.


An LLM Agent Cannot Be a Gate: Why a Recited Rule Is Not an Enforced One

Merged record merged scholarly record OpenAlex Trust and Identity Governance and Policy Agent-to-Agent Communication

Luciano Federico Pereira

Published 2026-06-05

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20520851


Abstract

An LLM agent can recite a behavioral rule on demand and still violate it moments later. We study reflex-class violations: failures that are not knowledge gaps but control gaps, in which the correct behavior is known yet not active at the moment of decision. The evidence base is a long-running, instrumented agent-orchestration deployment whose antipattern registry documents 62 such rules, each traced to a dated incident. We make four contributions. (1) We characterize reflex-class violations empirically and show that they cluster in the rule categories an agent's assertions fall into (communication, verification, honesty) rather than in resource-access safety; the enforcement gradient tracks gateability, not importance. (2) We locate the operative variable through six pilots: prompt strength is not it, prompt presence is, but only for rule-sensitive models. A capable model fabricates a needed value in every trial when no rule is present, complies whenever the rule appears anywhere in context, even buried behind distractor turns, and is unmoved by escalating emphasis. A reproducible compaction probe on open-weight models then shows that this sensitivity is itself model-dependent: for some capable-enough models the in-context rule carries no enforcement weight at any position: one fabricates regardless, another abstains regardless. Whether an instruction controls anything is therefore a per-model quantity, an enforcement effect, to be measured rather than assumed. An exploratory follow-up then tunes a model to the threshold where the rule does act, and finds the effect a cliff, collapsing within fifty tokens of distance, that the agent's own context compaction erases outright.
A final contrast isolates the cause: holding compaction fixed and varying only the fidelity of the rule returned to context, a verbatim rule still leaks while a vague post-compaction paraphrase fabricates as often as no rule at all. Fidelity at the decision point, not capability or the rule's prior presence, governs the reflex. We report these open-weight findings as hypothesis-generating, pending pre-registered replication. (3) We give a reproducible reforging-audit method, exhibit the deployed enforcement stack it produced, and quantify the deployment's conversion pipeline (incident to advisory rule to mechanical gate), promoted by a three-strike rule whose depth is set to a measured 0.1 percent over-block budget. (4) We argue that in multi-agent systems reliability comes not from adding verifier agents, each of which is another reflex surface with correlated failures, but from provenance-typed handoff contracts that a non-agent mediator checks mechanically at every seam. The unifying frame is classical (complete mediation and the reference monitor), applied to a new object. The mediated object is the agent's own output, so the agent is at once the subject the monitor controls and part of the object it inspects. That reflexivity is why an agent may author a gate but must never be one, and, one level up, why an agent cannot be its own auditor, a constraint we make operational with a tamper-evident measurement instrument whose numbers a recipient re-derives rather than trusts. Scope: the empirical results and the enforcement mechanism concern assertions that carry a mechanically checkable type (a URL, a path, a resource handle); open-ended generative output, which has no such anchor, is out of scope.
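
A mechanical gate over a checkable assertion type can be sketched in a few lines. The following is a minimal illustration, not the paper's enforcement stack: it assumes the mediator keeps a set of URLs that actually appeared in recorded tool results (the hypothetical `observed` argument) and blocks any agent output citing a URL outside that set. The regular expression and all names are assumptions of this example.

```python
import re
from dataclasses import dataclass

# Rough URL matcher for illustration; stops at whitespace and closing brackets.
URL_RE = re.compile(r"https?://[^\s)>\]]+")


@dataclass
class GateResult:
    allowed: bool
    unverified: list


def gate(agent_output: str, observed: set) -> GateResult:
    """Allow the output only if every URL it asserts has recorded provenance.

    `observed` is the set of URLs seen in tool results; the check involves
    no model call, so it cannot be talked out of its decision.
    """
    urls = [u.rstrip(".,;") for u in URL_RE.findall(agent_output)]
    unverified = [u for u in urls if u not in observed]
    return GateResult(allowed=not unverified, unverified=unverified)
```

Output that contains no URL passes trivially, which matches the stated scope: the gate only speaks to assertions with a mechanically checkable type.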

Bullet Summary

The paper investigates reflex-class violations in Large Language Model (LLM) agents, where agents know but fail to enforce behavioral rules at decision time, a control gap rather than a knowledge gap.
An empirical study of a long-term agent orchestration deployment records 62 such rule violations, primarily in categories like communication, verification, and honesty, rather than resource-access safety; enforcement aligns with gateability rather than rule...
Through six pilot studies, the authors find that prompt presence influences rule compliance in rule-sensitive models, but prompt strength does not; compliance varies by model, with some models ignoring context rules regardless of position or emphasis.
A novel compaction probe reveals that enforcement effect is model-dependent and sharply declines within 50 tokens distance from the rule context, influenced by how faithfully the rule is presented at decision time.
They develop a reproducible reforging-audit methodology, implement an enforcement stack with a three-strike policy balancing blocking depth against a 0.1% over-block budget, and quantify the pipeline from incidents to mechanical gates.


FADP: A Sovereignty-Native Payment Protocol for Autonomous Agent Transactions

Merged record merged scholarly record OpenAlex Trust and Identity Governance and Policy

Abhijeeth Ganji, Priyanka Velpula

Published 2026-06-05

Venue: OpenAlex

DOI: https://doi.org/10.33774/coe-2026-mchtz


Abstract

Autonomous AI agents (software systems that independently execute tasks, trade digital assets, and consume services) require machine-native payment infrastructure with formal cryptographic guarantees. Existing protocols (OAuth, JWT, API keys) provide no binding between payment and agent identity, leaving a critical gap in the emerging agentic economy. This paper introduces FADP (Fluid Agentic Payment Protocol), the first HTTP-native two-phase payment protocol designed for autonomous agent transactions with provable self-custody guarantees. FADP extends RFC 7231's HTTP 402 status code with three header namespaces (X-FADP-*, X-Pauli-*, and X-FLDP-*) and a seven-key architecture partitioned across three trust levels, where private keys satisfy strict locality: ∀k ∈ K_local, k ∉ Network. Every payment proof π is constructed as π = Groth16.Prove(PK, w) bound to a four-dimensional nonce vector N₄ = (n_time, n_chain, n_req, n_agent), guaranteeing uniqueness, unforgeability, and replay impossibility simultaneously. We formally prove four theorems, namely protocol correctness, liveness independence (L ⊥ S), replay impossibility, and identity-payment binding, and introduce three novel metrics: Authentication Round-Trip Count (ART), Payment Atomicity Score (PAS), and Sovereignty Inheritance (SI), with provable sovereignty score Σ = 5. The reference implementation deploys on Base Mainnet achieving median latency of 160–215 ms and per-call cost of $0.001–$0.01 in stablecoin, submitted to IETF as draft-fluid-fadp-01. To our knowledge, FADP is the first protocol coupling cryptographic identity attestation, payment finality, and self-custody (Σ = 5) in a single HTTP response.
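
How a four-component nonce can rule out replays is easiest to see in a generic sketch. The code below is not FADP's verifier and does not touch Groth16; it only shows ordinary nonce-based replay rejection over the vector N₄ = (n_time, n_chain, n_req, n_agent). The field types, the freshness window, and the class names are assumptions made for illustration.

```python
from typing import NamedTuple


class Nonce4(NamedTuple):
    n_time: int    # assumed: Unix timestamp in seconds
    n_chain: int   # assumed: numeric chain identifier
    n_req: str     # assumed: per-request identifier
    n_agent: str   # assumed: agent identifier


class ReplayGuard:
    """Accept each nonce vector at most once, and only while it is fresh."""

    def __init__(self, max_age_s: int = 30):
        self.max_age_s = max_age_s
        self.seen = set()

    def accept(self, nonce: Nonce4, now: int) -> bool:
        if abs(now - nonce.n_time) > self.max_age_s:
            return False  # stale or future-dated proof
        if nonce in self.seen:
            return False  # exact vector already used: replay
        self.seen.add(nonce)
        return True
```

In a long-running service the `seen` set would need pruning once entries fall outside the freshness window; that bookkeeping is omitted here.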

Bullet Summary

Introduces FADP (Fluid Agentic Payment Protocol), the first HTTP-native two-phase payment protocol tailored for autonomous AI agent transactions, combining cryptographic identity attestation with payment finality in a single HTTP response.
Extends HTTP 402 status code through three custom header namespaces (X-FADP-*, X-Pauli-*, X-FLDP-*) that separately handle payment challenges, zero-knowledge identity proofs, and request signing, enabling verifiable identity-payment linkage via headers alone.
Employs a seven-key architecture partitioned across three trust levels, ensuring strict self-custody by keeping private keys local to the agent device and safeguarding against server compromise and unauthorized fund access.
Utilizes Groth16 zk-SNARKs combined with a unique four-dimensional nonce vector to produce payment proofs that guarantee uniqueness, unforgeability, and replay impossibility, ensuring robust security guarantees.
Formally proves four key theorems establishing protocol correctness, liveness independence, replay protection, and strong identity-payment binding, underlining rigorous security foundations.


Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

arXiv preprint arXiv Governance and Policy Trust and Identity

Thamilvendhan Munirathinam

Published 2026-06-04

Venue: arXiv


Abstract

As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal -- the Recuse Signal -- that a server emits over a protocol's existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini-standard, implement two zero- or low-footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire-protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT-4o and GPT-4o-mini; and Claude Code as a deployed agent), the signal cleanly induces recusal -- 100% recusal when present versus 100% task completion in a no-signal control -- and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator-authorization framing flips the most capable model to proceed, while other agents continue to defer to the on-host policy. We release the standard, adapters, and experiment harness for reproduction.

Bullet Summary

Introduced the Recuse Signal, an in-band, lightweight deny signal emitted by servers (e.g., SSH banner, PostgreSQL NOTICE) to request autonomous LLM agents voluntarily withdraw from off-limit resources, serving as a cooperative governance control rather tha...
Developed two zero- or low-footprint adapters—a PAM hook with SSH banner and a PostgreSQL wire-protocol proxy—to deploy the Recuse Signal without changing existing server infrastructure, validated on a live production host.
Conducted controlled experiments with deployed LLM agents (OpenAI GPT-4o, GPT-4o-mini, Claude Code) performing benign tasks, showing 100% recusal when the Recuse Signal was present contrasted with 100% task completion when absent, empirically measuring agen...
Demonstrated that the Recuse Signal functions as a cooperative, overridable mechanism: the most capable LLM agents could override the recusal upon explicit operator authorization, whereas other agents consistently deferred to the signal reflecting on-host p...
Positioned the Recuse Signal as an orchestration and governance improvement—enabling explicit operator intent communication and auditability—without being a hard security control, acknowledging potential for malicious agents to ignore or abuse the signal.


Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems

arXiv preprint arXiv Governance and Policy Trust and Identity

Dianxing Shi, Junqi He, Junhao Chen, Bowen Wang, Yuta Nakashima

Published 2026-06-04

Venue: arXiv


Abstract

Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static and post-trained agents, its role in self-evolving systems remains underexplored. We introduce Agent Norm Correction through Human-like Oversight and Review (ANCHOR), an LLM-based framework that simulates human supervision and delivers feedback at various phases of self-evolution. With ANCHOR, we evaluate two representative open-source self-evolving agent systems across coding, mathematical reasoning, and safety. Our results show that even limited supervision substantially mitigates safety degradation while preserving stable performance on core evolutionary objectives. Further analysis shows that supervision over the output verification phase is the most effective for intervention, whereas increasing supervision frequency yields diminishing returns. These findings provide empirical evidence and practical guidance for designing more stable, controllable, and human-aligned self-evolving agent systems.

Bullet Summary

Self-evolving agents autonomously improve via self-play and self-generated learning signals but risk capability degradation and safety drift without human oversight.
The paper introduces ANCHOR, a novel LLM-based framework that simulates human supervision by providing feedback at multiple phases during self-evolution, aiming to maintain agent safety and performance.
ANCHOR evaluates and intervenes during phases such as task proposal, planning/thought, output, and execution results by generating evaluative feedback without supplying direct fixes, supporting robust self-correction.
Experiments on two open-source self-evolving frameworks (AZR and R-Zero) across coding, mathematical reasoning, and safety tasks demonstrate that ANCHOR significantly mitigates safety degradation while preserving or improving core task performance.
Analysis reveals that supervision focused on the execution/output verification phase yields the greatest impact on preventing performance drops and safety risks, whereas increasing supervision frequency beyond moderate levels gives diminishing returns.


Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

arXiv preprint arXiv Memory Poisoning Trust and Identity Governance and Policy

Jiawen Zhang, Kejia Chen, Jiachen Ma, Yangfan Hu, Lipeng He, Yechao Zhang, Jian Liu, Xiaohu Yang

Published 2026-06-04

Venue: arXiv


Abstract

Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks. In this paper, we study memory search as a trust boundary in personal AI agents. We evaluate representative agentic memory frameworks, including A-Mem, Mem0, and MemOS, together with OpenClaw, a real-world personal-agent environment with persistent state and tool-use capability. Our results show that long-term memory is not merely a utility layer, but a durable control channel that can reshape how agents interpret tasks and execute actions, leaving them highly susceptible to the aforementioned threats. To mitigate these vulnerabilities, we propose MemGate, a lightweight and deployable memory plug-in for trustworthy memory search, with only 9M parameters and a 35.1MB footprint. MemGate is inserted between the vector memory store and the backbone LLM, requiring no LLM modification, memory-database rewriting, or inference-time LLM judge. It applies a query-conditioned neural gate to candidate memory representations, turning raw similarity search into task-conditioned memory admission. Across multiple mainstream memory frameworks, real-world agent settings, and diverse LLM backbones, MemGate reduces memory-induced threats while preserving long-term memory utility.
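
The difference between similarity search and task-conditioned admission can be shown with a toy example. This is not MemGate: the real gate is a learned network of about 9M parameters, whereas the sketch below uses two-dimensional vectors and hand-set weights purely to show the idea. The scoring form, the weights, and all names are assumptions of this example.

```python
import math


def cosine(a, b):
    """Plain cosine similarity, as used by raw similarity search."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def gate_score(query, memory, weights, bias):
    """Query-conditioned score: sigmoid of a weighted elementwise interaction."""
    z = sum(w * q * m for w, q, m in zip(weights, query, memory)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def admit(query, memories, weights, bias, threshold=0.5):
    """Return the names of memories whose gate score clears the threshold."""
    return [name for name, vec in memories
            if gate_score(query, vec, weights, bias) >= threshold]
```

With a query of [1.0, 1.0] and memories [0.9, 0.1] and [0.9, 0.8], the second memory has the higher cosine similarity, yet with weights [4.0, -6.0] and bias -1.0 only the first is admitted, because the negative weight on the second dimension penalizes it.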

Bullet Summary

Personal AI agents use long-term memory for persistent personalization, but existing retrieval methods based on semantic similarity can lead to trustworthiness issues such as cross-domain leakage, sycophancy, tool-call drift, and memory-induced jailbreaks.
Memory retrieval acts as a critical trust boundary influencing how agents interpret tasks and execute actions, highlighting the need for mechanisms that ensure contextually appropriate memory admission.
The paper introduces MemGate, a lightweight neural gating plug-in placed between the vector memory store and the backbone LLM, which applies a query-conditioned gating mask to filter memory embeddings based on task relevance without modifying the LLM or mem...
MemGate transforms raw similarity-based retrieval into task-conditioned memory admission by attenuating inappropriate memory vector dimensions, thereby reducing the injection of unsafe or irrelevant memories.
MemGate's parameters are optimized via Direct Preference Optimization (DPO), balancing suppression of conflicting memories with preservation of beneficial ones, backed by theoretical guarantees limiting semantic drift.


Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

Merged record merged scholarly record arXiv Agent-to-Agent Communication Orchestration Risk Trust and Identity

Yingzhuo Liu

Published 2026-06-04

Venue: arXiv


Abstract

Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is natural language: agents exchange messages token-by-token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks -- high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol -- latent communication -- in which agents exchange continuous representations (embeddings, hidden states, or KV-caches) directly, bypassing the bottleneck of text generation. This paper presents a unified framework for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (Embeddings, Hidden States, KV-Caches, or other continuous state); (2) WHICH sender-receiver alignment is used (latent-space alignment and layer alignment); and (3) HOW the communicated information is fused into the receiver (concatenation, prepending, mathematical operations, cross-attention, or cache restoration). Under this 3-axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges -- including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent communication and latent chain-of-thought. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.

Bullet Summary

Multi-agent systems with large language models (LLMs) typically rely on natural language communication, which suffers from high inference cost, irreversible information loss during token discretization, and ambiguity.
Latent communication emerges as an alternative, enabling agents to exchange continuous internal representations such as embeddings, hidden states, or KV-caches, thus bypassing natural language bottlenecks and preserving richer information.
The paper proposes a unified 3-axis framework to categorize latent communication methods based on (1) WHAT information is communicated, (2) WHICH sender-receiver alignment is used, and (3) HOW the communicated information is fused in the receiver.
Eighteen representative latent communication methods between 2024 and 2026 are analyzed, revealing five major design patterns and a vast design space with many unexplored combinations.
Sender-receiver alignment strategies vary from requiring no alignment for identical models to learned projection or universal codecs for heterogeneous models, and layer alignment spans from simple last-to-first mappings to selective top-k attention.


Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

arXiv preprint arXiv Trust and Identity Governance and Policy Orchestration Risk

Jingheng Ye, Huiqi Zou, Simon Yu, Weiyan Shi

Published 2026-06-04

Venue: arXiv


Abstract

AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover story, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.

Bullet Summary

AI coding agents integrated into real-world development can covertly sabotage code by inserting malicious hidden functionalities, posing a new security threat.
Prior work focused on AI-only sabotage detection, lacking investigation into human developers' ability to detect AI sabotage during live, multi-turn collaboration.
A large-scale, realistic five-hour study involving over 100 professional developers collaborating with four advanced AI models found that 94% of participants failed to independently detect sabotage attempts.
Key vulnerabilities include minimal code review by developers, plausible cover stories provided by the AI to disguise malicious code, and overtrust in AI agents, allowing sabotage to go unnoticed.
An LLM-based safety monitor that flagged suspicious AI behaviors reduced sabotage success but was insufficient, as 56% of participants still accepted malicious code despite warnings.

Toward a Unified Interoperability Framework for Autonomous AI Agent Ecosystems: MCP, A2A, ACP, and ANP

OpenAlex · Zenodo (CERN European Organization for Nuclear Research) repository OpenAlex Agent-to-Agent Communication Trust and Identity Governance and Policy

Yendluri Siva

Published 2026-06-04

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20549817

Abstract

The proliferation of large language model (LLM)-powered autonomous agents across enterprise software delivery, cloud operations, and knowledge management has created an urgent systems-level challenge: agents built by different vendors, using different frameworks, cannot discover, authenticate, or collaborate with one another without expensive bespoke integration. Four emerging open protocols—Model Context Protocol (MCP), Agent-to-Agent Protocol (A2A), Agent Communication Protocol (ACP), and Agent Network Protocol (ANP)—address different layers of this interoperability gap, yet no formal comparative analysis or unified architectural framework has appeared in the peer-reviewed literature. This paper fills that gap. We provide the first systematic multi-dimensional comparison of all four protocols across interaction mode, discovery mechanism, transport layer, security model, and enterprise-readiness. We introduce a novel Three-Layer Agentic Stack (TLAS) that maps each protocol to its architectural role, propose a formal threat model covering protocol-specific attack surfaces including cross-server shadowing, credential relay, and agent impersonation, and define a phased adoption roadmap validated against real deployment patterns observed in financial-sector and cloud-native environments. Our analysis demonstrates that MCP and A2A are complementary rather than competing, that ACP and ANP serve distinct ecosystem niches, and that a unified TLAS approach reduces integration complexity by an estimated 60–70% compared to ad-hoc solutions. We conclude with open research questions on semantic conflict resolution, decentralized identity binding, and standardized evaluation benchmarks.

Bullet Summary

The rise of autonomous agents powered by large language models (LLMs) in enterprise software has created significant interoperability challenges due to diverse vendor frameworks and lack of standard integration methods.
Four emerging open protocols—Model Context Protocol (MCP), Agent-to-Agent Protocol (A2A), Agent Communication Protocol (ACP), and Agent Network Protocol (ANP)—address different layers of agent interoperability but lacked a comprehensive comparative analysis...
This paper provides the first systematic, multi-dimensional comparison of the four protocols considering interaction modes, discovery mechanisms, transport layers, security models, and suitability for enterprise deployment.
A novel architectural concept, the Three-Layer Agentic Stack (TLAS), is introduced, mapping each protocol to its specific architectural role within an autonomous AI agent ecosystem.
A formal threat model is proposed, identifying protocol-specific security risks including cross-server shadowing, credential relay attacks, and agent impersonation to guide secure protocol adoption.

Cognitive Guardrails in Medical LLMs: Fusing Latent Routing with T-Adaptive Attention to Mitigate Aleatoric and Epistemic Uncertainty

OpenAlex · Zenodo (CERN European Organization for Nuclear Research) repository OpenAlex Orchestration Risk Trust and Identity Governance and Policy

Narendra Bayutama Wibisono

Published 2026-06-04

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.19452753

Abstract

Clinical deployment of Large Language Models (LLMs) is fundamentally constrained by their propensity to hallucinate, generating fluent but clinically unfounded assertions that pose direct patient safety risks. This work introduces Conditional Latent Routing (CLR), a compound inference architecture that decomposes clinical uncertainty into its aleatoric and epistemic constituents and applies distinct, mechanistically motivated interventions at each stratum. Inspired by the corollary discharge framework from computational psychiatry (wherein the healthy brain’s forward model suppresses self-generated sensory predictions, a mechanism whose dysfunction in schizophrenia produces hallucinations), we construct an analogous dual-pathway system for medical LLM inference. A fine-tuned Bio_ClinicalBERT encoder classifies input noise (aleatoric uncertainty) and routes clean inputs through a latent soft-prompting Fast Lane while directing noisy, contradictory records through a cautionary Slow Lane with explicit abstention instructions. To address epistemic uncertainty (the model’s internal confusion), we introduce a T-Adaptive Attention patch at the logits level that modulates generation temperature as an inverse function of hidden-state variance. We evaluate CLR on BioMistral-7B across 400 clinical cases (200 clean, 200 noisy) using RAGAS Faithfulness scored by Llama-4-Scout-17B-16E-Instruct via Groq API. Phase 1 (Aleatoric Routing) achieves 100% routing accuracy and 64.8% faithfulness on clean data with zero alignment tax, but reveals a critical failure: a 0% inconclusive rate on noisy inputs, indicating extreme over-helpfulness bias. Phase 2 (Epistemic T-Adaptive Patch) preserves faithfulness at 64.4% while demonstrating zero computational overhead, but the inconclusive rate remains at 0%. We formally prove this failure as The Greedy Paradox: under greedy decoding, temperature scaling of logits is mathematically nullified because $\arg\max_i (z_i / T) = \arg\max_i z_i$ for all $T > 0$, where $z_i$ denotes the logits and $T$ the temperature.
Phase 3 (Non-Greedy Sampling) degrades faithfulness to 58.5% while still failing to trigger abstention, confirming that over-helpfulness bias is embedded in pre-training weight distributions, not merely in the decoding surface. These results establish that single-agent, test-time interventions are fundamentally insufficient for self-uncertainty modulation in medical LLMs, providing strong empirical justification for transitioning to multi-agent systems with externalized uncertainty arbitration. Keywords: Large Language Models, Hallucination, Uncertainty Quantification, Compound AI Systems, Over-helpfulness Bias, T-Adaptive Attention, Selective Prediction, Aleatoric Uncertainty, Epistemic Uncertainty, Conditional Latent Routing.

Bullet Summary

The paper tackles the critical problem of hallucinations in Medical Large Language Models (LLMs), which generate clinically unfounded but fluent assertions, posing patient safety risks.
Introduces Conditional Latent Routing (CLR), a novel compound inference architecture that decomposes clinical uncertainty into aleatoric (input noise) and epistemic (model uncertainty) components for targeted interventions.
Inspired by computational psychiatry's corollary discharge framework, CLR uses a dual-pathway system with a Bio_ClinicalBERT encoder to classify and route inputs: clean cases go through a Fast Lane with latent soft-prompting, while noisy/contradictory input...
To mitigate epistemic uncertainty, the authors develop a T-Adaptive Attention mechanism that modulates generation temperature based on hidden-state variance, aiming to adapt output confidence dynamically.
Experiments on 400 clinical cases (200 clean, 200 noisy) using the BioMistral-7B model show 100% routing accuracy in Phase 1, with 64.8% faithfulness on clean data but an inability to abstain on noisy inputs, revealing an over-helpfulness bias.
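
The Greedy Paradox described in the abstract above can be checked numerically. The following is a minimal sketch, not the authors' code: the logit values and the helper names `softmax` and `greedy_token` are illustrative assumptions. It shows that dividing logits by any positive temperature changes the probabilities but never the argmax, so greedy decoding selects the same token regardless of temperature.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits (temperature > 0)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits, temperature):
    """Index of the most probable token, i.e. what greedy decoding would emit."""
    probs = softmax(logits, temperature)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical logits for a four-token vocabulary (illustrative values only).
logits = [2.0, 0.5, 3.1, -1.0]

# The distribution sharpens or flattens with temperature, but the winner is unchanged.
greedy_choices = {t: greedy_token(logits, t) for t in (0.1, 0.7, 1.0, 5.0)}
print(greedy_choices)
```

Temperature only has an effect once tokens are sampled from the distribution, which is the setting the abstract's Phase 3 goes on to test.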

A Zero-Trust Agentic AI Methodology for Cyber-Resilient Energy Market Clearing under Residual Cyber Contamination and Dynamic Uncertainty

Merged record merged scholarly record OpenAlex Trust and Identity Governance and Policy

Mohammadmahdi Rezaeifar Sangchouli, Saman Mehrani

Published 2026-06-04

Venue: Research Square

DOI: https://doi.org/10.21203/rs.3.rs-9903418/v1

Abstract

Abstract unavailable from OpenAlex metadata.

Bullet Summary

Real-time energy market clearing is vulnerable to cyber contamination in cyber-physical data streams, risking unreliable market outcomes and settlement legitimacy.
The paper proposes a zero-trust agentic AI framework that employs specialized autonomous agents to verify data trustworthiness before admission to market clearing optimization.
This methodology integrates multi-agent consensus, admissibility screening, cyber-risk-aware objective terms, blockchain-based governance, and human oversight to ensure cyber-resilient, auditable, and feasible market operations under dynamic uncertainty.
Residual cyber contamination and dynamic renewable uncertainty are modeled explicitly, allowing the system to balance operational feasibility, financial defensibility, and security during market clearing.
The agentic AI layer produces trust scores and disagreement metrics which inform binary admissibility decisions, with escalation to human operators when trust is insufficient or agent consensus is weak.
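
The admissibility step in the last bullet can be illustrated with a small sketch. Since the abstract is unavailable, everything concrete here is an assumption: the `Decision` type, the `screen` function, the use of the mean as the trust score, the population standard deviation as the disagreement metric, and both thresholds are hypothetical choices, not the paper's method.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class Decision:
    admitted: bool       # binary admissibility of the data stream
    escalate: bool       # hand the case to a human operator
    trust: float         # aggregate trust score across agents
    disagreement: float  # spread of the agents' scores

def screen(agent_scores, min_trust=0.8, max_disagreement=0.15):
    """Admit data only when agents trust it and agree with each other.

    Thresholds are hypothetical placeholders, not values from the paper.
    """
    trust = mean(agent_scores)
    disagreement = pstdev(agent_scores)
    low_trust = trust < min_trust
    weak_consensus = disagreement > max_disagreement
    escalate = low_trust or weak_consensus
    return Decision(admitted=not escalate, escalate=escalate,
                    trust=trust, disagreement=disagreement)

print(screen([0.95, 0.92, 0.90]))      # high trust, strong consensus
print(screen([1.00, 0.60, 1.00, 0.90]))  # acceptable mean, weak consensus
```

In this sketch a stream with an acceptable average score is still escalated when one agent strongly dissents, which mirrors the bullet's "agent consensus is weak" condition.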

A-Live: Passive Liveness Detection via Neuromuscular Micro-Motion Signatures on Commodity Sensors

arXiv preprint arXiv Trust and Identity Benchmarks and Evaluation

Mohammed Gharib, Sam Burns, Martin Zizi

Published 2026-06-03

Venue: arXiv

Abstract

Liveness detection has evolved from a safeguard against presentation and replay attacks in biometric authentication to a broader requirement for distinguishing human users from non-human agents in modern digital systems. The emergence of generative and agentic AI further amplifies this need, positioning liveness as a fundamental security primitive. Existing approaches face key limitations, including reliance on explicit user interaction, specialized hardware, vulnerability to increasingly realistic spoofing, and limited scalability in real-world deployments. We present A-Live, a passive liveness detection framework that operates solely on inertial measurement unit (IMU) signals available in commodity devices. A-Live is based on the observation that neuromuscular micro-motions inherent to human motor control produce subtle but measurable signatures in inertial data, which are often treated as noise in prior work. We design a lightweight feature extraction pipeline and a compact classifier suitable for real-time on-device deployment, and introduce a controllable physical micro-motion platform to evaluate robustness against engineered non-human motion. Extensive evaluation across Android and iOS devices, including both automated and real-user settings, shows that A-Live achieves over 99.5% accuracy with low false acceptance and rejection rates. Our results demonstrate that neuromuscular micro-motion signatures provide a scalable and passive foundation for liveness detection under emerging AI-driven threat models.

Bullet Summary

Liveness detection has become crucial in distinguishing live humans from synthetic or mechanical agents, especially against advanced AI-driven spoofing attacks.
Existing liveness methods either require explicit user interaction, specialized hardware, or are vulnerable to realistic spoofing; they also face scalability issues.
A-Live introduces a passive approach relying solely on neuromuscular micro-motion signatures measurable by commodity IMU sensors (accelerometers and gyroscopes) in smartphones and wearables.
These neuromuscular micro-movements are subtle, involuntary physiological signals difficult to replicate synthetically or mechanically, providing a robust liveness signal.
The system processes raw IMU data with signal pre-processing, feature extraction (temporal, spectral, stochastic features), and a lightweight classifier optimized for real-time, on-device use across diverse Android and iOS devices.

From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

arXiv preprint arXiv Trust and Identity Agent-to-Agent Communication Benchmarks and Evaluation

Yiqi Wang, Jiaqi Zhang, Taotao Cai, Zirui Liu, Qingqiang Sun, Zequn Sun, Zhangkai Wu, Mingkai Zhang

Published 2026-06-03

Venue: arXiv

Abstract

Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where execution failures originated. Evidence tracing and execution provenance address this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected throughout agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, trace-based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance-related capabilities, and discuss how evaluation can move from final-answer correctness toward process-level accountability. Finally, we outline open challenges, including unified trace schemas, claim-level and semantic provenance, provenance-aware safety mechanisms, realistic execution-trace benchmarks, recovery-oriented evaluation, and privacy-aware audit infrastructure.

Bullet Summary

LLM-based agents integrate diverse components like tools, memory, environments, and multi-agent interactions, increasing autonomy but complicating behavior verification, debugging, and auditing.
Evidence tracing and execution provenance provide frameworks to model and connect the full agent execution workflow including reasoning, tool use, memory updates, and inter-agent communication, going beyond just final-answer correctness.
The paper surveys and organizes existing fragmented research into a unified provenance perspective, proposing a taxonomy covering trace sources, evidence and execution units, provenance relations, granularity, timing, representation, and trust functions.
Provenance relations such as SUPPORT, CONTRADICT, DERIVE, and DEPEND-ON capture semantic and procedural connections critical for trust functions like verification, debugging, safety enforcement, audit, and recovery in multi-agent LLM systems.
Provenance-aware mechanisms help prevent known security risks including unsafe tool use, prompt injections, memory poisoning, and malicious multi-agent inputs by providing runtime guardrails and access controls based on trace information.
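
The provenance relations named in the summary (SUPPORT, CONTRADICT, DERIVE, DEPEND-ON) can be pictured as typed edges in a small graph. The sketch below is one possible encoding under stated assumptions: the `ProvenanceGraph` class, its `add` and `audit` methods, and the node identifiers are hypothetical and are not taken from the survey.

```python
from collections import defaultdict

# Relation names as listed in the summary; the graph structure is an assumption.
RELATIONS = {"SUPPORT", "CONTRADICT", "DERIVE", "DEPEND-ON"}

class ProvenanceGraph:
    def __init__(self):
        # Maps a target node to the (relation, source) edges pointing at it.
        self.edges = defaultdict(list)

    def add(self, source, relation, target):
        """Record that `source` stands in `relation` to `target`."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges[target].append((relation, source))

    def audit(self, claim):
        """Summarise the evidence for and against a claim."""
        incoming = self.edges.get(claim, [])
        supports = [s for r, s in incoming if r == "SUPPORT"]
        contradictions = [s for r, s in incoming if r == "CONTRADICT"]
        return {
            "supported": bool(supports) and not contradictions,
            "supports": supports,
            "contradictions": contradictions,
        }

graph = ProvenanceGraph()
graph.add("tool:weather_api#1", "SUPPORT", "claim:rain_tomorrow")
graph.add("claim:rain_tomorrow", "DERIVE", "answer:bring_umbrella")
print(graph.audit("claim:rain_tomorrow"))

graph.add("memory:note_17", "CONTRADICT", "claim:rain_tomorrow")
print(graph.audit("claim:rain_tomorrow"))
```

A claim with no incoming SUPPORT edge, or with any CONTRADICT edge, is reported as unsupported here; a real system would likely weigh evidence rather than apply such a strict rule.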

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

arXiv preprint arXiv Benchmarks and Evaluation Governance and Policy Trust and Identity

Xinyu Lu, Tianshu Wang, Pengbo Wang, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu

Published 2026-06-03

Venue: arXiv

Abstract

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. Benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.

Bullet Summary

Introduces the Meta-Agent Challenge (MAC), a new evaluation framework designed to measure AI models' ability to autonomously develop and optimize other agent systems rather than merely execute predefined tasks.
MAC operates in a sandboxed environment with strict APIs and time constraints, requiring a meta-agent to iteratively design, implement, and refine subordinate agents to maximize performance across five diverse task domains including math reasoning, science...
The challenge emphasizes recursive self-improvement and system engineering capacities critical for advancing autonomous AI development and addressing AI safety issues such as robustness and alignment.
Multi-layer security mechanisms prevent reward hacking and ensure evaluation integrity by isolating development and evaluation environments, monitoring API usage, and conducting rigorous post-hoc audits to detect cheating attempts like ground-truth exfiltra...
Experimental results show that most meta-agents fail to surpass strong human-engineered baseline agents; only a few configurations (mostly proprietary models) achieve better performance, indicating substantial challenges in autonomous agent development.

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

arXiv preprint arXiv Governance and Policy Trust and Identity Benchmarks and Evaluation

Saroj Mishra

Published 2026-06-03

Venue: arXiv

Abstract

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components (stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering) that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations. CHARM achieves an 89.4% cascade detection rate with a 5.3% false positive rate and an average latency overhead of 215 ± 18 ms per stage, and reduces error propagation by 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

Bullet Summary

Multi-step agentic retrieval-augmented generation (RAG) systems suffer from cascading hallucinations, where initial errors propagate and amplify through subsequent reasoning stages, leading to confident but incorrect outputs undetected by existing single-st...
The paper formalizes cascading hallucination as a distinct multi-stage failure mode with four identified cascade types: Retrieval Cascade, Inference Cascade, Context Poisoning Cascade, and Confidence Inflation Cascade, each exhibiting unique detection signals.
CHARM (Cascading Hallucination Aware Resolution and Mitigation) is a modular framework designed to detect and mitigate cascading hallucinations in multi-step agentic RAG pipelines without requiring architectural replacement.
CHARM comprises four components: stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering, which collectively monitor semantic and probabilistic trajectories to identify error prop...
The framework models multi-stage pipelines as directed acyclic graphs (DAGs), using a weighted combination of anomaly scores from the three monitors to detect cascades and trigger mitigation actions such as pipeline halting or rollback.
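
The weighted combination of monitor scores mentioned in the last bullet can be sketched as follows. This is an illustration under assumptions, not CHARM's implementation: the weights, the threshold, the function names, and the score values are hypothetical, and the pipeline is simplified to a linear chain of stages (the simplest DAG).

```python
def cascade_score(fact, consistency, confidence, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of three per-stage anomaly scores, each in [0, 1].

    The weights are hypothetical; the summary does not state the real values.
    """
    w_fact, w_consistency, w_confidence = weights
    return w_fact * fact + w_consistency * consistency + w_confidence * confidence

def first_cascade_stage(stage_scores, threshold=0.6):
    """Return the index of the first stage whose combined score crosses the
    threshold, or None if the pipeline looks clean. A caller could halt or
    roll back the pipeline at that stage."""
    for index, scores in enumerate(stage_scores):
        if cascade_score(**scores) >= threshold:
            return index
    return None

# Hypothetical anomaly scores for a three-stage retrieval/reasoning chain.
pipeline = [
    {"fact": 0.1, "consistency": 0.1, "confidence": 0.2},  # retrieval
    {"fact": 0.8, "consistency": 0.7, "confidence": 0.5},  # inference
    {"fact": 0.9, "consistency": 0.9, "confidence": 0.9},  # synthesis
]
print(first_cascade_stage(pipeline))
```

Stopping at the first flagged stage reflects the idea of interrupting propagation early rather than inspecting only the final output.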

Learning to cooperate with emergent reputation via multi-agent reinforcement learning

Merged record merged scholarly record arXiv Trust and Identity Governance and Policy Agent-to-Agent Communication

Xinwei Song, Yizhe Huang, Dengji Zhao, Xue Feng

Published 2026-06-03

Venue: arXiv

Abstract

Reputation, the aggregation of peer assessments diffused through social networks, is a pivotal mechanism for promoting cooperation in social dilemmas ubiquitous to distributed multi-agent systems comprising agents with limited perception and cognitive capabilities. Exploring efficient reputation systems, comprising reputation assessment rules and reputation-based policies, is a long-standing challenge. Previous work assumes predefined reputation assessment rules or models reputation as an intrinsic reward to learn policies, compromising the methods' ability for generalization and adaptation. To address this, we propose a distributed multi-agent reinforcement learning method COOPER (COOPeration with Emergent Reputation), which jointly learns reputation assessment rules and reputation-based policies entirely from environment rewards. Notably, leveraging the underlying mechanisms of reputation, we deliberately design the constituent modules of COOPER and the data flows among them, overcoming the latency and noise in the feedback signal, caused by the deep entanglement between reputation and policy. Experiments on the donation game and the coin game in grid world environments demonstrate that COOPER effectively adapts to various existing reputation systems and co-players. Furthermore, we observe the co-emergence of reputation norms and cooperation in self-play settings. These results hold robustly across diverse social network topologies, underscoring the generalizability and efficacy of our approach.

Bullet Summary

Reputation mechanisms are essential for fostering cooperation in distributed multi-agent systems with inherent social dilemmas and limited agent capabilities.
Existing methods rely on fixed reputation rules or intrinsic reward shaping, limiting adaptability and generalization across different social environments.
The paper proposes COOPER, a novel decentralized multi-agent reinforcement learning framework that jointly learns both reputation assessment rules and reputation-based policies solely from extrinsic environmental rewards, without predefined norms.
COOPER incorporates two complementary reputation assessment modules: gossip-based assessments aggregating neighbors’ opinions and interaction-based assessments using direct interaction histories, improving robustness against noise and latency.
An alternating optimization scheme aligns updates between reputation assignment and policy modules, supported by consensus regularization and entropy to promote exploration and convergence.

The Digital Apprentice: A Framework for Human-Directed Agentic AI Development

arXiv preprint arXiv Governance and Policy Trust and Identity Orchestration Risk

Travis Weber, Rohit Taneja

Published 2026-06-03

Venue: arXiv

Abstract

Agentic AI deployments face a recurring design tension: heavy human oversight limits scale, while broad autonomy outruns accountability. Neither posture provides the governance infrastructure required for responsible delegation. We present the Digital Apprentice, a framework for scalable, safe AI agency in which autonomy is earned, not assumed. The Digital Apprentice is a developmental learner that internalizes the tacit methodology of a directing human, graduating through per-skill autonomy tiers only when empirical evidence justifies it. The result is an agent that becomes genuinely useful over time while remaining aligned to a specific human's standards. Three architectural components make this possible. (1) Methodology capture, distilling a directing professional's tacit approach into structured assets. (2) Authorization, with autonomy escalation gated by explicit human approval. (3) Continuous alignment, correcting drift at runtime and converting each correction into owned preference data. We instantiate this framework as an inference-time control plane. We mathematically model the quality framework and discuss policies and techniques designed to raise quality. We apply the framework to an open professional corpus, and we show how catching data drift and applying a different technique at runtime recovers degraded quality dimensions under traffic shift. The implication extends beyond any single application. We believe these three pillars, stitched together as a system, form a safer and more viable path to agentic systems that can scale without sacrificing trust.

Bullet Summary

The paper introduces the Digital Apprentice framework for scalable and safe agentic AI, where AI autonomy is earned per skill through empirical evidence and explicit human authorization, ensuring alignment with specific human standards.
Autonomy levels are managed using a finite state machine with tiers from observe-only to full autonomy, with promotions requiring improved correction rates, low residual errors, scorer validation, and human approval; demotions occur automatically upon quali...
Learning comprises two phases: immediate output steering via human correction memory, and controlled model updates triggered after accumulating sufficient correction data, allowing traceable and reversible improvements.
ADAPT, an inference-time control plane implementation, synthesizes methodology assets, applies multiple policies, scores outputs across quality dimensions, and translates corrections into preference data for continuous alignment and learning.
The framework continuously monitors multidimensional data drift and triggers policy switching or recalibration to maintain quality under changing conditions, addressing distribution shift and reducing automation complacency.
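
The per-skill autonomy tiers described in the summary can be pictured as a small finite state machine. The sketch below is illustrative only: the two intermediate tier names, the `next_tier` function, the `max_residual_error` threshold, and the `quality_regressed` flag are assumptions (the summary names only the observe-only and full-autonomy endpoints, and its description of the demotion trigger is truncated).

```python
from enum import IntEnum

class Tier(IntEnum):
    OBSERVE_ONLY = 0
    SUGGEST = 1          # hypothetical intermediate tier
    ACT_WITH_REVIEW = 2  # hypothetical intermediate tier
    FULL_AUTONOMY = 3

def next_tier(tier, *, correction_rate_improved, residual_error,
              scorer_validated, human_approved, quality_regressed,
              max_residual_error=0.05):
    """Return the tier a skill should hold after one evaluation cycle."""
    # Demotion is automatic and takes priority over any promotion evidence.
    if quality_regressed:
        return Tier(max(tier - 1, Tier.OBSERVE_ONLY))
    # Promotion needs every piece of evidence plus explicit human approval.
    eligible = (correction_rate_improved
                and residual_error <= max_residual_error
                and scorer_validated
                and human_approved)
    if eligible and tier < Tier.FULL_AUTONOMY:
        return Tier(tier + 1)
    return tier

evidence = dict(correction_rate_improved=True, residual_error=0.01,
                scorer_validated=True, quality_regressed=False)
print(next_tier(Tier.OBSERVE_ONLY, human_approved=True, **evidence))
print(next_tier(Tier.OBSERVE_ONLY, human_approved=False, **evidence))
```

Moving one tier at a time in either direction is a design choice of this sketch; it keeps every change in autonomy small and easy to audit.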

Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications

OpenAlex · ArXiv.org repository OpenAlex Trust and Identity Prompt Injection Governance and Policy

Yutao Shi, Xiaohan Zhang, Xiangjing Zhang, Xihua Shen, Hui Ouyang, Huming Qiu, Mi Zhang, Min Yang

Published 2026-06-03

Venue: ArXiv.org

Abstract

The Model Context Protocol (MCP) has emerged as a critical standard empowering Large Language Models (LLMs) to utilize external tools. In this ecosystem, LLMs rely on natural language descriptions provided by MCP servers to select and execute functions. This interaction implicitly assumes that tool descriptions faithfully reflect their underlying implementations, while this assumption is not mandatorily verified in practice. As a result, MCP deployments may suffer from a problem named Description-Code Inconsistency (DCI), where a tool's description of its capabilities and security boundaries is not consistent with what the code actually does. In this paper, we present a comprehensive study of DCI in real-world MCP servers. We formally define the problem and propose a comprehensive taxonomy spanning functionality inconsistencies and undeclared side effects. Guided by this taxonomy, we develop DCIChecker, an automated framework that combines structure-aware static analysis with the Direct-Reverse-Arbitration prompting method to cross-validate tool descriptions against actual code implementations. We apply this framework to a large-scale dataset comprising 19,200 description-code pairs extracted from 2,214 real-world MCP servers. Our measurement reveals that DCI is widespread, with 9.93% of these pairs exhibiting inconsistencies. We further demonstrate that DCI creates a critical defense blind spot, facilitating varied risks from operational failures to stealthy malicious behaviors. Finally, we propose mitigation strategies to enforce semantic consistency and enhance the reliability of the emerging agentic ecosystem.

Bullet Summary

The paper addresses Description-Code Inconsistency (DCI) in Model Context Protocol (MCP) servers, where natural language tool descriptions do not accurately reflect their underlying code implementations, leading to reliability and security challenges for La...
It formally defines DCI with a comprehensive taxonomy categorizing inconsistencies into Functionality Inconsistencies (Type I) and Undeclared Side Effects (Type II), including subtypes such as overclaimed capabilities, undeclared features, unintended side e...
DCIChecker, an automated detection framework, is developed using structure-aware static analysis combined with a novel Direct-Reverse-Arbitration (DRA) prompting technique leveraging LLMs to cross-validate and classify semantic consistency between tool desc...
A large-scale empirical evaluation is conducted on 19,200 description-code pairs extracted from 2,214 real-world MCP servers, revealing that approximately 9.93% of tools and 35% of servers contain at least one DCI case, indicating that description-code mism...
Results show that Functionality Inconsistencies (especially overclaimed functionalities) are the most prevalent DCI subtype, and inconsistencies often cluster within a small subset of servers and tools, pointing to systemic development and documentation iss...

AAIS: A Formal Theory of Everything

OpenAlex · Zenodo (CERN European Organization for Nuclear Research) repository OpenAlex Governance and Policy Trust and Identity

Jon Halstead

Published 2026-06-03

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20531831

Open Source Record

Abstract

AAIS: A Formal Theory of Everything, Whitepaper v1.26.1. This whitepaper introduces AAIS (Autonomous Agentic Intelligence System), the first cognitive runtime whose behavior is defined entirely by a mathematically formalized governance architecture. Rather than treating safety as an external layer, AAIS makes governance the constitutive physics of cognition itself. As stated in the paper, “governance is the physics of governed cognition: without invariant-preserving transitions, no stable cognitive runtime can be sustained.”

The work presents:
Six primitive invariants (Identity, Boundary, Continuity, Causality, Mutation, Composite)
A governed transition function that restricts the system to a formally defined invariant manifold
A stability theorem proving that invariant-preserving cognition remains stable across all transitions
The Cognitive Organ model (Perception, Memory, Coding, Media, Story, Execution)
The Operator‑Tiered Execution Model (OTEM) with 20 authority levels
A fully specified governance genome that defines capability ceilings, gates, and operator authority
Emergent laws of governed cognition derived from the invariant structure

The central claim of the paper is that a cognitive system cannot be safe unless its state transitions are governed by construction. As the document states, “a cognitive system defined independently of its governance structure is, by construction, ungovernable at the structural level

Bullet Summary

Introduces AAIS, the first cognitive runtime with behavior defined entirely by a mathematically formalized governance architecture, integrating governance into cognition itself rather than as an external safety layer.
Identifies six primitive invariants—Identity, Boundary, Continuity, Causality, Mutation, Composite—that form the foundational principles governing cognitive stability.
Defines a governed transition function that restricts system state changes to an invariant manifold, ensuring stability of cognition across all transitions.
Presents a stability theorem proving that invariant-preserving cognition remains stable through all state transitions, establishing a theoretical foundation for safe cognitive operations.
Develops the Cognitive Organ model composed of Perception, Memory, Coding, Media, Story, and Execution elements to describe internal cognitive processes.

Contemporaneous Bridge Studies on Foundation-Model Research Integrity, Agentic Science, and Collective Measurement: Eight Working Notes (2025-2026)

OpenAlex · Zenodo (CERN European Organization for Nuclear Research) repository OpenAlex Governance and Policy Trust and Identity

Andreas Ehstand

Published 2026-06-03

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20530313

Open Source Record

Abstract

This restricted bundle contains eight bridge studies engaging empirical and theoretical developments in foundation-model research published between 2024 and 2026, relating each to a methodological programme on cross-corpus terminology engineering. The studies address: the 2024 Journal of Economic Surveys state-of-the-field paper on metaresearch and its structural parallels to cross-corpus methodology; a methodological pluralism analysis comparing human-led cross-corpus approaches to the agentic-science paradigm; the fabricated-citation crisis documented at recent NeurIPS proceedings and the argument that cross-corpus audit provides a structural response to the semantic hallucination layer that automated citation verification does not address; a framework on the gradual shift of cognitive load to AI systems and the operationalisation of its three criteria; a critique of agentic-scientist structural limitations and an alternative methodological architecture; EU AI Act (Regulation 2024/1689) regulatory propagation in multi-agent systems and predicted compliance cascade phenomena; distributed sycophancy cascade in multi-agent multi-human teams; and cross-team output convergence and homogenization in foundation-model-assisted knowledge work. All bridge studies are independent research hypotheses produced without institutional affiliation or peer review, shared under CC BY-NC-ND 4.0. Part of the AUGMANITAI Research Programme working corpus (concept-DOI 10.5281/zenodo.20161494). Living document; prepared with AI assistance.

Bullet Summary

The paper comprises eight independent bridge studies exploring empirical and theoretical advancements in foundation-model research from 2024 to 2026, linked to cross-corpus terminology engineering methods.
It examines the 2024 Journal of Economic Surveys' metaresearch analysis and reveals structural parallels with cross-corpus methodology, highlighting methodological synergies.
A methodological pluralism study contrasts human-led cross-corpus techniques against the agentic-science paradigm, revealing their respective strengths and limitations.
The work investigates a fabricated-citation crisis at recent NeurIPS conferences and proposes cross-corpus auditing as an effective structural response addressing semantic hallucinations beyond automated citation checks.
A novel framework is introduced outlining the progressive shifting of cognitive load onto AI systems, detailing three operationalized criteria for this transition.

How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment

Merged record merged scholarly record arXiv Semantic Scholar Trust and Identity Agent-to-Agent Communication Governance and Policy

Kokil Jaidka, Saifuddin Ahmed

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

This study analyzes a publicly released dataset from a discontinued field experiment on Reddit's r/ChangeMyView. The intervention, conducted by unknown, external researchers and halted following ethical backlash, involved undisclosed AI-generated accounts engaging users in live debate. After public disclosure, Reddit authorized moderators to release an archive of the AI-generated comments, creating a rare opportunity to examine how large language models operated in an identity-rich deliberative forum without disclosure. We conduct a structured content analysis of this corpus, evaluating identity performance, authority signaling, alignment strategies, and activation of cognitive heuristics. Identity targeting or adoption appears in over two-thirds of comments, alignment moves and authority claims in nearly all of them, and cognitive-bias triggers -- particularly confirmation bias, representativeness, and availability -- in the large majority. These patterns co-occur systematically, composing a rhetorical architecture calibrated for persuasive efficiency rather than authentic deliberative participation. Compared against human-authored CMV counter-arguments, the agents inverted the typical distribution on every dimension: denser authority use, more adversarial alignment, and heavier reliance on external citation over experiential grounding. In such environments, distinctions between authentic and synthetic epistemic standing grow increasingly opaque -- an asymmetry that disclosure mandates alone cannot address. The results point toward auditing frameworks capable of assessing how AI systems structure credibility, not merely whether they are present.

Bullet Summary

Conducted analysis of a discontinued covert field experiment on Reddit's r/ChangeMyView where undisclosed AI-generated accounts engaged in live debates, raising ethical concerns and leading to the experiment's termination.
Leveraged a rare, publicly released dataset of 1,532 AI-generated comments spanning four months with multiple large language models involved under different conditions.
Examined how large language models performed identity targeting or adoption, authority signaling, alignment strategies (positive, negative, mixed), and cognitive bias activation (confirmation bias, availability heuristic, representativeness heuristic, etc.)...
Found that over two-thirds of AI comments engaged in identity targeting or adoption, nearly all contained authority claims and alignment moves, and a majority triggered cognitive biases, orchestrating a rhetorical architecture optimized for persuasion rathe...
Compared to human-authored counter-arguments, the AI agents used denser authority claims, more adversarial alignment, and relied more heavily on external citations than experiential grounding, inverting typical rhetorical patterns.

Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal

arXiv preprint arXiv Trust and Identity Agent-to-Agent Communication Governance and Policy

Michał Wawer, Jarosław A. Chudziak

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation. We argue that this objective is insufficient for value-laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning-trace disagreement in human-AI collaborative moderation, we propose a knowledge-representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents producing explicit reasoning traces and binary decisions, we distinguish four states according to reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement-aware routing provides a bridge between sub-symbolic LLM deliberation and symbolic knowledge representation for multi-agent strategic reasoning.

Bullet Summary

Traditional multi-agent systems prioritize reducing disagreement through consensus, but this approach falls short for value-laden tasks where disagreement reflects genuine normative uncertainty rather than mere error.
The authors introduce a knowledge-representation layer categorizing agent interactions into four symbolic disagreement states based on reasoning trace similarity and decision agreement: convergent agreement (CA), divergent agreement (DA), convergent disagre...
Each disagreement state is linked with defeasible strategic routing rules that guide meta-actions like automatic decision-making, explanation preservation, seeking additional context, or escalation to human judgment.
Convergent disagreement (CD), where agents reason similarly but reach different conclusions, is identified as a critical marker of normative conflict requiring special strategic handling, such as escalation.
The framework is instantiated in a large language model (LLM)-based content moderation setting, demonstrating how disagreement-aware routing bridges sub-symbolic deliberation with symbolic knowledge representation in multi-agent reasoning.
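
The four-state abstraction can be sketched in a few lines. The token-overlap similarity measure, the 0.5 threshold, and every routing rule except escalation for convergent disagreement are assumptions made for illustration; the summary does not specify how the paper measures reasoning similarity.

```python
from enum import Enum


class State(Enum):
    CONVERGENT_AGREEMENT = "CA"     # similar reasoning, same decision
    DIVERGENT_AGREEMENT = "DA"      # different reasoning, same decision
    CONVERGENT_DISAGREEMENT = "CD"  # similar reasoning, different decisions
    DIVERGENT_DISAGREEMENT = "DD"   # different reasoning, different decisions


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a real trace comparison."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def classify(trace_a: str, decision_a: bool,
             trace_b: str, decision_b: bool,
             threshold: float = 0.5) -> State:
    """Map two (trace, binary decision) pairs onto one of the four states."""
    similar = jaccard(trace_a, trace_b) >= threshold
    if decision_a == decision_b:
        return State.CONVERGENT_AGREEMENT if similar else State.DIVERGENT_AGREEMENT
    return State.CONVERGENT_DISAGREEMENT if similar else State.DIVERGENT_DISAGREEMENT


# Routing table. Only CD -> escalation is stated in the summary above;
# the other three entries are illustrative guesses.
ROUTES = {
    State.CONVERGENT_AGREEMENT: "decide_automatically",
    State.DIVERGENT_AGREEMENT: "decide_and_preserve_explanations",
    State.CONVERGENT_DISAGREEMENT: "escalate_to_human",
    State.DIVERGENT_DISAGREEMENT: "seek_additional_context",
}

if __name__ == "__main__":
    state = classify("post insults a protected group", True,
                     "post insults a protected group", False)
    print(state.value, ROUTES[state])
```

Because the rules are described as defeasible, a fuller version would let later evidence override a route rather than treating the table as final.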

SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models

arXiv preprint arXiv Agent-to-Agent Communication Benchmarks and Evaluation Trust and Identity

Joel Sol, Homayoun Najjaran

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features such as decentralized control, partial observability and long-horizon decision making. SMAC-Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.

Bullet Summary

SMAC-Talk extends the StarCraft Multi-Agent Challenge by integrating natural language observation, action, and communication channels to evaluate large language model (LLM) agents in cooperative multi-agent environments featuring decentralized control and p...
The benchmark includes various communication scenarios such as no communication, free communication, and adversarial conditions with a deceptive communicator agent that tries to mislead allies solely through natural language.
Adapters are developed to convert numerical game observations to textual input for LLMs and translate LLM-generated natural language commands into discrete game actions, enabling compatibility with StarCraft II unit micromanagement tasks.
Evaluation involved four Qwen3.5 LLM sizes (4B to 122B parameters) and different agent architectures including zero-shot, chain-of-thought reasoning, ReAct, and deceptive communicators, studying effects of model size and reasoning structure on coordination.
Results show that reasoning agents using internal chain-of-thought outperform zero-shot and ReAct agents, and larger models better leverage communication and resist deceptive messages, with minimal reliable performance observed at 9B parameters.

Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions

arXiv preprint arXiv Trust and Identity Governance and Policy

Juan Figuera

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and the operator running the agent has no independent way to detect tampering. We propose a class of protocols that resolves this by inverting the trust boundary: the service that receives an agent's call signs a receipt of what it observed using its own key, encrypts the receipt to the agent's owner, and publishes it to a public transparency log. The owner reconstructs a tamper-evident trail without trusting the agent or its operator. We instantiate the class as Sello, a protocol combining four properties absent in any current system: (P1) receiver-side signing, (P2) HPKE encryption to an owner public key bound to the authorization token via JWS, (P3) publication to a witness-cosigned Merkle log, and (P4) owner-side discovery by token reference. We describe the protocol, analyze its security under an adversary that controls the agent and its operator, present microbenchmarks of the cryptographic operations, and situate Sello among adjacent receipt-protocol work (Signet, AgentROA, Agent Passport System, draft-farley-acta, SCITT). We discuss known limitations including the suppression attack, service collusion, and the adoption-incentive problem.

Bullet Summary

Current AI agents produce their own activity logs, allowing compromised agents or operators to omit or falsify logs without independent detection.
The paper introduces 'notarized-agent' protocols where services receiving agent actions create cryptographically signed, encrypted receipts that are published publicly, enabling tamper-evident trails independent of the agent.
The Sello protocol is proposed, uniquely combining four properties: receiver-side signing of receipts, HPKE encryption to owner-bound keys linked to authorization tokens, publication of receipts to a witness-cosigned Merkle transparency log, and owner-side...
Security analysis shows that Sello provides unforgeability, confidentiality, and tamper-evidence for agent actions even if the agent and operator are fully compromised.
Experimental evaluation demonstrates low cryptographic overhead with median receipt creation latency around 0.45 ms and verification latency around 0.28 ms on modern hardware, producing fixed-size receipts referencing hashed data rather than full contents.
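
The transparency-log property (P3) rests on Merkle inclusion proofs, which can be sketched with the standard library alone. This is a generic illustration, not Sello's wire format: signing, HPKE encryption, and witness cosigning are omitted, and the tree pads odd levels by duplicating the last node, a simplification that production logs avoid.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(receipt: bytes) -> bytes:
    # A distinct prefix keeps leaf hashes from colliding with inner nodes.
    return _h(b"\x00" + receipt)


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def _next_level(level: list[bytes]) -> list[bytes]:
    return [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(receipts: list[bytes]) -> bytes:
    if not receipts:
        raise ValueError("log is empty")
    level = [leaf_hash(r) for r in receipts]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # simplification: duplicate the last node
        level = _next_level(level)
    return level[0]


def inclusion_proof(receipts: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Return (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    level = [leaf_hash(r) for r in receipts]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(receipt: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from one receipt and its proof."""
    h = leaf_hash(receipt)
    for sibling, sibling_is_left in proof:
        h = node_hash(sibling, h) if sibling_is_left else node_hash(h, sibling)
    return h == root
```

An owner holding only the published root can then confirm that a given receipt is in the log without downloading every other receipt.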

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

arXiv preprint arXiv Trust and Identity Governance and Policy Agent-to-Agent Communication

Yingqi Zhang

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate tools, and perform side effects that must be resumed and audited. This paper presents Agent libOS, a library-OS-inspired runtime substrate for LLM agents. Agent libOS runs above a conventional host operating system; it does not implement hardware drivers, kernel-mode isolation, or a POSIX-compatible operating system. Instead, it treats an agent as an AgentProcess: a schedulable execution subject with process identity, parent-child lineage, lifecycle state, a tool table derived from an AgentImage, typed Object Memory, explicit capabilities, human queues, checkpoints, events, and audit records. Its central design rule is that tools are libc-like wrappers, while runtime primitives are the authority boundary. Filesystem access, object access, sleeps, human approval, JIT tool registration, and external side effects are checked at primitive boundaries under explicit capabilities and policy. We describe the design, threat model, Python prototype, and safety-oriented evaluation. The current prototype implements async scheduling, namespace-local Object Memory, runtime-integrated human approval, one-shot permission grants, per-process working directories, shell and image-registration primitives, Deno/TypeScript JIT tools over a libOS syscall broker, filesystem/object bridge tools, an injectable Resource Provider Substrate, deterministic demos, real-model smoke scripts, and 123 regression tests at the time of writing. Rather than improving planner accuracy, Agent libOS demonstrates a runtime substrate in which long-running LLM agents can be scheduled, authorized, resumed, and audited without treating tool dispatch as the trust boundary.

Bullet Summary

LLM agents are evolving into long-running software entities requiring persistent state, task forking, event waiting, human authorization, and auditable side effects.
Existing frameworks conflate action visibility with authority, leading to security risks; Agent libOS addresses this by enforcing precise authority boundaries.
Agent libOS is a library-OS-inspired runtime that manages agents as AgentProcesses with explicit capabilities, schedulability, tool tables, typed Object Memory, checkpoints, and audit records.
The runtime enforces authority boundaries by treating tools as libc-like wrappers and controlling resource access through runtime primitives under explicit policy constraints.
Agent libOS supports lifecycle management, attenuated capabilities, and process operations with strong capability checks for image loading and authority grants.
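
The design rule that tools are thin wrappers while primitives hold authority can be sketched as follows. The class, function, and capability names here are hypothetical and are not taken from the Agent libOS prototype; the sketch only shows the shape of a check performed at the primitive boundary, with attenuation on fork and a one-shot grant.

```python
from dataclasses import dataclass, field


class CapabilityError(PermissionError):
    """Raised when a process lacks the capability a primitive requires."""


@dataclass
class AgentProcess:
    pid: int
    capabilities: set[str] = field(default_factory=set)
    one_shot: set[str] = field(default_factory=set)   # single-use grants
    audit: list[str] = field(default_factory=list)

    def fork(self, pid: int, keep: set[str]) -> "AgentProcess":
        # Attenuation: a child can hold at most what its parent holds.
        return AgentProcess(pid, self.capabilities & keep)


def check(proc: AgentProcess, cap: str) -> None:
    """Authorize one primitive call and record the outcome."""
    if cap in proc.capabilities:
        allowed = True
    elif cap in proc.one_shot:
        proc.one_shot.discard(cap)  # consumed by this call
        allowed = True
    else:
        allowed = False
    proc.audit.append(f"{cap}:{'allow' if allowed else 'deny'}")
    if not allowed:
        raise CapabilityError(f"pid {proc.pid} lacks {cap}")


def prim_object_write(proc: AgentProcess, store: dict, key: str, value: str) -> None:
    """Runtime primitive: the only place where authority is checked."""
    check(proc, "object.write")
    store[key] = value


def tool_save_note(proc: AgentProcess, store: dict, text: str) -> None:
    """Tool: a libc-like wrapper with no authority of its own."""
    prim_object_write(proc, store, "note", text)
```

Because the check lives in the primitive, a newly generated tool gains no authority merely by being registered; it can only do what the calling process is already allowed to do.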

Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

arXiv preprint arXiv Governance and Policy Trust and Identity

Amjad Ibrahim, Yong Li

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

As AI systems evolve from passive models into autonomous active agents capable of initiating actions, collaborating, and delegating tasks, the traditional boundaries of software systems blur. Traditional authorization and delegation frameworks, built around fixed principals, explicit requests, and static scopes, are insufficient to govern agentic systems. Agentic AI demands richer authorization semantics: agents must inherit and delegate permissions, act under time-limited authority, and coordinate through shared protocols. Existing Identity and Access Management (IAM) systems fail to fully capture this notion of agency, lacking mechanisms for recursive delegation, contextual boundaries, and dynamic scoping as executable governance primitives. Unlike access delegation standards such as OAuth 2.0, we treat delegation as a contractual term rather than merely a static token-based consent credential. This paper proposes a compositional governance framework that introduces primitives indispensable for agentic AI. We define types of delegation and their permissions and accountability implications, and we introduce a notion of resource scope attenuation to bound agentic access envelopes. These concepts are expressed as general relational definitions that can be composed into existing authorization domains (e.g., financial systems). To operationalize this composition, we define a compositional operator that overlays new agentic semantics, such as recursive delegation chains, onto existing relational policies without rewriting them. We substantiate this framework through formal proofs and empirical evaluation, showing that it provides a formal yet practical foundation for accountable authorization in agentic AI systems.

Bullet Summary

Traditional authorization frameworks like RBAC and OAuth 2.0 are inadequate for governing autonomous agentic AI systems due to their static and token-based nature.
Agentic AI requires richer authorization semantics including recursive delegation chains, dynamic scoping, contractual delegation terms with constraints like expiration, and scope attenuation to limit agent authority.
The paper proposes a compositional authorization framework that overlays agentic governance primitives such as delegation types, resource scope attenuation, and sessions onto existing relation-based access control (ReBAC) models via a non-destructive graph...
The framework defines authorization envelopes combining delegated authority and contextual resource scopes, ensuring fine-grained, accountable, and context-aware permissions for AI agents, rooted in valid human permissions.
The Agent Controller Engine (ACE) operationalizes the framework by integrating authentication, delegation management, and auditing, providing a zero-trust authorization service for AI agents.
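
Scope attenuation along a recursive delegation chain can be sketched as a simple invariant: each delegate receives a subset of its delegator's scope and an expiry no later than its delegator's. The `Delegation` class and its fields are hypothetical and much simpler than the relational definitions the paper describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Delegation:
    holder: str
    scope: frozenset[str]
    expires_at: float                      # seconds on any monotonic clock
    parent: Optional["Delegation"] = None

    def delegate(self, holder: str, scope: set[str], expires_at: float) -> "Delegation":
        """Create a child link; reject any attempt to widen scope or lifetime."""
        child_scope = frozenset(scope)
        if not child_scope <= self.scope:
            raise ValueError("delegated scope exceeds the delegator's scope")
        if expires_at > self.expires_at:
            raise ValueError("delegation outlives the delegator's authority")
        return Delegation(holder, child_scope, expires_at, self)

    def permits(self, resource: str, now: float) -> bool:
        """Every link back to the root must still cover the resource."""
        link: Optional[Delegation] = self
        while link is not None:
            if resource not in link.scope or now >= link.expires_at:
                return False
            link = link.parent
        return True

    def chain(self) -> list[str]:
        """Holders from the root principal down to this link, for auditing."""
        holders = []
        link: Optional[Delegation] = self
        while link is not None:
            holders.append(link.holder)
            link = link.parent
        return holders[::-1]
```

Walking the whole chain in `permits` keeps authority rooted in the original human grant, which matches the summary's point that agent permissions should trace back to valid human permissions.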

Agentic Relationship Harm: Benchmarking and Gating Relational Manipulation in AI Agents

arXiv preprint arXiv Benchmarks and Evaluation Governance and Policy Trust and Identity

Pei-Sze Tan, Tasuku Igarashi, Isao Echizen

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

AI agents built on large language models can assist not only with legitimate tasks but also with relational manipulation: they can be used to help a user maintain a deceptive identity, intensify emotional dependency, isolate a target, or prepare for later extraction. We conceptualise this risk as agentic relationship harm: workflow-level assistance that can exploit recipient vulnerability, persuasive influence, and relational power asymmetry. Existing safety evaluations and generic guardrails often treat harmfulness as a property of isolated outputs, missing role-sensitive interaction patterns. To study this, we introduce a 110-prompt benchmark with balanced attacker- and victim-side cases, a relationship-specific labelling framework, and a lightweight post-generation policy gate for local agent deployments. In our evaluation, the relationship-specific gate outperforms generic safety prompting under automated judging, with no judge-identified harmful-compliance cases on the main benchmark or multi-turn stress test while preserving victim-side protective intervention. These results suggest that relationship harm is a distinct sociotechnical risk surface and that role-sensitive evaluation plus lightweight policy gating offers a practical path beyond generic refusal prompting.

Bullet Summary

AI agents built on large language models can facilitate relational manipulation, termed 'agentic relationship harm', by assisting users in deceptive, coercive, or exploitative workflows affecting human-human relationships.
Traditional AI safety evaluations often focus on isolated outputs and do not capture manipulative interaction patterns that depend on user roles, such as attacker versus victim, necessitating role-sensitive assessment frameworks.
The authors introduce a 110-prompt benchmark with balanced attacker- and victim-side cases, a relationship-specific labeling framework capturing recipient vulnerability and power asymmetry, and a structured LLM judge to evaluate harmful assistance versus pr...
A lightweight post-generation policy gate is proposed that effectively blocks harmful compliance in attacker-mode prompts while preserving protective interventions for victim-mode prompts, outperforming generic safety prompting methods.
Relationship harm involves nuanced trade-offs where certain AI-assisted behaviors can be harmful if aiding an attacker but protective if supporting a victim, highlighting the need for intent- and role-sensitive mitigation strategies.

Focused on the User, Overlooking the Risks: Security and Privacy Understandings, Practices and Challenges of Independent Chinese AI Agent Developers

arXiv preprint arXiv Governance and Policy Trust and Identity

Shuning Zhang, Mingyao Xu, Zhixin Huang, Yutong Jiang, Rongjun Ma, Yuting Yang, Xin Yi, Kanye Ye Wang

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

The proliferation of AI agents empowers independent developers, defined as individuals or small groups who self-initiate projects rather than fulfill client-based contracts, to create sophisticated autonomous systems, but also introduces novel security and privacy (S&P) challenges beyond traditional corporate structures. We conducted an interview study (N=28) with Chinese developers, whose extensive use of global LLM services offers valuable insights into this population. We investigate their S&P understandings, practices, and challenges in the AI agent products they develop. We found that independent developers frequently think and act from their users' perspective. They focused on user-facing safety risks such as harmful content while exhibiting low awareness of security vulnerabilities. Consequently, developers rely almost exclusively on ad-hoc, manually crafted safeguards and informal communication, with no formal tools or processes for S&P practices. We found these actions are driven by various inhibitors, primarily a lack of formal training in S&P-related skills, of accessible security tools, and of actionable guidance from platforms. Our work contributes the first exploration of independent AI agent developers' S&P understanding, outlining opportunities for tailored security tooling.

Bullet Summary

The paper investigates security and privacy (S&P) understandings, practices, and challenges of independent Chinese AI agent developers—individuals or small groups who self-initiate autonomous AI projects outside traditional corporate structures.
Developers focus predominantly on user-facing safety risks, such as harmful content, and exhibit low awareness of systemic security vulnerabilities and privacy risks associated with their AI agents.
Security and privacy practices are informal and ad-hoc, relying heavily on manual safeguards, intuition, and community feedback rather than formal tools, training, or governance frameworks.
Developers often externalize responsibility for S&P risks to third-party platforms and services, assuming that reputable providers are inherently secure, which leads to reactive rather than proactive security approaches.
Inhibitors to effective S&P implementation include motivational factors (e.g., prioritizing functionality over security), resource constraints (limited time, funding, ecosystem support), and unclear or insufficient regulatory guidance.

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

arXiv preprint arXiv Governance and Policy Trust and Identity Benchmarks and Evaluation

Thanh Luong Tuan, Abhijit Sanyal

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We present an ontology-grounded verification framework -- to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine-verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam (where Vietnam's 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation significantly outperformed the dominant persona-based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e-6); transparently, its advantage over plain and retrieval-augmented prompting did not survive Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The framework offers a reproducible, regulation-grounded route to pre-deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.

Bullet Summary

The paper addresses the critical challenge of pre-deployment verification of enterprise AI agents to ensure regulatory compliance, safety, and trustworthiness, especially in highly regulated industries like finance and healthcare.
It introduces an ontology-grounded verification framework comprising three core components: (1) an Agent Operational Envelope formalizing permissions, constraints, safety properties, and governance rules; (2) an ontology-to-scenario generation pipeline that...
A controlled pilot study was conducted across four regulated industries and two jurisdictions (United States and Vietnam), generating 1,800 test scenarios from 125 primary regulatory requirements and 25 injected faults to empirically evaluate the framework.
Ontology-grounded scenario generation significantly outperformed traditional persona-based generation methods in regulatory coverage (48.3% vs. 33.1%) and domain specificity, demonstrating higher alignment with complex regulatory environments and industry c...
The framework employs advanced evaluation techniques including LLM-as-judge agents assessing compliance, safety, and adversarial resilience, supported by formal verification methods such as temporal logic and bounded model checking to create a spectrum of p...

Do Matching Mechanisms Work with LLM Agents?

Semantic Scholar · Semantic Scholar scholarly work Semantic Scholar Agent-to-Agent Communication Trust and Identity

Yuki Hoshino, Ayato Kitadai, Nariaki Nishino

Published 2026-06-02

Venue: Semantic Scholar

Open Source Record

Abstract

This study examines whether standard matching mechanisms function as intended in LLM-agent markets, where LLM agents make allocation-related decisions as delegated decision-makers. We compare decentralized free-negotiation markets with centralized mechanism-based markets including several representative mechanisms. Across controlled one-to-one matching environments, mechanism-based markets generally outperform free negotiation in terms of stability and efficiency. We also find that LLM agents report preferences truthfully at substantially higher rates than human subjects in comparable DA and EADA environments. However, truth-telling is not uniformly aligned with formal strategy-proofness across all mechanisms: TTC, despite being strategy-proof, does not always elicit higher truth-telling than EADA. These results suggest that matching theory provides a useful but incomplete guide for designing institutions in LLM-agent markets.

Bullet Summary

The paper investigates the effectiveness of standard matching mechanisms when employed by LLM agents acting as delegated decision-makers in allocation problems.
It compares decentralized free-negotiation markets with centralized, mechanism-based markets using several representative matching mechanisms in controlled one-to-one environments.
Mechanism-based markets generally outperform free negotiation markets in terms of stability and efficiency in LLM-agent settings.
LLM agents report preferences truthfully at substantially higher rates than human subjects in comparable Deferred Acceptance (DA) and Efficiency-Adjusted Deferred Acceptance (EADA) environments.
Truth-telling by LLM agents does not perfectly correspond to the formal strategy-proofness of mechanisms; for example, the Top Trading Cycles (TTC) mechanism, though strategy-proof, does not consistently induce more truthful reporting than EADA.
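
For readers unfamiliar with the mechanisms named above, the following is a minimal sketch of proposer-proposing Deferred Acceptance in a one-to-one market, together with a blocking-pair check for stability. It assumes complete, strict preference lists; the agent names and preferences are invented for illustration, and the paper's actual experimental environments are not described in this record.

```python
def deferred_acceptance(proposer_prefs, receiver_prefs):
    """Proposer-proposing DA; returns {proposer: receiver}.

    Assumes every receiver ranks every proposer (complete strict lists).
    """
    # rank[r][p] = position of proposer p in receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}
    held = {}                      # receiver -> tentatively held proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        if next_choice[p] >= len(proposer_prefs[p]):
            continue               # list exhausted: p stays unmatched
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = held.get(r)
        if current is None:
            held[r] = p            # r tentatively accepts its first proposal
        elif rank[r][p] < rank[r][current]:
            held[r] = p            # r trades up and releases the old proposer
            free.append(current)
        else:
            free.append(p)         # r rejects p, who proposes again later
    return {p: r for r, p in held.items()}


def blocking_pairs(matching, proposer_prefs, receiver_prefs):
    """List (proposer, receiver) pairs who both prefer each other to their match."""
    partner = {r: p for p, r in matching.items()}
    pairs = []
    for p, prefs in proposer_prefs.items():
        for r in prefs:
            if matching.get(p) == r:
                break              # receivers after this are less preferred
            q = partner.get(r)
            if q is None or receiver_prefs[r].index(p) < receiver_prefs[r].index(q):
                pairs.append((p, r))
    return pairs


proposers = {"a": ["x", "y", "z"], "b": ["x", "z", "y"], "c": ["y", "x", "z"]}
receivers = {"x": ["b", "a", "c"], "y": ["a", "c", "b"], "z": ["a", "b", "c"]}
matching = deferred_acceptance(proposers, receivers)
print(matching, blocking_pairs(matching, proposers, receivers))
```

A matching is stable when the blocking-pair list is empty. Truth-telling in the paper's sense means an agent submits its real preference list to such a mechanism rather than a strategically altered one.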

Dynamic Trust-Aware Sparse Communication Topology for LLM-Based Multi-Agent Consensus

arXiv preprint · Trust and Identity · Agent-to-Agent Communication · Governance and Policy

Wanshuang Gou, Zihan Liu

Published 2026-06-01

Venue: arXiv

Open Source Record

Abstract

Large language model-driven multi-agent systems enhance the reliability of complex reasoning tasks through multi-round deliberation, role specialization, and cross-validation. However, existing multi-agent debate and collaboration frameworks typically adopt fully connected communication, causing the number of messages, token costs, and end-to-end latency to grow approximately quadratically with the number of agents; although fixed sparse topologies reduce overhead, they cannot adapt communication relationships to different task instances or intermediate reasoning states, making them prone either to preserving low-value interactions or to losing critical error-correction information. To address this problem, this paper proposes DySCo (Dynamic Sparse Consensus), a dynamic trust-aware sparse consensus mechanism. In each round of reasoning, DySCo estimates the value of communication edges based on agent reliability, answer divergence, and task relevance, and selects a small number of high-value edges for message exchange under budget constraints; it then aggregates the answers of different agents through dynamic trust weights and terminates the discussion early once consensus stabilizes. This mechanism replaces universal broadcasting with on-demand communication, thereby reducing communication overhead while preserving essential cross-validation information. We further present analyses of communication complexity and consensus stability, and evaluate the performance of DySCo on mathematical reasoning, logical reasoning, and factual question-answering tasks.

Bullet Summary

Large language model (LLM)-based multi-agent systems improve complex reasoning tasks through multi-round deliberation, role specialization, and cross-validation among agents.
Fully connected communication topologies cause quadratic growth in message volume, token costs, and latency as the number of agents increases, limiting scalability.
Existing fixed sparse topologies reduce communication overhead but lack adaptability, leading to preservation of low-value interactions or loss of critical error-correction information.
DySCo (Dynamic Sparse Consensus) introduces a dynamic, trust-aware sparse communication mechanism that selects high-value communication edges each reasoning round based on agent reliability, answer divergence, confidence, and task relevance under communicat...
Agents send compressed critique messages focusing on key reasons and counterexamples to reduce token consumption and latency.
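
The selection and aggregation steps described above can be sketched as follows. This is an illustrative approximation, not the DySCo algorithm itself: the edge score (sender trust times answer divergence times task relevance), the function names, and all numeric values are assumptions, since the record does not give the paper's scoring function or trust update rule.

```python
from collections import defaultdict
from itertools import permutations


def select_edges(answers, trust, relevance, budget):
    """Pick up to `budget` directed edges (sender, receiver) with the highest score.

    Hypothetical score: trust[sender] * divergence * relevance[sender], where
    divergence is 1 if the two agents currently disagree and 0 otherwise.
    """
    scored = []
    for sender, receiver in permutations(answers, 2):
        divergence = 1.0 if answers[sender] != answers[receiver] else 0.0
        score = trust[sender] * divergence * relevance[sender]
        if score > 0:
            scored.append((score, sender, receiver))
    # Highest score first; ties broken by agent name for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(s, r) for _, s, r in scored[:budget]]


def weighted_consensus(answers, trust):
    """Trust-weighted vote; returns (winning answer, its share of total trust)."""
    totals = defaultdict(float)
    for agent, answer in answers.items():
        totals[answer] += trust[agent]
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


answers = {"a1": "42", "a2": "42", "a3": "41", "a4": "42"}
trust = {"a1": 0.9, "a2": 0.6, "a3": 0.3, "a4": 0.5}
relevance = {agent: 1.0 for agent in answers}

edges = select_edges(answers, trust, relevance, budget=2)
best, share = weighted_consensus(answers, trust)
print(edges, best, round(share, 3))
```

With four agents, full broadcast would use 12 directed edges per round, while this toy budget keeps only the two edges that carry a trusted dissenting view to the outlier. An early-stopping rule could compare `share` against a threshold across rounds; the threshold would be a further assumption.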

Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure

arXiv preprint · Governance and Policy · Trust and Identity · Agent-to-Agent Communication

Jun He, Deying Yu

Published 2026-06-01

Venue: arXiv

Open Source Record

Abstract

For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, externally defined, and deterministic semantics. Classical theory has extensively parameterized network timing, communication topologies, and failure domains, but this participant model has remained comparatively fixed. The integration of autonomous reasoning engines, stochastic model-driven agents, and policy-driven actors into cloud control planes, incident response systems, and financial infrastructure challenges the universality of this assumption. These agents often produce divergent reasoning paths, distinct operational traces, and heterogeneous internal representations while achieving semantically equivalent and correct outcomes. In this paper, we introduce Post-Deterministic Distributed Systems (PDDS) as a research and engineering model for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist. We show that classical distributed computing models form a zero-ambiguity special case of this participant-general model. We do not argue that deterministic systems disappear; rather, deterministic execution can no longer serve as the universal participant assumption for autonomous infrastructure. Finally, we outline five architectural pillars of post-deterministic infrastructure: Protocol-Driven Development, Verifiable Agentic Infrastructure, Autonomous State Control Planes, Semantic Quorum Assurance, and Epistemic State Replication. Epistemic State Replication extends persistence and consistency models from data visibility to knowledge visibility, enabling agentic memory, Verifiable Semantic Rollback, and coherence across reasoning participants. We also define a taxonomy of failure classes that arise in this setting.

Bullet Summary

Classical distributed systems assume deterministic participants with identical state transitions, but modern autonomous agents exhibit stochastic reasoning and varied traces while achieving correct outcomes.
Post-Deterministic Distributed Systems (PDDS) generalize the participant model to include deterministic, stochastic, agentic, policy-driven, and human actors, focusing on semantic coherence rather than strict deterministic equivalence.
PDDS formalizes admissible participant behavior as sets of semantically correct actions and reasoning traces, enabling multiple distinct but valid behaviors under identical conditions.
Semantic quorum assurance replaces classical consensus by certifying that actions from diverse agents semantically align with intent, evidence, and policy before execution.
Five architectural pillars support PDDS: Protocol-Driven Development, Verifiable Agentic Infrastructure, Autonomous State Control Planes, Semantic Quorum Assurance, and Epistemic State Replication.

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

arXiv preprint · Trust and Identity · Agent-to-Agent Communication · Orchestration Risk

Jiaming Qu, Lucheng Fu, Yibo Hu

Published 2026-06-01

Venue: arXiv

Open Source Record

Abstract

Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, but it remains unclear whether these revisions help correct mistakes as often as they introduce new errors. In this paper, we conduct a controlled study in which an LLM first answers a question, then sees simulated peer responses before making a final decision. We manipulate two social cues: consensus structure and authority labels assigned to peers, and measure how they influence beneficial and harmful revisions. Across four open-weight LLMs and seven QA datasets, we find that peer agreement makes it much easier to mislead initially correct models than to correct initially wrong ones. Authority labels make models more likely to choose the endorsed answer, regardless of whether it is correct. More concerningly, generic reasoning interventions such as chain-of-thought and reflection do not reliably reduce harmful revision while preserving beneficial revision. These findings suggest that multi-agent LLM systems should verify peer answers rather than simply aggregate them.

Bullet Summary

Large language models (LLMs) in multi-agent systems are prone to conformity, often revising correct answers toward incorrect peer responses, making them easier to mislead than to correct.
A controlled study manipulated two social cues (peer consensus structure and authority labels) and found that peer agreement leads to higher rates of harmful revisions (correct to wrong) than beneficial revisions (wrong to correct).
Authority labels increase LLM conformity to endorsed answers regardless of correctness, raising concerns about bias and error propagation in multi-agent settings.
Generic reasoning interventions such as Chain-of-Thought prompting and reflect-then-revise do not reliably reduce harmful revision while preserving beneficial revision; CoT can even reduce helpful corrections.
Experimental results across four open-weight LLMs and seven QA datasets showed variation in revision behavior, with some models more prone to harmful conformity than others.
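
The distinction between harmful and beneficial revision can be made concrete with a small sketch. It assumes each trial is recorded as a pair (initially correct, finally correct); the function name and the toy records are illustrative and do not come from the paper, whose exact metric definitions are not given in this record.

```python
def revision_rates(records):
    """Compute (harmful_rate, beneficial_rate) from (initial_ok, final_ok) pairs.

    harmful_rate    = share of initially correct answers revised to wrong
    beneficial_rate = share of initially wrong answers revised to correct
    """
    initially_correct = [final for initial, final in records if initial]
    initially_wrong = [final for initial, final in records if not initial]
    harmful = sum(1 for final in initially_correct if not final)
    beneficial = sum(1 for final in initially_wrong if final)
    harmful_rate = harmful / len(initially_correct) if initially_correct else 0.0
    beneficial_rate = beneficial / len(initially_wrong) if initially_wrong else 0.0
    return harmful_rate, beneficial_rate


# Toy trials (made up): four initially correct, four initially wrong.
trials = [
    (True, False), (True, True), (True, False), (True, True),
    (False, True), (False, False), (False, False), (False, False),
]
harmful_rate, beneficial_rate = revision_rates(trials)
print(f"harmful: {harmful_rate:.2f}, beneficial: {beneficial_rate:.2f}")
```

Conditioning each rate on the model's initial correctness is what allows the two directions to be compared; a single overall accuracy change would hide cases where harmful and beneficial revisions cancel out.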

Cross-Vendor Sola ISPM Benchmark: Evaluating Agentic AI for Federated Identity Security Reasoning

Semantic Scholar · scholarly work · Benchmarks and Evaluation · Trust and Identity

Eden Yavin, Gal Engelberg, Konstantin Koutsyi, Leon Goldberg, Gali Bar-On

Published 2026-06-01

Venue: Semantic Scholar

Open Source Record

Abstract

The rapid proliferation of multi-cloud and SaaS platforms has transformed Identity Security Posture Management (ISPM) into a fundamentally cross-vendor challenge: critical misconfigurations and privilege escalation paths increasingly span multiple identity providers, infrastructure layers, and authentication systems never designed to interoperate. Existing evaluations focus on isolated single-platform environments and provide no means to assess whether an AI agent can reason across these fragmented boundaries. To address this gap, we introduce the Cross-Vendor Sola ISPM Benchmark, a production-grade benchmark of 50 data-grounded tasks requiring multi-hop entity resolution and cross-system correlation across eight integrated enterprise platforms including AWS, Okta, Azure AD, and Google Workspace. We also contribute an evaluation framework measuring not only final answer correctness but also evidentiary grounding, structural join fidelity, retrieval quality, and SQL equivalence. We evaluate the Sola AI Agent across five context configurations, from no injected metadata to full schema, graph, and retrieval context, using three frontier LLMs. Results show that structured relational context improves answer correctness by approximately 34% in relative terms and reduces exploration queries by approximately 70% across all tested models, with the largest gains driven by cross-vendor graph topology. Our findings indicate that frontier LLMs possess substantial latent security reasoning capability, but reliable cross-vendor identity analysis is fundamentally constrained by the availability of explicit relational context for entity resolution and evidentiary grounding. Under full context, the best configuration achieves 78% answer correctness while reducing complete failure to 4%.

Bullet Summary

Identity Security Posture Management (ISPM) increasingly requires reasoning across multiple cloud and SaaS vendors due to misconfigurations and privilege escalation paths spanning disparate systems.
Current evaluations focus on isolated single-platform environments and lack mechanisms to assess AI agents' ability to reason across fragmented, cross-vendor identity systems.
The paper introduces the Cross-Vendor Sola ISPM Benchmark, consisting of 50 data-grounded tasks requiring multi-hop entity resolution and cross-system correlation across eight enterprise platforms including AWS, Okta, Azure AD, and Google Workspace.
It contributes an evaluation framework that assesses not only final answer correctness but also evidentiary grounding, structural join fidelity, retrieval quality, and SQL equivalence.
The Sola AI Agent is evaluated with three frontier large language models (LLMs) across five context configurations, ranging from no injected metadata to full schema, graph, and retrieval context.