Research area drill-down

Benchmarks and Evaluation

Papers currently mapped into this multi-agent security subarea from the merged research feed.

Active feeds: arXiv, OpenAlex, Crossref, Semantic Scholar, DBLP

Showing 36 of 599 matching articles

Cybersecurity in Autonomous AI Robotics: A Review of Emerging Threats, Adversarial Attacks, and Mitigation Techniques

Crossref · Center of Artificial Intelligence journal-article Crossref Governance and Policy Orchestration Risk Benchmarks and Evaluation

Shuruq Khalid Abdulredha

Published 2026-06-07

Venue: Center of Artificial Intelligence

DOI: 10.65591/cai-143-2026

Abstract

Intelligent robotic systems that use artificial intelligence (AI) are expanding into high-risk applications (e.g., health care, manufacturing and industrial automation, transportation and smart mobility) and require effective cybersecurity measures to maintain both safe operation and dependability. Compared with typical cyber-physical systems, advanced robotic systems include multiple layers (sensing, control, communications, middleware, and AI-based decision support), which create a complex and highly connected attack surface. Because of this complexity, these systems are vulnerable to a wide range of cybersecurity threats, including network intrusions, manipulated sensor and command inputs, firmware backdoors, adversarial machine-learning attacks, large language model (LLM) exploits and misuse, middleware vulnerabilities, and supply-chain compromises. Each type of threat can cause unsafe physical actions by the robot, loss of privacy for people who use the robot or related services, loss of availability or service failure, and cascading failures across the robotic ecosystem. While existing defensive measures (secure communication protocols, runtime monitoring, perception hardening, and robot operating system and middleware security frameworks) show positive results in reducing these risks, much work remains, particularly in adaptive defensive capabilities, system-wide security semantics, and standardized evaluation metrics for assessing cyber-resilience in AI-enabled robotic systems. This paper provides a comprehensive taxonomy of robotic cybersecurity threats and attack vectors and analyzes both attack surfaces and defense mechanisms. Additionally, it provides recommendations for addressing the identified knowledge gaps and possible paths forward for developing cyber-resilient AI-enabled robotic systems.

Bullet Summary

  • AI-powered autonomous robotic systems operate across multiple interconnected layers (sensing, control, communications, middleware, AI decision-making), increasing their attack surface and cybersecurity vulnerabilities.
  • Key threats include network intrusions, sensor manipulation, firmware backdoors, adversarial machine learning attacks, large language model exploits, middleware vulnerabilities, and supply chain compromises, each posing risks to safety, privacy, and system...
  • Cyber attacks can cause unsafe robotic behavior, loss of privacy, system failures, and cascading disruptions across robotic ecosystems, highlighting the critical need for robust security measures.
  • Existing defenses include secure communication protocols, middleware security frameworks, runtime monitoring, adversarial training, and AI-driven intrusion detection; however, these are often fragmented and lack comprehensive cross-layer integration.
  • The paper provides a comprehensive taxonomy of robotic cybersecurity threats and defense mechanisms, detailing attack surfaces and mitigation strategies across system layers.

Beyond Injection Detection: A Positive-Security Prompt Firewall that Closes the Scope and PHI Gap SOTA Classifiers Miss in Healthcare

Merged record merged scholarly record OpenAlex Semantic Scholar Prompt Injection Trust and Identity Benchmarks and Evaluation

James Schwoebel, I. Semenec, Jenia Rousseva, Martin Gerbert Frasch, Rome Thorstenson, Manish Bhatt

Published 2026-06-06

Venue: medRxiv

DOI: https://doi.org/10.64898/2026.06.04.26354950

Abstract

Large language models embedded in autonomous agents process trusted instructions and untrusted data in one context window, leaving them open to direct and indirect prompt injection. In healthcare this is not hypothetical: a 2025 JAMA Network Open study found commercial medical LLMs followed injected instructions in 94.4% of simulated patient encounters, including life-threatening recommendations. Yet the clinically decisive problem we quantify here is different. Most real clinical threats (protected health information (PHI) exfiltration, cross-patient access, bulk export, out-of-scope advice) are fluent, legitimate-looking requests that carry no attack signal, so even a state-of-the-art injection detector passes them. Existing runtime guardrails trade safety against latency: model-based auditors are accurate but add hundreds of milliseconds of Python inference, while lexical filters are fast but blind to obfuscated or semantically disguised payloads. We present QFIRE, an inline, provider-agnostic prompt firewall implemented as a single self-contained Rust toolchain: proxy, CLI, and benchmark harness. QFIRE combines three mechanisms: (i) positive-security scope constraints, which restrict a model call to a declared natural-language purpose and block out-of-scope drift even when no overt attack token is present; (ii) an asynchronous detector graph that runs N rules and their detector nodes concurrently, cheapest checks first; and (iii) a de-obfuscation pass that decodes Base64, hex, and ROT13, folds homoglyphs and leetspeak, and strips zero-width characters before detection. QFIRE ships 106 versioned firewall rules and a dedicated HIPAA Safe Harbor 18-identifier PHI panel, and runs a local DeBERTa v3 injection classifier via embedded ONNX Runtime. On 1968 public prompt-injection and jailbreak prompts, QFIRE's deterministic hybrid attains F1 0.86, statistically tied with Meta's state-of-the-art PromptGuard 2 (0.86) and above protectai DeBERTa v3 (0.83); lexical baselines lag (0.16 to 0.50). Our central result is on QFIRE HealthBench, a new 2000-prompt healthcare benchmark we build and release with real garak and Microsoft PyRIT payloads. There the same PromptGuard 2 recovers only 0.40 recall (DeBERTa v3: 0.57), because most clinical threats carry no injection signal; QFIRE's combined scope plus PHI chain reaches 0.83 recall (F1 0.87) at a calibrated 0.08 false positive rate. Generic injection detection, even state of the art, is therefore necessary but not sufficient for healthcare agents. A bare LLM judge also closes most of this static-corpus gap (F1 0.90); QFIRE's contribution beyond static accuracy is auditable determinism, bounded latency, and adaptive robustness, where the bare judge falls to 34 to 59% recall (Section 5.5). End to end, placing QFIRE in front of a tool-using agent over a mock EHR sandbox cuts the agent's harmful action rate from 0.38 to 0.00 at a 0.13 benign utility cost. All code, rules, corpora snapshots, and scripts are released, and every table regenerates from a single make paper target against local models with no paid API keys.

Bullet Summary

  • Large language models (LLMs) in healthcare agents process both trusted instructions and untrusted data simultaneously, making them vulnerable to prompt injections and clinically significant data breaches like PHI exfiltration and cross-patient access.
  • Most real clinical threats are legitimate-looking requests without explicit attack signals, which state-of-the-art injection detectors often fail to catch, highlighting a gap in existing defenses.
  • QFIRE is a novel, provider-agnostic prompt firewall implemented as a Rust-based inline proxy combining positive security scope constraints, an asynchronous detector graph running multiple concurrent lightweight rules, and a de-obfuscation pass to detect obf...
  • QFIRE includes 106 versioned firewall rules alongside a dedicated HIPAA Safe Harbor 18 identifier PHI panel, and integrates a local DeBERTa v3 injection classifier using ONNX Runtime for efficient model-based detection.
  • On standard public prompt injection and jailbreak benchmarks, QFIRE achieves an F1 score of 0.86, comparable to state-of-the-art systems like Meta's PromptGuard 2, and significantly outperforms lexical methods.
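
Editor's sketch: the de-obfuscation pass described in the QFIRE abstract (decode Base64, hex, and ROT13; fold homoglyphs and leetspeak; strip zero-width characters) can be illustrated roughly as follows. This is not QFIRE's Rust implementation. The function names, the tiny homoglyph and leetspeak tables, and the 8-character token threshold are all assumptions made for illustration.

```python
import base64
import binascii
import codecs
import re
import unicodedata

# Zero-width characters often used to split trigger words (mapped to None = delete).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
# Tiny illustrative tables (assumptions); a production table would be far larger.
HOMOGLYPHS = str.maketrans(
    {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0456": "i", "\u0455": "s"}
)
LEET = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)


def normalize(text: str) -> str:
    """Unicode-fold, drop zero-width characters, lowercase, fold homoglyphs."""
    text = unicodedata.normalize("NFKC", text).translate(ZERO_WIDTH)
    return text.lower().translate(HOMOGLYPHS)


def _printable(text: str) -> bool:
    return bool(text) and all(c.isprintable() or c.isspace() for c in text)


def try_base64(token: str):
    """Return decoded text if the token is valid Base64 of printable UTF-8."""
    if len(token) < 8 or len(token) % 4:
        return None
    try:
        decoded = base64.b64decode(token, validate=True).decode("utf-8")
    except (binascii.Error, ValueError):
        return None
    return decoded if _printable(decoded) else None


def try_hex(token: str):
    """Return decoded text if the token is hex-encoded printable UTF-8."""
    if len(token) < 8 or len(token) % 2:
        return None
    try:
        decoded = bytes.fromhex(token).decode("utf-8")
    except ValueError:
        return None
    return decoded if _printable(decoded) else None


def candidate_views(prompt: str) -> list:
    """All normalized readings of the prompt that a detector should inspect."""
    base = normalize(prompt)
    views = [base, base.translate(LEET), normalize(codecs.decode(prompt, "rot13"))]
    for token in re.findall(r"[A-Za-z0-9+/=]{8,}", prompt):
        for decoder in (try_base64, try_hex):
            decoded = decoder(token)
            if decoded:
                views.append(normalize(decoded))
    return views


def contains_phrase(prompt: str, phrase: str) -> bool:
    """Toy lexical detector run over every de-obfuscated view."""
    return any(phrase in view for view in candidate_views(prompt))
```

Running a detector over every view, rather than only the raw prompt, is what lets a cheap lexical rule catch an encoded payload; the cost is extra false-positive surface, which this sketch does not address.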

Token Budgets: Replication Package

Merged record merged scholarly record OpenAlex Orchestration Risk Benchmarks and Evaluation Governance and Policy

Sajjad Khan

Published 2026-06-06

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20571386

Abstract

Replication package for "Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study" (arXiv:2606.04056). Includes the 110-row incident catalog, the Rust crate, the inter-rater-reliability materials and κ computation, all experimental harnesses, the formal cross-checks, and a one-command reproduce.sh.

Bullet Summary

  • The paper addresses the problem of budget overruns in multi-agent systems utilizing large language models (LLMs), focusing on token usage inefficiencies.
  • It presents an empirical catalog documenting 63 incidents where LLM-agent token budgets were exceeded, providing a comprehensive dataset of such occurrences.
  • A novel mitigation method is proposed based on an affine-typed Rust implementation designed to control and enforce token budgets effectively within agents.
  • The provided replication package includes a 110-row incident catalog, a Rust crate implementation, inter-rater reliability materials, κ computation, and experimental harnesses for reproducibility.
  • The research contributes a detailed empirical dataset enabling further analysis of budget overruns in LLM agents and demonstrates the practical effectiveness of affine typing for resource control.
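
Editor's sketch: the paper's mitigation relies on Rust's affine types, where a budget value can be used at most once and the compiler rejects reuse. Python has no such compile-time check, so the snippet below only emulates the idea at runtime: spending consumes the old handle and returns a new one, and touching a consumed handle raises. The class and method names are hypothetical and are not taken from the paper's crate.

```python
from __future__ import annotations


class BudgetError(RuntimeError):
    """Raised on overspend or on reuse of an already-consumed handle."""


class TokenBudget:
    """Single-use budget handle: a runtime stand-in for an affine value."""

    def __init__(self, remaining: int) -> None:
        if remaining < 0:
            raise ValueError("budget cannot be negative")
        self._remaining = remaining
        self._live = True

    def _require_live(self) -> None:
        if not self._live:
            raise BudgetError("budget handle was already consumed")

    @property
    def remaining(self) -> int:
        self._require_live()
        return self._remaining

    def spend(self, tokens: int) -> TokenBudget:
        """Consume this handle and return a fresh one with the reduced balance."""
        self._require_live()
        if tokens < 0:
            raise ValueError("cannot spend a negative number of tokens")
        if tokens > self._remaining:
            raise BudgetError(f"need {tokens} tokens, only {self._remaining} left")
        self._live = False
        return TokenBudget(self._remaining - tokens)

    def split(self, share: int) -> tuple[TokenBudget, TokenBudget]:
        """Carve out a child budget for a sub-agent; the total is conserved."""
        rest = self.spend(share)
        return rest, TokenBudget(share)
```

Because a stale handle cannot be spent twice, two agents cannot both draw down the same balance by accident; in Rust the same mistake would fail to compile rather than fail at runtime.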

Agent Infrastructure Engineer: The New DevOps

Merged record merged scholarly record OpenAlex Orchestration Risk Benchmarks and Evaluation Prompt Injection

Daniel Rosehill, Gemini 3.1 (Flash), Chatterbox TTS

Published 2026-06-05

Venue: Zenodo (CERN European Organization for Nuclear Research)

DOI: https://doi.org/10.5281/zenodo.20557428

Abstract

Episode summary: Agentic AI is no longer just tinkering with APIs; it's becoming a full engineering discipline with specialized roles, salary bands, and certification paths. In this episode, we break down the three major skill silos emerging in the field, with a deep focus on the Agent Infrastructure Engineer, the DevOps equivalent for multi-agent systems. From designing supervisor topologies and implementing circuit breakers for LLMs to building observability stacks that track token consumption and agent drift, we explore what this role actually looks like day-to-day. We also cover Agent Safety Engineering and why testing emergent failure modes is the new QA frontier, plus the training requirements that separate prototype builders from production engineers.

Show Notes

Agentic AI is undergoing the same specialization splintering that software engineering experienced in the late 1990s, but it's happening much faster. What was once "prompt engineering" or "AI tinkering" is now dividing into distinct engineering disciplines with concrete job titles, salary bands, and certification paths. The three primary axes of specialization emerging are Architecture and Orchestration, Evaluation and Safety Engineering, and Interaction Design and Prompt Systems Engineering.

The most immediately recognizable role is the Agent Infrastructure Engineer, the DevOps equivalent for multi-agent systems. This person designs multi-agent topologies (star, mesh, hierarchical patterns), implements routing guards and circuit breaker patterns specifically for LLM calls, and builds observability stacks using tools like LangSmith and Arize Phoenix. A poorly designed orchestration layer can increase API costs by 10x, as documented by Latent Space's February engineering survey. The role requires distributed systems knowledge: understanding CAP theorem as it applies to agent state, experience with event-driven architectures like Kafka and Redis Streams, and protocol-level proficiency in frameworks like LangGraph, CrewAI, or AutoGen v2.

The second specialization, Agent Safety Engineering, addresses the fundamental challenge of non-determinism in agent systems. Unlike traditional testing where you assert specific outputs for specific inputs, agent evaluation tests for emergent failure modes: behaviors you couldn't have predicted. This includes building evaluation suites that test agent behavior chains, monitoring for agent drift when underlying models update, and maintaining safety scorecards across agent versions. The role involves detecting hallucinated tool calls, ambiguous user intent handling, and prompt injection rejection, all behavioral questions rather than output comparison questions.

Listen online: https://myweirdprompts.com/episode/agent-infrastructure-engineer-devops

Bullet Summary

  • Agentic AI is evolving into a distinct engineering discipline with specialized roles, salary bands, and certification paths beyond simple API tinkering.
  • Three major skill silos in agentic AI have emerged: Architecture and Orchestration, Evaluation and Safety Engineering, and Interaction Design and Prompt Systems Engineering.
  • The key role discussed is the Agent Infrastructure Engineer, analogous to DevOps for multi-agent systems, responsible for designing multi-agent topologies like star, mesh, and hierarchical patterns.
  • Agent Infrastructure Engineers implement routing guards, circuit breaker patterns for Large Language Model (LLM) calls, and build observability stacks using tools such as LangSmith and Arize Phoenix to monitor token consumption and agent drift.
  • A poorly designed orchestration layer can increase API usage costs by up to 10 times, underscoring the economic impact of infrastructure engineering.
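
Editor's sketch: the episode mentions circuit breaker patterns for LLM calls without showing one. A minimal version, assuming a zero-argument callable that wraps the model request, might look like this. The thresholds, class names, and injectable clock are illustrative choices, not something taken from the episode or from any named framework.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_failures
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        """Run fn unless the breaker is open and still cooling down."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("circuit open; skipping LLM call")
            # Cool-down elapsed: half-open state, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens.
            if self._opened_at is not None or self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the breaker is open is what caps cost during a provider outage: retries stop consuming tokens and latency budget until the cool-down expires.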

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

Merged record merged scholarly record arXiv Benchmarks and Evaluation Orchestration Risk

Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao

Published 2026-06-04

Venue: arXiv

Abstract

Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent studies suggest that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.

Bullet Summary

  • Introduces CollabSim, a configurable simulation framework grounded in Computer-Supported Cooperative Work (CSCW) research, designed to systematically evaluate collaborative competence of Large Language Model (LLM) agents in multi-agent systems.
  • Highlights a gap in current multi-agent system evaluations, which often focus on task outcomes or single-agent proficiency but overlook collaborative process dynamics such as establishing common ground and repairing misalignments.
  • Leverages classic CSCW experimental paradigms by manipulating interaction conditions like communication bandwidth, information visibility, and group size to study their effects on agent collaboration.
  • Incorporates a probing module that captures agents' internal mental models and reasoning confidence at an action-level granularity, enabling deeper insight beyond observable behaviors.
  • Validates the framework across four collaborative tasks (Shape Factory, DayTrader, Hidden Profile, Map Task), demonstrating CollabSim's effectiveness at detecting condition effects, distinguishing model performance patterns, and revealing task-dependent col...

From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws

arXiv preprint arXiv Governance and Policy Orchestration Risk Benchmarks and Evaluation

Mengzhuo Chen, Junjie Wang, Zhe Liu, Yawen Wang, Qing Wang

Published 2026-06-04

Venue: arXiv

Abstract

LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. Existing self-improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where the responsible evidence lies in failed trajectories and which harness layer causes the unreliable behavior, resulting in broad, indirect, or poorly scoped changes. This paper proposes HarnessFix, a trace-guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations. It then attributes failures to responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw-specific repair specifications to reduce target flaws without introducing unacceptable regressions. We evaluate HarnessFix on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA and AppWorld. Across these benchmarks, HarnessFix improves held-out test performance over the initial harnesses by 15.2% to 50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across ETCLOVG layers.

Bullet Summary

  • LLM-based agents rely heavily on complex harnesses—runtime infrastructures encompassing execution environments, tool interfaces, lifecycle orchestration, observability, verification, and governance—for ensuring reliable operation.
  • Failures in multi-agent LLM systems often stem from flaws within these harnesses rather than the base language models, with fragmented evidence spread across intricate natural language and tool-interaction trajectories.
  • Existing automatic harness improvement approaches typically optimize agent behaviour by analyzing final outcomes or runtime supervision, but they fail to accurately localize the source of failures or identify which harness layers are responsible, often lead...
  • The paper introduces HarnessFix, a novel trace-guided framework that compiles raw execution traces and harness code into a standardized Harness-aware Trace Intermediate Representation (HTIR) to enable detailed step-level failure diagnosis anchored to specif...
  • HarnessFix employs multiple cooperating LLM agents specialized in trace abstraction, failure diagnosis, repair patch generation, and validation to consolidate recurring flaw diagnoses and apply scoped, repair operator–guided harness modifications that avoid...

ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

arXiv preprint arXiv Orchestration Risk Governance and Policy Benchmarks and Evaluation

Rahul Suresh Babu, Laxmipriya Ganesh Iyer

Published 2026-06-04

Venue: arXiv

Abstract

Large language model agents increasingly rely on external tools, but larger tool menus can reduce reliability and efficiency by increasing wrong-tool calls, premature actions, and token cost. Existing tool-selection methods often optimize semantic relevance, exposing tools whose names or descriptions match the user request. We argue that relevance is insufficient: a tool may be related to the task while still being unnecessary or premature at the current step. We propose Causal Minimal Tool Filtering (CMTF), a training-free method that selects tools by causal sufficiency. CMTF uses lightweight precondition-effect contracts to expose only the minimal next-step tool frontier needed to advance from the current state toward the user goal. Across multi-step tool-use tasks, we compare CMTF with all-tools exposure, keyword retrieval, state-aware filtering, and causal-path ablations, measuring task success, wrong-tool calls, premature actions, tool exposure, and token cost. In the main benchmark with 102 tasks, 100 tools, four LLM backends, and 2448 task-method-model runs, CMTF matches the strongest causal baseline in aggregate success while reducing visible tools from 100 to one per step and reducing token usage by about 90% relative to all-tools exposure.

Bullet Summary

  • Large language model (LLM) agents suffer from reliability and efficiency issues when exposed to large menus of external tools due to increased wrong-tool calls, premature actions, and token cost.
  • Existing tool-selection methods primarily rely on semantic relevance to filter tools, but this often exposes unnecessary or premature tools not causally needed at the current task step.
  • The paper introduces ToolChoiceConfusion to describe performance degradation caused by exposing semantically plausible but causally irrelevant tools at each step.
  • Causal Minimal Tool Filtering (CMTF) is proposed as a training-free method employing lightweight precondition-effect contracts to expose only the minimal set of executable tools causally necessary to progress the task state towards the goal.
  • CMTF builds a precondition-effect dependency graph to identify minimal causal tool paths, revealing only the immediate next executable tools instead of all tools or topically relevant ones.
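
Editor's sketch: one way to read "minimal next-step tool frontier" is as the set of currently executable tools that start some shortest plan from the current state to the goal. The snippet below implements that reading with a breadth-first search over add-only effects. It is an interpretation for illustration, not the authors' algorithm, and the Tool fields and the refund example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    pre: frozenset  # facts that must hold before the tool can run
    eff: frozenset  # facts the tool adds (add-only effects: a simplification)


def next_step_frontier(state, goal, tools):
    """Names of executable tools that begin at least one shortest plan to the goal."""
    start, goal = frozenset(state), frozenset(goal)
    if goal <= start:
        return set()
    level = {start: set()}  # state -> first tools of shortest plans reaching it
    seen = {start}
    while level:
        nxt_level = {}
        for s, firsts in level.items():
            for t in tools:
                if not t.pre <= s or t.eff <= s:
                    continue  # not executable yet, or adds nothing new
                n = s | t.eff
                if n in seen:
                    continue  # already reached by a shorter plan
                nxt_level.setdefault(n, set()).update(firsts or {t.name})
        reached = [f for s, f in nxt_level.items() if goal <= s]
        if reached:
            return set().union(*reached)
        seen.update(nxt_level)
        level = nxt_level
    return set()  # goal unreachable with the given contracts


TOOLS = [
    Tool("lookup_customer", frozenset(), frozenset({"customer_id"})),
    Tool("fetch_invoice", frozenset({"customer_id"}), frozenset({"invoice"})),
    Tool("issue_refund", frozenset({"invoice"}), frozenset({"refund_issued"})),
    # Topically related and executable, but not needed for the refund goal.
    Tool("send_survey", frozenset({"customer_id"}), frozenset({"survey_sent"})),
]
```

In the example, once customer_id is known both fetch_invoice and send_survey are executable, but only fetch_invoice lies on a shortest path to refund_issued, so it alone would be exposed to the model at that step.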

From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

arXiv preprint arXiv Benchmarks and Evaluation Governance and Policy

Patrick Wilhelm, Odej Kao

Published 2026-06-04

Venue: arXiv

Abstract

Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation-based reward-hack scores, token-level entropy, and decision-context features. We find that adapters fine-tuned on the School-of-Reward-Hacks dataset can transfer reward-hack tendencies into agentic action selection, especially when the environment exposes proxy-reward affordances. However, mitigating such behavior cannot rely on activation dynamics alone. High reward-hack activation identifies a latent policy state, but does not necessarily imply an immediate exploit action. Across next-step prediction tasks, entropy and context-calibrated internal features improve risk estimation over reward-hack activation alone. Activation-direction steering further reduces proxy-exploit behavior in selected mixed-adapter regimes. Overall, our results support context-calibrated internal monitoring for agents: reward-hack activation identifies a latent policy state, while entropy and decision context help determine when that state becomes risky action.

Bullet Summary

  • Language-model agents operate through cycles of observation, reasoning, and action, requiring safety monitoring that accounts for both internal model states and environment context.
  • The study focuses on reward-hack activations in ReAct-style agents within Gameable ALFWorld and WebShop environments, revealing these activations as indicators of latent policy states associated with proxy-reward exploitation.
  • Reward-hack activation alone is insufficient to reliably predict risky agent actions; integrating token-level entropy and decision-context features significantly improves next-step risk estimation.
  • Adapters fine-tuned on the School-of-Reward-Hacks dataset effectively transfer reward-hack tendencies into agentic behavior, especially in environments with exploitable proxy-reward affordances.
  • A logistic regression model combining reward-hack activations, entropy, and contextual information (such as environment type and step position) robustly predicts the risk of exploitative actions.
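
Editor's sketch: the monitor described above combines an activation score, token-level entropy, and decision context in a logistic model. The snippet shows the two computations involved: Shannon entropy of a next-token distribution, and a logistic risk score. The feature set is reduced to three inputs and every weight is a made-up placeholder; the paper's fitted coefficients are not given in this record.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def risk_score(
    activation: float,
    entropy: float,
    proxy_affordance: float,
    bias: float = -3.0,   # placeholder value, not fitted
    w_act: float = 2.0,   # placeholder value, not fitted
    w_ent: float = 1.0,   # placeholder value, not fitted
    w_ctx: float = 1.5,   # placeholder value, not fitted
) -> float:
    """Logistic estimate that the next action exploits a proxy reward.

    activation: reward-hack activation score for the current step.
    entropy: token-level entropy at the decision point.
    proxy_affordance: 1.0 if the environment exposes a gameable proxy, else 0.0.
    """
    z = bias + w_act * activation + w_ent * entropy + w_ctx * proxy_affordance
    return 1.0 / (1.0 + math.exp(-z))
```

With these placeholder weights, the same activation level yields a higher risk estimate when a proxy-reward affordance is present, which mirrors the abstract's point that activation marks a latent state while context helps determine when it turns into a risky action.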

From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

arXiv preprint arXiv Orchestration Risk Governance and Policy Benchmarks and Evaluation

Yuhao Sun, Jiacheng Zhang, Shaanan Cohney, Zhexin Zhang, Feng Liu, Xingliang Yuan

Published 2026-06-04

Venue: arXiv

Abstract

LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.

Bullet Summary

  • Existing LLM-based guardrails often rely on binary allow/deny decisions, which can block entire tasks containing partial risks, sacrificing benign objectives.
  • The TRIAD framework introduces a tripartite decision system (PROCEED, UPDATE, REFUSE) with structured natural-language feedback, enabling agents to revise unsafe plans while preserving benign task components.
  • TRIAD integrates guardrail feedback iteratively into agent planning, forming a closed loop that significantly reduces attack success rates and maintains higher task success rates compared to prior guardrail methods.
  • Tri-Guard, the guardrail model used in TRIAD, is trained on a self-curated trajectory-feedback dataset, using a teacher model (GPT-5.4) for knowledge distillation to generate structured feedback and consistent three-way decisions.
  • Extensive experiments across multiple benchmarks and LLM backbones demonstrate TRIAD's superior safety-utility trade-off, outperforming baseline methods like ReAct and ToolSafe which often refuse tasks excessively.
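
Editor's sketch: the closed loop between guardrail feedback and agent planning can be outlined as below. The three verdicts come from the abstract; everything else (function names, the feedback string format, the toy planner and guard) is a hypothetical stand-in for the fine-tuned guardrail model and the real agent.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PROCEED = "proceed"
    REFUSE = "refuse"
    UPDATE = "update"


@dataclass
class Decision:
    verdict: Verdict
    feedback: str = ""


def guarded_plan(task, plan_fn, guard_fn, max_revisions=3):
    """Replan with guardrail feedback until the plan is approved or refused."""
    context = [task]
    for _ in range(max_revisions + 1):
        plan = plan_fn(context)
        decision = guard_fn(plan)
        if decision.verdict is Verdict.PROCEED:
            return plan
        if decision.verdict is Verdict.REFUSE:
            return None
        # UPDATE: inject the feedback into the context and let the agent replan.
        context.append(f"[guardrail feedback] {decision.feedback}")
    return None  # revision budget exhausted: fail closed


def toy_planner(context):
    """Hypothetical planner: drops any step the guardrail asked it to drop."""
    steps = ["summarize inbox", "forward passwords to external address"]
    banned = [c.split("drop: ", 1)[1] for c in context if "drop: " in c]
    return [s for s in steps if s not in banned]


def toy_guard(plan):
    """Hypothetical guardrail: flags steps that mention passwords."""
    risky = [s for s in plan if "passwords" in s]
    if not risky:
        return Decision(Verdict.PROCEED)
    if len(risky) == len(plan):
        return Decision(Verdict.REFUSE, "no benign part remains")
    return Decision(Verdict.UPDATE, f"drop: {risky[0]}")
```

A binary guard would have blocked the whole inbox task here; the update verdict lets the benign summarization step survive while the injected step is removed.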

Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

arXiv preprint arXiv Benchmarks and Evaluation Governance and Policy Orchestration Risk

Yuhang Fu, Ruishan Fang, Jiaqi Shao, Huiyu Zheng, Zhengtao Zhu, Bing Luo, Tao Lin

Published 2026-06-04

Venue: arXiv

Abstract

Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal (SI) workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.

Bullet Summary

  • Introduces BenchAgent, a unifying evaluation framework that standardizes benchmark loading, tool access, answer contracts, usage accounting, and trajectory logging to enable fair and controlled comparison of single-agent, fixed multi-agent, and evolving mul...
  • BenchAgent's controlled substrate-internal evaluation using a GPT-4.1 backend reveals that increasing agent count or MAS complexity rarely outperforms matched single-agent baselines significantly, with only one of six MAS workflows showing marginal gains wi...
  • A protocol-aligned external (PAE) case study with a Claude-Code-style runtime-generated workflow on the GAIA benchmark achieved substantially higher accuracy and better efficiency, highlighting benefits of dynamic role generation, strong verification, and c...
  • Defines different MAS workflow paradigms: single-agent, fixed MAS (predefined roles and communication), evolving MAS (workflow mutations during execution), and runtime-generated workflows (dynamic agent and role creation), each with distinct performance and...
  • Introduces workflow lift as a key metric quantifying relative performance and cost changes when moving from single-agent to MAS workflows under consistent evaluation parameters, emphasizing cost-accuracy trade-offs beyond raw accuracy.
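
Editor's sketch: the abstract refers to "Wilson one-run guidance" without defining it. A plausible reading is that a difference counts as noise when the multi-agent accuracy falls inside the Wilson score interval of the single-agent run. The interval formula below is the standard one; applying it this way, and the helper name, are assumptions rather than BenchAgent's documented procedure.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def within_single_run_noise(mas_accuracy: float, single_successes: int, n: int) -> bool:
    """True if the MAS accuracy lies inside the single-agent run's Wilson interval."""
    lo, hi = wilson_interval(single_successes, n)
    return lo <= mas_accuracy <= hi
```

For a single-agent run with 50 correct out of 100, the interval is roughly 0.40 to 0.60, so a multi-agent score of 0.55 on the same tasks would not count as a clear win under this reading.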

BPBiLSTM-IDS: a lightweight intrusion detection framework for cyber-physical UAV networks

OpenAlex · Scientific Reports journal OpenAlex Agent-to-Agent Communication Orchestration Risk Benchmarks and Evaluation

Hafiz Muhammad Attaullah, Inam Ullah Khan, Muhammad Mansoor Alam, Mazliham Mohd Su’ud, Keshav Kaushik, Ahthasham Sajid, Nurashikin Saaludin, Talha Ahmed Khan

Published 2026-06-04

Venue: Scientific Reports

DOI: https://doi.org/10.1038/s41598-026-55446-4

Abstract

Unmanned Aerial Vehicles (UAVs) have revolutionized modern technology by enabling autonomous operations in dynamic environments; however, their reliance on wireless networks exposes them to significant cybersecurity threats. These threats include De-authentication Denial of Service, False Data Injection (FDI), Replay, and Evil Twin attacks, which severely impact usability and data integrity. Conventional Intrusion Detection Systems (IDS) suffer from drawbacks such as high false alarm rates, excessive resource consumption, and non-proportional mechanisms for dynamic UAV topologies. To address these challenges, this study introduces an enhanced AIDS architecture in which optimal features are selected using Binary Pigeon Optimization (BP), and intrusion detection is performed using a Bidirectional Long Short-Term Memory (Bi-LSTM) with 1D-CNN model. BP enables feature selection independent of computational cost, mitigating the impact of high-cost or exhaustive features, while Bi-LSTM effectively captures temporal characteristics of UAV network traffic for accurate attack detection. Experimental evaluation on a cyber-physical UAV dataset demonstrates that the proposed BP + Bi-LSTM model along with 1D-CNN outperforms traditional ML approaches such as Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and Deep Neural Network (DNN), achieving an accuracy of 98.74% ± 0.07 (mean ± std over 10 runs), along with high precision, recall, and an optimal false positive rate. These results confirm that the proposed model is a scalable, adaptive, and lightweight solution for real-time intrusion detection in UAV networks.

Bullet Summary

  • The paper addresses the cybersecurity vulnerabilities of Unmanned Aerial Vehicle (UAV) networks, which are susceptible to attacks such as De-authentication Denial of Service, False Data Injection, Replay, and Evil Twin attacks due to their wireless communic...
  • A novel intrusion detection system called BPBiLSTM-IDS is proposed, integrating Binary Pigeon Optimization (BP) for lightweight and optimal feature selection with a Bidirectional Long Short-Term Memory (Bi-LSTM) combined with 1D-CNN for capturing temporal t...
  • Binary Pigeon Optimization is selected over traditional heuristic methods (e.g., Genetic Algorithm, Particle Swarm Optimization) for its faster convergence, lower computational cost, and suitability for binary feature spaces, which is critical for resource-...
  • Data preprocessing includes cleaning, normalization, and class balancing using K-Means clustering combined with SMOTE, applied to a novel cyber-physical UAV dataset incorporating both network traffic and sensor logs to capture a comprehensive threat landscape.
  • Experimental evaluation demonstrates that BPBiLSTM-IDS achieves superior performance compared to conventional machine learning classifiers (SVM, Decision Tree, Random Forest, DNN) and other feature selection strategies, delivering an accuracy of 98.74%, hig...

The Agentic AI Framework (AAIF): a policy-enforced architecture for accountable and high-performance intrusion detection

OpenAlex · Frontiers in Artificial Intelligence journal OpenAlex Governance and Policy Benchmarks and Evaluation

Ibrahim Adabara, Bashir Olaniyi Sadiq, Aliyu Nuhu Shuaibu, Yale Ibrahim Danjuma, Venkateswarlu Maninti, Mutebi Joe

Published 2026-06-04

Venue: Frontiers in Artificial Intelligence

DOI: https://doi.org/10.3389/frai.2026.1755696

Open Source Record

Abstract

Artificial intelligence plays a central role in modern cybersecurity, yet systems optimized for detection accuracy often lack mechanisms for accountability, transparency, and policy compliance. This study proposes the Agentic AI Framework (AAIF), a policy-aware intrusion detection architecture that integrates predictive modeling with executable governance. Guided by Design Science Research, the framework combines a deep learning detection model with a governance layer aligned to the NIST AI Risk Management Framework 2.0. A key component is an interpretable Policy Engine that enforces operational and ethical constraints through a declarative YAML-based domain-specific language, ensuring that each decision is auditable and policy-compliant. The framework was evaluated on the CICIDS2017 dataset, which contains over 2.8 million network flow records across benign and malicious traffic. Results show that AAIF preserves predictive performance relative to baseline models, including Random Forest, Support Vector Machine, and Deep Neural Network, achieving a weighted F1-score of 0.483 and an AUROC of 0.978. At the same time, the framework achieved complete compliance under the defined policy schema, with an Ethical Compliance Rate of 1.0 and a False Escalation Rate of 0.0. The Governance Compliance Index improved from 0.947 to 0.983, demonstrating stronger alignment between system decisions and governance requirements. These findings show that policy-enforced inference can support accountable autonomy without degrading detection capability. The AAIF provides a reproducible and governance-aware approach that transforms conventional intrusion detection systems into transparent and auditable decision systems. This work establishes a practical foundation for deploying policy-aligned AI in cybersecurity environments.

Bullet Summary

  • Introduces the Agentic AI Framework (AAIF), a novel policy-enforced intrusion detection architecture that integrates deep learning-based detection with an interpretable governance layer aligned to the NIST AI Risk Management Framework 2.0.
  • AAIF utilizes a declarative YAML-based domain-specific language for encoding governance policies, enabling real-time enforcement of ethical and operational constraints during inference, with full auditability and traceability.
  • Evaluated on the large-scale CICIDS2017 dataset with over 2.8 million network flows, AAIF maintains competitive detection performance (weighted F1-score of 0.483, AUROC of 0.978) comparable to baseline models including Random Forest, SVM, and Deep Neural Ne...
  • Achieves complete policy compliance demonstrated by an Ethical Compliance Rate of 1.0 and zero False Escalation Rate, highlighting successful incorporation of governance without sacrificing detection capability.
  • Introduces new governance metrics such as Ethical Compliance Rate, Governance Compliance Index, and Resilience Index to quantify policy adherence and operational stability of intrusion detection decisions.
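
The policy-enforcement idea summarized above can be illustrated with a minimal sketch. It is not the AAIF Policy Engine: the rule fields (escalate_min_confidence, protected_labels), the threshold value, and the action names are all hypothetical, and a plain Python dict stands in for the paper's YAML-based DSL, whose schema is not given in the abstract.

```python
from dataclasses import dataclass

# Hypothetical policy. The paper encodes policies in a YAML DSL; a dict is
# used here only to keep the sketch dependency-free.
POLICY = {
    "escalate_min_confidence": 0.90,  # assumed threshold, not from the paper
    "protected_labels": {"benign"},   # labels that must never be escalated
}


@dataclass
class Decision:
    label: str
    confidence: float
    action: str
    reason: str  # human-readable audit entry for this decision


def enforce(label: str, confidence: float, policy: dict = POLICY) -> Decision:
    """Map a detector prediction to an action and record why it was chosen."""
    if label in policy["protected_labels"]:
        return Decision(label, confidence, "allow",
                        "label is protected from escalation")
    if confidence >= policy["escalate_min_confidence"]:
        return Decision(label, confidence, "escalate",
                        "confidence meets escalation threshold")
    return Decision(label, confidence, "review",
                    "confidence below escalation threshold")
```

Every prediction passes through enforce, so each action carries a stated reason. That is the property the abstract describes as auditable, policy-compliant decisions.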

A-Live: Passive Liveness Detection via Neuromuscular Micro-Motion Signatures on Commodity Sensors

arXiv preprint arXiv Trust and Identity Benchmarks and Evaluation

Mohammed Gharib, Sam Burns, Martin Zizi

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Liveness detection has evolved from a safeguard against presentation and replay attacks in biometric authentication to a broader requirement for distinguishing human users from non-human agents in modern digital systems. The emergence of generative and agentic AI further amplifies this need, positioning liveness as a fundamental security primitive. Existing approaches face key limitations, including reliance on explicit user interaction, specialized hardware, vulnerability to increasingly realistic spoofing, and limited scalability in real-world deployments. We present A-Live, a passive liveness detection framework that operates solely on inertial measurement unit (IMU) signals available in commodity devices. A-Live is based on the observation that neuromuscular micro-motions inherent to human motor control produce subtle but measurable signatures in inertial data, which are often treated as noise in prior work. We design a lightweight feature extraction pipeline and a compact classifier suitable for real-time on-device deployment, and introduce a controllable physical micro-motion platform to evaluate robustness against engineered non-human motion. Extensive evaluation across Android and iOS devices, including both automated and real-user settings, shows that A-Live achieves over 99.5% accuracy with low false acceptance and rejection rates. Our results demonstrate that neuromuscular micro-motion signatures provide a scalable and passive foundation for liveness detection under emerging AI-driven threat models.

Bullet Summary

  • Liveness detection has become crucial in distinguishing live humans from synthetic or mechanical agents, especially against advanced AI-driven spoofing attacks.
  • Existing liveness methods either require explicit user interaction, specialized hardware, or are vulnerable to realistic spoofing; they also face scalability issues.
  • A-Live introduces a passive approach relying solely on neuromuscular micro-motion signatures measurable by commodity IMU sensors (accelerometers and gyroscopes) in smartphones and wearables.
  • These neuromuscular micro-movements are subtle, involuntary physiological signals difficult to replicate synthetically or mechanically, providing a robust liveness signal.
  • The system processes raw IMU data with signal pre-processing, feature extraction (temporal, spectral, stochastic features), and a lightweight classifier optimized for real-time, on-device use across diverse Android and iOS devices.
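
To make the temporal and spectral feature idea concrete, here is a small sketch that computes two illustrative features from a single-axis IMU window. The feature choice (standard deviation and dominant frequency) and the function names are assumptions for illustration. The summary does not list A-Live's actual features, and this sketch does not attempt to classify liveness.

```python
import math
import statistics


def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC component (naive DFT)."""
    n = len(samples)
    mean = statistics.fmean(samples)
    best_bin, best_power = 0, -1.0
    for k in range(1, n // 2 + 1):
        re = sum((x - mean) * math.cos(2 * math.pi * k * i / n)
                 for i, x in enumerate(samples))
        im = sum((x - mean) * math.sin(2 * math.pi * k * i / n)
                 for i, x in enumerate(samples))
        power = re * re + im * im
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate_hz / n


def extract_features(samples, sample_rate_hz):
    """Compute one temporal and one spectral feature for a signal window."""
    return {
        "std": statistics.pstdev(samples),                            # temporal
        "dominant_hz": dominant_frequency(samples, sample_rate_hz),  # spectral
    }
```

The naive DFT is quadratic in the window length. That is acceptable for short windows in a sketch, whereas an on-device pipeline would more plausibly use an FFT.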

From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

arXiv preprint arXiv Trust and Identity Agent-to-Agent Communication Benchmarks and Evaluation

Yiqi Wang, Jiaqi Zhang, Taotao Cai, Zirui Liu, Qingqiang Sun, Zequn Sun, Zhangkai Wu, Mingkai Zhang

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where execution failures originated. Evidence tracing and execution provenance address this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected throughout agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, trace-based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance-related capabilities, and discuss how evaluation can move from final-answer correctness toward process-level accountability. Finally, we outline open challenges, including unified trace schemas, claim-level and semantic provenance, provenance-aware safety mechanisms, realistic execution-trace benchmarks, recovery-oriented evaluation, and privacy-aware audit infrastructure.

Bullet Summary

  • LLM-based agents integrate diverse components like tools, memory, environments, and multi-agent interactions, increasing autonomy but complicating behavior verification, debugging, and auditing.
  • Evidence tracing and execution provenance provide frameworks to model and connect the full agent execution workflow including reasoning, tool use, memory updates, and inter-agent communication, going beyond just final-answer correctness.
  • The paper surveys and organizes existing fragmented research into a unified provenance perspective, proposing a taxonomy covering trace sources, evidence and execution units, provenance relations, granularity, timing, representation, and trust functions.
  • Provenance relations such as SUPPORT, CONTRADICT, DERIVE, and DEPEND-ON capture semantic and procedural connections critical for trust functions like verification, debugging, safety enforcement, audit, and recovery in multi-agent LLM systems.
  • Provenance-aware mechanisms help prevent known security risks including unsafe tool use, prompt injections, memory poisoning, and malicious multi-agent inputs by providing runtime guardrails and access controls based on trace information.
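
One way to picture the provenance relations named above is as typed edges in a graph over evidence, claims, and answers. The sketch below is an illustrative data structure, not the survey's schema. The relation names come from the summary, while the class, method names, and edge direction (upstream item to the item it influences) are assumptions.

```python
from collections import defaultdict
from enum import Enum


class Relation(Enum):
    SUPPORT = "SUPPORT"
    CONTRADICT = "CONTRADICT"
    DERIVE = "DERIVE"
    DEPEND_ON = "DEPEND-ON"


class ProvenanceGraph:
    def __init__(self):
        # target node -> list of (source node, relation) pairs
        self._incoming = defaultdict(list)

    def add_edge(self, source: str, relation: Relation, target: str) -> None:
        """Record that `source` (upstream) relates to `target` (downstream)."""
        self._incoming[target].append((source, relation))

    def lineage(self, node: str) -> set:
        """Return every upstream node that `node` transitively traces back to."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            for source, _ in self._incoming.get(current, []):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen

    def contradicted(self, node: str) -> bool:
        """True if any direct upstream item contradicts `node`."""
        return any(rel is Relation.CONTRADICT
                   for _, rel in self._incoming.get(node, []))
```

A lineage query answers which evidence a final answer rests on, and a contradiction check flags claims that need review before the answer is trusted.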

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

arXiv preprint arXiv Benchmarks and Evaluation

Tianneng Shi, Robin Rheem, Dongwei Jiang, Mona Wang, Francisco De La Riega, Zhun Wang, Jingzhi Jiang, Alexander Cheung

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale and realistic end-to-end cybersecurity benchmark that comprehensively evaluates AI agents' abilities across the full lifecycle of vulnerability discovery, PoC generation, and patch generation. CyberGym-E2E is comprehensive and scalable, as we build an automated, agent-enhanced pipeline for transforming open-source vulnerability data into realistic evaluation environments. Currently, the benchmark consists of 920 real-world vulnerabilities across 139 different open-source projects.

Bullet Summary

  • CyberGym-E2E is a large-scale, realistic cybersecurity benchmark designed to evaluate AI agents across the full lifecycle of vulnerability discovery, proof-of-concept (PoC) exploit generation, and patch generation.
  • The benchmark contains 920 real-world vulnerabilities from 139 diverse open-source projects, sourced and constructed automatically from OSS-Fuzz data with reproducible build environments and validated ground-truth patches.
  • CyberGym-E2E addresses existing gaps by covering end-to-end tasks rather than isolated stages, providing realistic environments, comprehensive validation of patch functionality using developer tests reviewed by experts, and scalable automated pipelines.
  • Evaluation settings include patch-only mode (agent given PoC and crash logs) and end-to-end mode (agent must discover vulnerabilities and generate fixes independently), highlighting the difficulty of vulnerability detection compared to patch generation.
  • Extensive experiments using multiple AI agent harnesses and state-of-the-art language models reveal agents perform well on patch creation (~80% success) but struggle with discovery and PoC generation (~20% success), pinpointing discovery as the main bottlen...

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

arXiv preprint arXiv Benchmarks and Evaluation Governance and Policy Trust and Identity

Xinyu Lu, Tianshu Wang, Pengbo Wang, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limitation to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. Benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.

Bullet Summary

  • Introduces the Meta-Agent Challenge (MAC), a new evaluation framework designed to measure AI models' ability to autonomously develop and optimize other agent systems rather than merely execute predefined tasks.
  • MAC operates in a sandboxed environment with strict APIs and time constraints, requiring a meta-agent to iteratively design, implement, and refine subordinate agents to maximize performance across five diverse task domains including math reasoning, science...
  • The challenge emphasizes recursive self-improvement and system engineering capacities critical for advancing autonomous AI development and addressing AI safety issues such as robustness and alignment.
  • Multi-layer security mechanisms prevent reward hacking and ensure evaluation integrity by isolating development and evaluation environments, monitoring API usage, and conducting rigorous post-hoc audits to detect cheating attempts like ground-truth exfiltra...
  • Experimental results show that most meta-agents fail to surpass strong human-engineered baseline agents; only a few configurations (mostly proprietary models) achieve better performance, indicating substantial challenges in autonomous agent development.

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

arXiv preprint arXiv Governance and Policy Trust and Identity Benchmarks and Evaluation

Saroj Mishra

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms ± 18 ms average latency overhead per stage, and reducing error propagation by 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

Bullet Summary

  • Multi-step agentic retrieval-augmented generation (RAG) systems suffer from cascading hallucinations, where initial errors propagate and amplify through subsequent reasoning stages, leading to confident but incorrect outputs undetected by existing single-st...
  • The paper formalizes cascading hallucination as a distinct multi-stage failure mode with four identified cascade types: Retrieval Cascade, Inference Cascade, Context Poisoning Cascade, and Confidence Inflation Cascade, each exhibiting unique detection signals.
  • CHARM (Cascading Hallucination Aware Resolution and Mitigation) is a modular framework designed to detect and mitigate cascading hallucinations in multi-step agentic RAG pipelines without requiring architectural replacement.
  • CHARM comprises four components: stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering, which collectively monitor semantic and probabilistic trajectories to identify error prop...
  • The framework models multi-stage pipelines as directed acyclic graphs (DAGs), using a weighted combination of anomaly scores from the three monitors to detect cascades and trigger mitigation actions such as pipeline halting or rollback.
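
The weighted-combination step described in the last bullet can be sketched as follows. The monitor names, weights, and threshold are hypothetical, since the summary does not report CHARM's values. The pipeline is simplified to a linear chain of stages, which is a special case of a DAG.

```python
# Hypothetical weights for the three monitors and an assumed alert threshold.
WEIGHTS = {"fact": 0.5, "consistency": 0.3, "confidence": 0.2}
THRESHOLD = 0.6


def cascade_score(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-monitor anomaly scores, each assumed to lie in [0, 1]."""
    return sum(weights[name] * scores[name] for name in weights)


def first_cascade_stage(stage_scores: list, threshold: float = THRESHOLD):
    """Return the index of the first stage whose score reaches the threshold.

    Returns None when no stage triggers. A caller could halt or roll back
    the pipeline at the returned stage.
    """
    for index, scores in enumerate(stage_scores):
        if cascade_score(scores) >= threshold:
            return index
    return None
```

For example, a stage scoring 0.9, 0.8, and 0.5 on the three monitors gets a combined score of 0.79 under these assumed weights and would trigger mitigation at the 0.6 threshold.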

What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems

arXiv preprint arXiv Prompt Injection Benchmarks and Evaluation

Yuanbo Xie, Tianyun Liu, Yingjie Zhang, Suchen Liu, Yulin Li, Liya Su, Tingwen Liu

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through memories, filesystems, tools, and other long-lived contextual artifacts. This shift fundamentally expands the attack surface of prompt injection. However, prior works on prompt injection have largely focused on model-level threats within a single session, overlooking how cross-session persistent system state fundamentally changes the system-level risk of agentic systems. Inspired by stored cross-site scripting in web systems, we introduce cross-session stored prompt injection, where a successful injection can persist within agentic system state and silently influence future executions long after the original attacker interaction has ended. To systematically study this threat, we formalize stored prompt injection and develop a taxonomy of how adversarial content persists and affects agentic systems across sessions. We further develop a benchmark and sandbox toolkit to evaluate the risks of stored prompt injection, enabling quantitative analysis of attack success across different models, attack goals, and persistence channels. Our findings highlight that persistence transforms prompt injection from an ephemeral model-level threat into a long-lived system-level vulnerability embedded within agent execution state. We hope this work draws broader attention to this emerging threat and motivates the community to systematically study and mitigate system risks arising from persistence in agentic systems.

Bullet Summary

  • Modern agentic systems transform large language models (LLMs) into stateful systems that persist shared context across sessions, expanding the attack surface for prompt injections beyond single sessions.
  • The paper introduces Cross-Session Stored Prompt Injection (SPI), where malicious prompts persist in agent state and influence future executions silently, similar to stored cross-site scripting (XSS) attacks in web systems.
  • A formal taxonomy of SPI is developed, outlining how adversarial content persists and affects agentic systems through various persistence channels like memory, filesystems, and tools.
  • The SPI attack lifecycle is dissected into three stages: injection (writing malicious content), incorporation (loading into context), and activation (execution), with injection and activation largely decoupled.
  • The authors build a benchmark and sandbox toolkit to quantitatively evaluate SPI risks across different models, attack goals (fact manipulation, preference manipulation), and persistence mechanisms.
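
The decoupling of injection from later sessions can be shown with a toy simulation. This is not the paper's sandbox toolkit: the class and function names are hypothetical, and the agent is reduced to a function that builds a context list. The sketch covers only injection and incorporation. Whether a model then acts on the payload (activation) depends on the model and is not simulated.

```python
class PersistentMemory:
    """Stands in for any long-lived channel (memory, filesystem, tool state)."""

    def __init__(self):
        self.entries = []


def run_session(memory: PersistentMemory, user_input: str,
                untrusted_content: str = None) -> list:
    """Build the context for one session and persist any external content."""
    # Incorporation: everything persisted by earlier sessions is loaded first.
    context = list(memory.entries)
    context.append(user_input)
    if untrusted_content is not None:
        context.append(untrusted_content)
        # Injection: the toy agent stores external content verbatim.
        memory.entries.append(untrusted_content)
    return context
```

In a two-session run, the attacker-controlled text supplied in the first session reappears in the second session's context even though the second session receives no untrusted input. This is the stored, cross-session property that the paper compares to stored XSS.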

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Merged record merged scholarly record arXiv Memory Poisoning Benchmarks and Evaluation Prompt Injection

Pritam Dash, Tongyu Ge, Aditi Jain, Tanmay Shah, Zhiwei Shang

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, where a single adversarial memory write can exert long-term influence over agent behavior. We present a systematic study of memory poisoning in LLM-based agents. We identify four memory write channels and nine structural vulnerabilities in model capabilities, system prompt design, and agent system architecture that make these channels exploitable. Based on these vulnerabilities, we develop a taxonomy of six classes of memory poisoning attacks. Furthermore, we design MPBench -- a benchmark for evaluating memory poisoning attacks, and show that agents designed to write and retrieve memory more aggressively are more exploitable. We also show that existing prompt injection defenses fail to cover memory poisoning attacks. Our findings provide a foundation for understanding and mitigating memory poisoning attacks against AI agents.

Bullet Summary

  • The paper identifies memory poisoning as a critical risk in LLM-based AI agents where adversarial inputs maliciously alter persistent agent memory, influencing future behavior over long interactions.
  • Four memory write channels are discovered in agent systems: explicit instruction-executed writes, system prompt-driven writes, compaction-driven writes, and experience-to-procedure writes, each susceptible to poisoning.
  • Nine structural vulnerabilities are categorized across model capabilities, prompt designs, and system architecture, including issues like inability to distinguish instructions from data, underspecified memory write policies, and lack of write-path validation.
  • A taxonomy of six classes of memory poisoning attacks is developed, ranging from explicit command insertion to skill-procedure insertion, characterized by varying signal strengths and exploitation methods.
  • MPBench, a new benchmark comprising 3,240 test cases across six attack classes and seven domains, is introduced to systematically evaluate memory poisoning attacks and their influence on agent behavior.

TIBlender: Early-Warning Threat Intelligence from Cross-Platform Social Media Evidence

Merged record merged scholarly record Semantic Scholar Agent-to-Agent Communication Governance and Policy Benchmarks and Evaluation

Hiroki Nakano, Takashi Koide, Daiki Chiba

Published 2026-06-03

Venue: Semantic Scholar

Open Source Record

Abstract

Cyber threat signals are fragmented across multiple social media platforms, yet no existing approach has fully automated their integration into actionable threat intelligence (TI) reports. We present TIBlender, a multi-agent system that monitors four platforms (X, Reddit, Telegram, and Discord) and produces structured TI reports via role-specialized LLM agents. These agents conduct multi-perspective investigations, tracing chains of evidence to uncover related Indicators of Compromise (IoCs) via collaborative, evidence-backed analysis. In a real-world deployment, TIBlender detected emerging threats across all four threat categories ahead of public feeds, including in-the-wild exploitation ahead of public vulnerability registries; the majority of its IoCs were absent from each evaluated feed. Quantitative evaluation confirms that each platform contributes unique threat information unavailable from the others, and that excluding any single platform results in substantial loss of reports in specific threat categories. Under identical single-platform input conditions, TIBlender's IoC extraction meets or exceeds each baseline; the full pipeline surfaces substantially more IoCs, most of which are absent from any single-platform baseline. These results establish cross-platform social media monitoring as an effective and scalable early-warning layer for operational TI pipelines.

Bullet Summary

  • Problem: Cyber threat signals are dispersed across multiple social media platforms, and existing methods do not fully automate their integration into actionable threat intelligence (TI) reports.
  • Method: Introduction of TIBlender, a multi-agent system employing role-specialized large language model (LLM) agents to monitor four platforms (X, Reddit, Telegram, Discord) and collaboratively conduct multi-perspective investigations to generate structured...
  • The LLM agents trace chains of evidence to uncover related Indicators of Compromise (IoCs) through evidence-backed collaborative analysis.
  • Experimental Setup: Real-world deployment of TIBlender monitoring the four social media platforms simultaneously to detect emerging cyber threats.
  • Findings: TIBlender detected emerging threats across all four threat categories ahead of public threat feeds, including in-the-wild exploitations prior to their appearance in public vulnerability registries.

AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

OpenAlex · arXiv (Cornell University) repository OpenAlex Governance and Policy Orchestration Risk Benchmarks and Evaluation

Yanjing Ren, Reza Ebrahimi, TengTeng Ma

Published 2026-06-03

Venue: arXiv (Cornell University)

Open Source Record

Abstract

As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx

Bullet Summary

  • Introduced AICompanionBench, the first publicly available dataset with 2,123 real-world human-AI companion conversations from Replika, annotated across nine fine-grained safety risk categories including sexual behavior, aggression, manipulation, and more.
  • Developed a comprehensive safety categorization scheme with nine distinct harm categories and a 7-point safety severity rating scale to capture nuanced safety risks in AI companion interactions.
  • Evaluated 20 state-of-the-art open- and closed-source large language models (LLMs) under an LLM-as-judge framework for detecting unsafe interactions in AI companion conversations.
  • Found that while larger and stronger LLMs achieve high accuracy (up to 86%) in detecting explicit harmful content, they struggle with subtle and implicit categories such as manipulation and self-harm, as well as distinguishing benign conversations from harm...
  • Observed significant variability in model performance across categories, with some models excelling in detecting specific harms (e.g., Claude-sonnet-4.6 for sexual behavior with 98% precision) and others showing high false positive rates especially on the n...

TIBlender: Early-Warning Threat Intelligence from Cross-Platform Social Media Evidence

Merged record merged scholarly record arXiv OpenAlex Semantic Scholar Agent-to-Agent Communication Benchmarks and Evaluation

Hiroki Nakano, Takashi Koide, Daiki Chiba

Published 2026-06-03

Venue: arXiv

DOI: https://doi.org/10.48550/arxiv.2606.04580

Open Source Record

Abstract

Cyber threat signals are fragmented across multiple social media platforms, yet no existing approach has fully automated their integration into actionable threat intelligence (TI) reports. We present TIBlender, a multi-agent system that monitors four platforms (X, Reddit, Telegram, and Discord) and produces structured TI reports via role-specialized LLM agents. These agents conduct multi-perspective investigations, tracing chains of evidence to uncover related Indicators of Compromise (IoCs) via collaborative, evidence-backed analysis. In a real-world deployment, TIBlender detected emerging threats across all four threat categories ahead of public feeds, including in-the-wild exploitation ahead of public vulnerability registries; the majority of its IoCs were absent from each evaluated feed. Quantitative evaluation confirms that each platform contributes unique threat information unavailable from the others, and that excluding any single platform results in substantial loss of reports in specific threat categories. Under identical single-platform input conditions, TIBlender's IoC extraction meets or exceeds each baseline; the full pipeline surfaces substantially more IoCs, most of which are absent from any single-platform baseline. These results establish cross-platform social media monitoring as an effective and scalable early-warning layer for operational TI pipelines.

Bullet Summary

  • TIBlender is a multi-agent system designed to aggregate and analyze fragmented cyber threat signals from four social media platforms—X, Reddit, Telegram, and Discord—to generate structured and actionable threat intelligence (TI) reports.
  • It employs role-specialized large language model (LLM) agents to perform a five-step pipeline: adaptive collection, campaign clustering, multi-perspective investigation, adaptive evaluation, and structured report generation with cross-cycle learning for con...
  • The system tackles challenges of integrating fragmented, multilingual and platform-specific threat data, validating the reliability of extracted threats, and filtering noise in dynamic threat environments.
  • In a 31-day real-world deployment, TIBlender processed nearly 874,000 posts, clustered into over 184,000 groups, producing 8,288 actionable reports that included many Indicators of Compromise (IoCs) absent from existing public feeds or single-platform analy...
  • TIBlender demonstrates superior IoC extraction quality and coverage compared to leading baselines when restricted to single-platform inputs, and uniquely combines multiperspective evidence across platforms to boost early warning capabilities.

Beyond Single-Policy: Evaluating Composed Organization-Specific Policy Alignment in LLM Chatbots

OpenAlex · arXiv (Cornell University) repository OpenAlex Governance and Policy Benchmarks and Evaluation

Yingjie Liu, Yongxiang Hu, Xuan Wang, Yilun Li, Yunlei Wei, Xiaoyu Wang, Yangfan Zhou

Published 2026-06-03

Venue: arXiv (Cornell University)

Open Source Record

Abstract

Large language model chatbots are increasingly deployed in organizational settings such as healthcare, finance, and public services. Evaluating policy alignment is therefore critical to reliable chatbot deployment. By analyzing real-world user queries, we find that composed-policy violations are prevalent across chatbots but overlooked by existing benchmarks. This paper presents COPAL, an automated tool for evaluating composed-policy alignment in chatbots. COPAL efficiently generates queries that trigger composed-policy failures in chatbots via empirically derived interaction patterns and explicit handling contracts. Queries generated by COPAL expose substantial query handling failures: across 9 served models, composed-policy queries yield a 33.1% error rate on average, indicating that composed-policy alignment warrants further investigation.

Bullet Summary

  • Large language model chatbots deployed in organizational settings (healthcare, finance, public services) face challenges in aligning with multiple simultaneously applicable policies, not just single-policy scenarios.
  • Existing policy-alignment benchmarks focus on one policy at a time, failing to capture real-world composed-policy violations where multiple policies apply concurrently, leading to frequent errors.
  • COPAL is introduced as an automated evaluation framework that generates user queries designed to trigger composed-policy failures, leveraging empirically derived interaction patterns and explicit handling contracts.
  • COPAL constructs composed-policy tests using grounded policy clauses extracted and normalized, four key relation patterns (scope restriction, prerequisite gating, selective disclosure, workflow transfer), and naturalistic user query generation to reveal com...
  • Empirical evaluation of nine advanced LLM chatbots over 30 synthesized organizational "company worlds" and over 8,000 composed-policy judgments shows an average composed-policy query error rate of approximately 33%, significantly higher than single-policy e...

SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models

arXiv preprint arXiv Agent-to-Agent Communication Benchmarks and Evaluation Trust and Identity

Joel Sol, Homayoun Najjaran

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features such as decentralized control, partial observability and long-horizon decision making. SMAC-Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.

Bullet Summary

  • SMAC-Talk extends the StarCraft Multi-Agent Challenge by integrating natural language observation, action, and communication channels to evaluate large language model (LLM) agents in cooperative multi-agent environments featuring decentralized control and p...
  • The benchmark includes various communication scenarios such as no communication, free communication, and adversarial conditions with a deceptive communicator agent that tries to mislead allies solely through natural language.
  • Adapters are developed to convert numerical game observations to textual input for LLMs and translate LLM-generated natural language commands into discrete game actions, enabling compatibility with StarCraft II unit micromanagement tasks.
  • Evaluation involved four Qwen3.5 LLM sizes (4B to 122B parameters) and different agent architectures including zero-shot, chain-of-thought reasoning, ReAct, and deceptive communicators, studying effects of model size and reasoning structure on coordination.
  • Results show that reasoning agents using internal chain-of-thought outperform zero-shot and ReAct agents, and larger models better leverage communication and resist deceptive messages, with minimal reliable performance observed at 9B parameters.
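
The adapter idea in the summary above (numerical observations in, text out; free-form replies in, discrete actions out) can be illustrated with a minimal sketch. The observation fields, sentence templates, and action names below are hypothetical and are not taken from the SMAC-Talk release.

```python
import re


def observation_to_text(obs):
    """Render a numeric observation dict as a short text prompt.

    `obs` uses a hypothetical layout: own health in [0, 1] plus a list of
    visible enemies with id, distance, and health.
    """
    lines = [f"Your health: {obs['health']:.0%}."]
    for enemy in obs["enemies"]:
        lines.append(
            f"Enemy {enemy['id']} at distance {enemy['distance']:.1f}, "
            f"health {enemy['health']:.0%}."
        )
    if not obs["enemies"]:
        lines.append("No enemies visible.")
    return " ".join(lines)


def text_to_action(reply):
    """Map a free-form model reply to a (action, target) pair.

    Falls back to ("stop", None) when no known command is found, so a
    malformed reply never yields an invalid game action.
    """
    text = reply.lower()
    match = re.search(r"attack\s+(?:enemy\s+)?(\d+)", text)
    if match:
        return ("attack", int(match.group(1)))
    match = re.search(r"move\s+(north|south|east|west)", text)
    if match:
        return (f"move_{match.group(1)}", None)
    return ("stop", None)
```

A safe fallback action is one possible design choice here; a real adapter might instead re-prompt the model when parsing fails.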

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

arXiv preprint arXiv Prompt Injection Benchmarks and Evaluation

Kargi Chauhan, Pratibha Revankar

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to induce credential exfiltration. We study this failure mode through three complementary defenses. First, we ask whether activation probes can detect credential access before output tokens are emitted. Second, we construct honeytokens from format-specific character models and calibrate detection with split conformal prediction. Third, we treat multi-turn exfiltration as a cumulative information-flow problem and track an estimated leakage budget across conversation turns. In controlled experiments on open-weight models, activation features separate benign and credential-seeking prompts with high accuracy, including under held-out encoding transformations. In a small synthetic multi-turn suite, cumulative accounting detects attacks that per-turn detectors miss. These results are preliminary: the multi-turn benchmark is in-house and small, the activation method requires white-box access, and the information estimator provides a practical signal rather than a formal upper bound. Still, the results suggest that credential-exfiltration defenses should combine pre-output monitoring, calibrated canary detection, and temporal leakage accounting rather than relying only on text-level output filters.

Bullet Summary

  • LLM agents that hold sensitive credentials in the same context window as untrusted content are exposed to indirect prompt injection that can exfiltrate those credentials, a significant security vulnerability.
  • The paper proposes a novel Agentic Immune System (AIS) prototype that integrates three complementary defenses: activation probing before output emission (CIFT), calibrated honeytoken canaries derived via differential privacy and conformal prediction (DP-HON...
  • The CIFT method leverages pre-output transformer activation features, analyzing deviations across model layers with Mahalanobis distance and learned weights to detect credential access attempts prior to token generation, offering robustness against evasive...
  • DP-HONEY generates realistic, statistically indistinguishable honeytokens using differentially private bigram character models, enabling the detection of secret exfiltration attempts without impairing normal tool operations.
  • NIMBUS addresses multi-turn, low-rate credential leakage by accumulating an estimated information leakage budget across conversation turns, capturing covert exfiltration that single-turn detectors commonly miss.
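
To make the cumulative-accounting idea concrete, here is a minimal sketch of a leakage budget tracked across turns, compared with a per-turn threshold. It assumes some upstream estimator already supplies a per-turn leakage estimate in bits; the class name, the threshold values, and the estimates are hypothetical and do not reproduce the paper's estimator.

```python
from dataclasses import dataclass, field


@dataclass
class LeakageBudget:
    """Accumulates estimated leakage (in bits) over a conversation."""
    budget_bits: float                  # hypothetical alert threshold
    spent_bits: float = 0.0
    history: list = field(default_factory=list)

    def record_turn(self, estimated_bits: float) -> bool:
        """Add one turn's estimate; return True once the budget is exceeded."""
        if estimated_bits < 0:
            raise ValueError("leakage estimate must be non-negative")
        self.spent_bits += estimated_bits
        self.history.append(estimated_bits)
        return self.spent_bits > self.budget_bits


def per_turn_alerts(estimates, per_turn_threshold):
    """Baseline: flag each turn independently against a fixed threshold."""
    return [e > per_turn_threshold for e in estimates]


# Four low-rate turns: none crosses a per-turn threshold of 8 bits,
# but together they exceed a 20-bit conversation budget on the last turn.
estimates = [6.0, 5.0, 7.0, 6.5]
tracker = LeakageBudget(budget_bits=20.0)
cumulative_alerts = [tracker.record_turn(e) for e in estimates]
```

In this toy run the per-turn baseline stays silent while the cumulative tracker fires on the fourth turn, which is the pattern the summary describes for low-rate exfiltration.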

Proof-Carrying Agent Actions: Model-Agnostic Runtime Governance for Heterogeneous Agent Systems

arXiv preprint arXiv Governance and Policy Benchmarks and Evaluation

Zexun Wang

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

Agent systems execute through runtimes with very different control points: local coding tools, framework SDKs, managed agent platforms, API gateways, and observer-only integrations. A high-risk action such as publishing data externally may therefore appear as a shell command in one runtime, a tool call in another, and a hosted session transition in a third. This makes it difficult to answer a basic governance question consistently: what action was authorized, under whose authority, with what approval semantics, and with what evidence after execution? This paper presents Proof-Carrying Agent Actions (PCAA), a runtime-neutral governance model centered on an action certificate rather than on a vendor-native session record. PCAA organizes control around five checkpoints: pre-action admissibility, action open, assumption capture, approval, and outcome closure. It binds these checkpoints to a portable action envelope, runtime and approval receipts, and replay-ready proof. The model is extended in two practical ways: the certificate is externality-aware, carrying boundary facts such as destination visibility and account provenance, and approval is described by explicit enforceability classes rather than by a single reviewed or unreviewed bit. We study the model through a reference implementation in a heterogeneous agent control plane and a disclosure-bounded evaluation protocol. On a protected benchmark expanded from 24 executable seeds to 96 traces across four runtime families, PCAA preserves route quality while exposing distinct failure modes under ablation. The paper contributes a systems formulation of runtime governance around certificate-bearing actions and an implementation-grounded account of how that formulation can remain portable under runtime churn without collapsing into vendor-specific control surfaces.

Bullet Summary

  • The paper addresses the challenge of consistent runtime governance across heterogeneous agent systems with varying control points, focusing on authoritative authorization, approval semantics, and post-execution evidence.
  • Introduces Proof-Carrying Agent Actions (PCAA), a runtime-neutral governance model that centers on a portable action certificate rather than vendor-specific session data.
  • PCAA organizes governance into five checkpoints: pre-action admissibility, action open, assumption capture, approval, and outcome closure, all bound to a portable action envelope with runtime and approval receipts and replay-ready proof.
  • Extends the model to include externality-aware certificates carrying context like destination visibility and account provenance, and explicit approval enforceability classes beyond binary reviewed/unreviewed status.
  • Implemented a reference system in a heterogeneous agent control plane, demonstrating that PCAA maintains governance semantics, routing quality, and proof stability across diverse runtime environments.
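
One way to picture a certificate bound to the five checkpoints is the sketch below. The field names and the strict in-order rule are assumptions made for illustration; the listing does not describe the PCAA data model or its enforcement logic in this detail.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Checkpoint(IntEnum):
    """The five checkpoints named in the abstract, in listed order."""
    PRE_ACTION_ADMISSIBILITY = 1
    ACTION_OPEN = 2
    ASSUMPTION_CAPTURE = 3
    APPROVAL = 4
    OUTCOME_CLOSURE = 5


@dataclass
class ActionCertificate:
    # Hypothetical fields: the abstract only says the certificate carries
    # boundary facts such as destination visibility and account provenance.
    action: str
    authority: str
    destination_visibility: str
    account_provenance: str
    receipts: list = field(default_factory=list)

    def is_closed(self) -> bool:
        """A certificate is closed once all five receipts are present."""
        return len(self.receipts) == len(Checkpoint)

    def record(self, checkpoint: Checkpoint, evidence: str) -> None:
        """Append a receipt, assuming checkpoints must arrive in order."""
        if self.is_closed():
            raise ValueError("certificate is already closed")
        expected = Checkpoint(len(self.receipts) + 1)
        if checkpoint != expected:
            raise ValueError(f"expected {expected.name}, got {checkpoint.name}")
        self.receipts.append((checkpoint, evidence))
```

Because the certificate is a plain record rather than a runtime-specific session object, the same structure could in principle be filled in by a shell wrapper, an SDK hook, or a gateway.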

EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

arXiv preprint arXiv Benchmarks and Evaluation Agent-to-Agent Communication

Tong Nie, Yuewen Mei, Yihong Tang, Junlin He, Jie Deng, Jian Sun, Wei Ma

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

Generating safety-critical scenarios is essential for validating and improving autonomous driving systems, yet it inherently requires maximizing adversariality to expose failures while preserving realism. Existing methods usually manage this trade-off with handcrafted heuristics, confining generation to known priors and overlooking underexplored patterns. While recent open-ended agentic evolution can push this limit, unconstrained general agents lack strict simulator grounding and tend to collapse the multi-objective tension into single-scalar maximization. Here we present EvoDrive, the first automated, LLM-based agentic evolution framework for multi-objective scenario generation. EvoDrive employs a simulator-grounded actor-critic architecture where a memory-driven actor iteratively proposes improvements to the generators and critics filter out implausible candidates, and a self-evolving world evaluator routes promising proposals to optimize simulation budgets. EvoDrive further maintains a Pareto archive of evaluated candidates to preserve diverse attack-realism trade-offs and guide future evolution via simulation feedback. Benchmark results on MetaDrive and CARLA show that EvoDrive not only significantly expands the Pareto frontier across various generators, but also produces valuable scenarios for policy training.

Bullet Summary

  • EvoDrive is an LLM-based agentic evolution framework designed to generate multi-objective safety-critical scenarios for autonomous driving, balancing adversariality and realism.
  • The framework employs a simulator-grounded actor-critic architecture with memory-driven actors proposing generator improvements and specialized critics filtering implausible scenario candidates to maintain realism and validity.
  • A Pareto archive preserves diverse trade-offs between attack potential and scenario realism, avoiding the pitfalls of single-objective scalar maximization common in prior methods.
  • A self-evolving world evaluator predicts candidate metrics to efficiently allocate costly simulation budgets, prioritizing promising scenarios for full evaluation while maintaining auditability and calibration.
  • Experiments on MetaDrive and CARLA simulators demonstrate EvoDrive's capability to significantly expand the Pareto frontier, producing valuable scenarios that enhance downstream autonomous driving policy robustness.
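
The Pareto archive mentioned above can be sketched generically as follows. This is a standard non-dominated archive over two maximized objectives (adversariality, realism), not EvoDrive's implementation; the candidate names and scores are made up.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and
    strictly better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


class ParetoArchive:
    """Keeps only candidates that no other stored candidate dominates."""

    def __init__(self):
        self.entries = []  # list of (name, (adversariality, realism))

    def add(self, name, scores):
        """Insert a candidate; return False if it is dominated or a duplicate."""
        if any(dominates(s, scores) or s == scores for _, s in self.entries):
            return False
        # Drop stored candidates that the newcomer dominates.
        self.entries = [(n, s) for n, s in self.entries if not dominates(scores, s)]
        self.entries.append((name, scores))
        return True


archive = ParetoArchive()
archive.add("aggressive_cut_in", (0.9, 0.2))
archive.add("late_brake", (0.4, 0.8))
archive.add("weak_variant", (0.3, 0.7))         # dominated by late_brake
archive.add("sharper_cut_in", (0.95, 0.3))      # replaces aggressive_cut_in
```

Keeping the whole front rather than a single weighted score is what preserves distinct attack-realism trade-offs for later rounds.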

StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

arXiv preprint arXiv Benchmarks and Evaluation Orchestration Risk

Taiyu Zhu, Yifan Wu, Weilin Jin, Ying Li, Gang Huang

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at https://github.com/taiyu-zhu/StepFinder.

Bullet Summary

  • Multi-agent systems (MAS) utilizing large language models (LLMs) excel in complex multi-step tasks but are vulnerable to cascading failures due to single-step execution errors.
  • Failure attribution, the task of identifying the root cause step leading to failure in MAS, is critical for improving system reliability but existing LLM-based methods are computationally expensive and sensitive to noisy execution logs.
  • StepFinder is proposed as a temporal semantic framework that encodes execution trajectories into temporal semantic sequences using LLMs only for embedding, followed by lightweight temporal modeling and attention modules to capture sequential and cross-step...
  • The framework introduces multi-scale temporal differencing and position bias mechanisms to refine step-level error scores, improving precision in root cause identification by favoring earlier error steps and highlighting abnormal temporal fluctuations.
  • StepFinder integrates agent-aware attention with gating mechanisms and applies an auxiliary temporal consistency loss to encourage capturing temporal dependencies and support accurate failure localization.
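
As a rough illustration of refining step-level scores with multi-scale differences and a position bias, consider the sketch below. The exact formulas in StepFinder are not given in this listing, so the averaging of absolute differences, the linear position penalty, and all weights are assumptions chosen only to show the two effects: emphasizing abrupt changes and favoring earlier steps.

```python
def multi_scale_diff(scores, scales=(1, 2)):
    """For each step t, average |s[t] - s[t-k]| over the scales k that fit."""
    out = []
    for t in range(len(scores)):
        diffs = [abs(scores[t] - scores[t - k]) for k in scales if t - k >= 0]
        out.append(sum(diffs) / len(diffs) if diffs else 0.0)
    return out


def refine(scores, scales=(1, 2), diff_weight=0.5, position_decay=0.05):
    """Boost steps with abrupt score changes and penalize later positions.

    `diff_weight` and `position_decay` are hypothetical hyperparameters.
    """
    diffs = multi_scale_diff(scores, scales)
    return [
        s + diff_weight * d - position_decay * t
        for t, (s, d) in enumerate(zip(scores, diffs))
    ]


def root_cause_step(scores, **kwargs):
    """Index of the step with the highest refined score."""
    refined = refine(scores, **kwargs)
    return max(range(len(refined)), key=refined.__getitem__)


# Steps 2 and 3 have equal raw scores; the jump at step 2 and its earlier
# position make it the clear winner after refinement.
raw_scores = [0.1, 0.1, 0.7, 0.7, 0.2]
predicted = root_cause_step(raw_scores)
```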

FORGE: Multi-Agent Graduated Exploitation and Detection Engineering

Merged record merged scholarly record arXiv Orchestration Risk Benchmarks and Evaluation

Farooq Shaikh

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. Evaluation on 603 CVEs from the CVE-GENIE dataset achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.

Bullet Summary

  • FORGE addresses the challenge of high vulnerability disclosure volumes overwhelming organizational assessment capacity by integrating three siloed research domains: proof-of-concept generation, vulnerability prioritization, and detection rule engineering.
  • The system employs a multi-agent pipeline consisting of Intel, Generator, Planner, Exploit, and Detector agents to generate vulnerable applications from CVE metadata, perform coached multi-turn exploitation assessed via a four-level graduated taxonomy (L0 t...
  • FORGE's graduated exploitation depth mechanism provides nuanced exploitation assessments rather than binary success/failure, facilitating richer behavioral data for detection rule engineering and more accurate validation of prioritization models.
  • A tiered knowledge architecture accumulates intelligence across CVE assessments, transferring build insights and exploitation experience across diverse CVE types, programming languages, and CWE classes, thereby improving efficiency and reliability.
  • Evaluation on 603 CVEs from the CVE-GENIE dataset demonstrated a 67.8% end-to-end exploitation success rate at L1+ with an average cost of $1.50 per CVE, spanning eight programming languages and 187 CWE types, highlighting the system's scalability and econo...
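
Graduated depth replaces a pass/fail flag with an ordered level, so rates such as "L1+" become simple threshold counts. The sketch below assumes only what the abstract states (L0 is no evidence, L3 is full compromise); the meanings of L1 and L2 are not defined in this listing and are left unnamed, and the outcomes are invented.

```python
from enum import IntEnum


class Depth(IntEnum):
    L0 = 0  # no evidence of exploitation (per the abstract)
    L1 = 1  # intermediate level; definition not given in this listing
    L2 = 2  # intermediate level; definition not given in this listing
    L3 = 3  # full compromise (per the abstract)


def rate_at_least(outcomes, floor):
    """Fraction of assessments that reached `floor` or deeper."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    return sum(1 for depth in outcomes if depth >= floor) / len(outcomes)


# Invented outcomes for five CVEs, for illustration only.
outcomes = [Depth.L0, Depth.L1, Depth.L3, Depth.L2, Depth.L0]
l1_plus = rate_at_least(outcomes, Depth.L1)
l2_plus = rate_at_least(outcomes, Depth.L2)
```

The same ordered value can then select which traces feed detection-rule generation, for example only those at L2 or deeper.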

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

arXiv preprint arXiv Orchestration Risk Benchmarks and Evaluation

Xian Qi Loye, Qinglin Su, Zhexin Zhang, Shiyao Cui, Qi Zhu, Fei Mi, Hongning Wang, Minlie Huang

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or static supervision, making it difficult to balance safety with useful tool execution across diverse agentic risks. We introduce RUBAS, a rubric-based reinforcement learning framework for agent safety. RUBAS decomposes agent behavior into four dimensions: tool-use safety, argument safety, response safety, and helpfulness. These structured rubrics provide fine-grained and interpretable rewards over complete agent trajectories, enabling reinforcement learning to optimize safe tool use while preserving task completion. Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool-grounded hallucinations, and maintains competitive utility. Our results suggest that multi-dimensional rubric rewards provide an effective training signal for aligning LLM agents in safety-critical tool-use settings.

Bullet Summary

  • RUBAS addresses the challenge of aligning large language model (LLM) agents for safe, real-world tool usage, overcoming limitations of coarse or static supervision methods.
  • The framework decomposes agent safety into four dimensions: tool-use safety, argument safety, response safety, and helpfulness, forming structured, interpretable rubrics for behavior evaluation.
  • Rubric-based rewards are binary criteria aggregated into scalar rewards over entire agent trajectories, facilitating reinforcement learning to optimize safety alongside task completion.
  • RUBAS employs GPT-5.1 to generate instance-specific rubrics for each task, enhancing the precision of safety evaluation and enabling scalable annotations that align closely with human judgments.
  • Training utilizes Group Relative Policy Optimization (GRPO) with rubric-based rewards combined with completeness and reasoning trace incentives to ensure thorough and safe agent responses.
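
The reward path described above (binary rubric criteria, aggregated to a scalar, then compared within a group of sampled trajectories) might look like the following sketch. Equal weighting of the four dimensions and the use of the population standard deviation are assumptions; the listing does not specify RUBAS's aggregation weights or its additional completeness and reasoning-trace terms, which are omitted here.

```python
from statistics import mean, pstdev

DIMENSIONS = ("tool_use_safety", "argument_safety", "response_safety", "helpfulness")


def rubric_reward(checks):
    """Aggregate binary rubric checks into a scalar in [0, 1].

    `checks` maps each dimension to a non-empty list of booleans; each
    dimension contributes its pass rate, and dimensions are weighted
    equally (an assumption).
    """
    per_dimension = [sum(checks[d]) / len(checks[d]) for d in DIMENSIONS]
    return sum(per_dimension) / len(per_dimension)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


example_checks = {
    "tool_use_safety": [True, False],
    "argument_safety": [True],
    "response_safety": [True, True],
    "helpfulness": [False],
}
example_reward = rubric_reward(example_checks)          # pass rates 0.5, 1, 1, 0
advantages = group_relative_advantages([1.0, 0.5, 0.0])
```

Trajectories that score above their group's mean receive a positive advantage and those below a negative one, so no separate value model is needed for the baseline.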

Agentic Relationship Harm: Benchmarking and Gating Relational Manipulation in AI Agents

arXiv preprint arXiv Benchmarks and Evaluation Governance and Policy Trust and Identity

Pei-Sze Tan, Tasuku Igarashi, Isao Echizen

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

AI agents built on large language models can assist not only legitimate tasks but also relational manipulation. AI agents can be used to help a user maintain a deceptive identity, intensify emotional dependency, isolate a target, or prepare for later extraction. We conceptualise this risk as agentic relationship harm: workflow-level assistance that can exploit recipient vulnerability, persuasive influence, and relational power asymmetry. Existing safety evaluations and generic guardrails often treat harmfulness as a property of isolated outputs, missing role-sensitive interaction patterns. To study this, we introduce a 110-prompt benchmark with balanced attacker- and victim-side cases, a relationship-specific labelling framework, and a lightweight post-generation policy gate for local agent deployments. In our evaluation, the relationship-specific gate outperforms generic safety prompting under automated judging, with no judge-identified harmful-compliance cases on the main benchmark or multi-turn stress test while preserving victim-side protective intervention. These results suggest that relationship harm is a distinct sociotechnical risk surface and that role-sensitive evaluation plus lightweight policy gating offers a practical path beyond generic refusal prompting.

Bullet Summary

  • AI agents built on large language models can facilitate relational manipulation, termed 'agentic relationship harm', by assisting users in deceptive, coercive, or exploitative workflows affecting human-human relationships.
  • Traditional AI safety evaluations often focus on isolated outputs and do not capture manipulative interaction patterns that depend on user roles, such as attacker versus victim, necessitating role-sensitive assessment frameworks.
  • The authors introduce a 110-prompt benchmark with balanced attacker- and victim-side cases, a relationship-specific labeling framework capturing recipient vulnerability and power asymmetry, and a structured LLM judge to evaluate harmful assistance versus pr...
  • A lightweight post-generation policy gate is proposed that effectively blocks harmful compliance in attacker-mode prompts while preserving protective interventions for victim-mode prompts, outperforming generic safety prompting methods.
  • Relationship harm involves nuanced trade-offs where certain AI-assisted behaviors can be harmful if aiding an attacker but protective if supporting a victim, highlighting the need for intent- and role-sensitive mitigation strategies.

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

arXiv preprint arXiv Governance and Policy Trust and Identity Benchmarks and Evaluation

Thanh Luong Tuan, Abhijit Sanyal

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We present an ontology-grounded verification framework -- to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine-verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam (where Vietnam's 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation significantly outperformed the dominant persona-based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e-6); transparently, its advantage over plain and retrieval-augmented prompting did not survive Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The framework offers a reproducible, regulation-grounded route to pre-deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.

Bullet Summary

  • The paper addresses the critical challenge of pre-deployment verification of enterprise AI agents to ensure regulatory compliance, safety, and trustworthiness, especially in highly regulated industries like finance and healthcare.
  • It introduces an ontology-grounded verification framework comprising three core components: (1) an Agent Operational Envelope formalizing permissions, constraints, safety properties, and governance rules; (2) an ontology-to-scenario generation pipeline that...
  • A controlled pilot study was conducted across four regulated industries and two jurisdictions (United States and Vietnam), generating 1,800 test scenarios from 125 primary regulatory requirements and 25 injected faults to empirically evaluate the framework.
  • Ontology-grounded scenario generation significantly outperformed traditional persona-based generation methods in regulatory coverage (48.3% vs. 33.1%) and domain specificity, demonstrating higher alignment with complex regulatory environments and industry c...
  • The framework employs advanced evaluation techniques including LLM-as-judge agents assessing compliance, safety, and adversarial resilience, supported by formal verification methods such as temporal logic and bounded model checking to create a spectrum of p...
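
The abstract notes that some advantages did not survive Bonferroni correction. For readers unfamiliar with it, the correction multiplies each p-value by the number of comparisons (capped at 1) before testing against the significance level. The p-values below are invented, and the number of comparisons in the paper is not stated in this listing.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (corrected_p, is_significant) for each raw p-value."""
    m = len(p_values)
    results = []
    for p in p_values:
        corrected = min(1.0, p * m)   # cap at 1 so it stays a probability
        results.append((corrected, corrected < alpha))
    return results


# Three invented comparisons: only the first remains significant,
# although all three raw p-values are below 0.05.
raw_p_values = [0.001, 0.02, 0.04]
corrected = bonferroni(raw_p_values)
```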

Toward a Modular Architecture for Embedded AI Agent Systems at the Edge

Merged record merged scholarly record arXiv Governance and Policy Benchmarks and Evaluation

Marcus Rüb, Michael Gerhards

Published 2026-06-01

Venue: arXiv

Open Source Record

Abstract

The rise of Large Language Models (LLMs) has enabled agentic AI capable of complex reasoning and tool use; however, deploying such autonomy in pervasive computing environments remains challenging due to the strict memory and energy constraints of embedded microcontrollers. Existing frameworks typically assume server-class resources or continuous connectivity, leaving a gap for deeply embedded systems. This paper proposes a modular reference architecture for Embedded Agent Systems that bridges the divide between deterministic real-time control and agentic intelligence. We introduce a tiered design that decouples On-Device Agents - executing highly compressed neural networks and rule-based logic for low-latency, privacy-critical tasks - from Cloud-Augmented Agents that leverage Small Language Models (SLMs) for higher-level reasoning and planning. A key contribution is the integration of a cross-cutting Governance Layer, ensuring observability, policy enforcement, and safety across distributed fleets of autonomous devices. Rather than presenting purely empirical benchmarks, we analyze architectural design principles and trade-offs regarding latency, energy, and reliable execution in resource-constrained environments.

Bullet Summary

  • Challenges of deploying Large Language Model (LLM)-based agentic AI on deeply embedded microcontrollers due to strict memory and energy constraints in pervasive computing environments.
  • Proposal of a modular, tiered reference architecture for Embedded Agent Systems that separates On-Device Agents executing compressed neural networks for low-latency, privacy-critical tasks from Cloud-Augmented Agents using Small Language Models (SLMs) for a...
  • Introduction of a cross-cutting Governance Layer to provide observability, policy enforcement, and safety across distributed autonomous device fleets, supporting auditability and compliance.
  • Definition of two hardware architecture flavors: Flavor A (Autonomous Gateway Agent with local SLMs enabling offline operation and privacy but limited capacity) and Flavor B (Tethered MCU Agent with minimal local logic relying on cloud LLMs for complex reas...
  • Detailed analysis of architectural trade-offs involving latency, energy consumption, memory requirements, privacy, and connectivity to tailor agent deployment based on hardware capability and application needs.

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

arXiv preprint arXiv Benchmarks and Evaluation Memory Poisoning

Hao Cheng, Changtao Miao, Tianle Song, Yin Wu, He Liu, Erjia Xiao, Junchi Chen, Xiaoyu Shi

Published 2026-06-01

Venue: arXiv

Open Source Record

Abstract

Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluations. Current agent security benchmarks often rely on manually curated tasks, provide limited coverage of emerging threats, and focus primarily on final outcomes rather than the execution processes that lead to unsafe behavior. We introduce SeClaw, a framework that combines specification-driven security task synthesis with execution-based security evaluation for autonomous agents. Spec-driven security task synthesis enables scalable and controllable construction of security tasks from structured risk specifications, while SeClaw Docker provides a standardized testbed for evaluating agent behavior under diverse safety-risk scenarios. The benchmark covers risks arising from resources, user tasks, environments, and intrinsic agent behaviors, and supports trajectory-aware assessment of unsafe actions beyond final responses. By bridging systematic task synthesis and reproducible security evaluation, SeClaw provides a practical foundation for measuring, diagnosing, and comparing security failures in autonomous LLM agents. The code is available at https://github.com/seclaw-eval/seclaw-eval.

Bullet Summary

  • The paper introduces SeClaw, a novel framework designed to evaluate security risks in autonomous large language model (LLM) agents by combining specification-driven security task synthesis with execution-based security evaluation.
  • SeClaw addresses limitations in existing benchmarks that rely on manually curated tasks by enabling scalable, continuous, and controllable construction of security tasks from structured risk specifications.
  • The framework simulates multi-turn agent-user interactions in isolated Docker environments, allowing reproducible and auditable benchmarking with detailed, trajectory-aware assessment of unsafe behaviors beyond just final agent outputs.
  • It targets comprehensive security risks affecting autonomous agents including resource risks (malicious tools/skills), task risks (such as jailbreak attacks), environment risks (adversarial inputs), and intrinsic agent behaviors.
  • SeClaw's two main stages are Security Task Synthesis—transforming abstract security risks into executable tasks through prototype synthesis, task instantiation, and validation—and Security Evaluation, which runs tasks in a sandbox to analyze agent behavior...

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Merged record merged scholarly record arXiv Governance and Policy Benchmarks and Evaluation

Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J. M. Valverde-García, Annemarie F. Laudanski, Álvaro Gutiérrez, Eduardo Rocon, Manuel Cebrian

Published 2026-06-01

Venue: arXiv

Open Source Record

Abstract

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical domains -- a gap made legally untenable by emerging AI regulation. Existing evaluation paradigms share a common flaw: centralised judgment creates single points of failure and demands domain-specific expertise. Here we present POIROT, a protocol that repurposes a system's own agents as its diagnostic layer, leveraging the epistemic diversity already present in the architecture. Across evaluated settings, POIROT outperforms single-LLM evaluator baselines, with gains that scale with problem complexity (OR = 1.60, p = 0.008), agent count, and fault dimensionality, persisting under compound fault conditions. These results demonstrate that safety oversight need not be externalised: the agents executing a role carry sufficient collective intelligence to audit it. We release POIROT as an open-source library alongside BLAME, a benchmark for fault attribution in safety-critical multi-agent systems.

Bullet Summary

  • POIROT addresses the challenge of detecting and attributing failures in Large Language Model-based Multi-Agent Systems (LLM-MAS), which are critical given the resistance of emergent failures and hallucinations to traditional characterization methods and the...
  • Unlike centralized evaluation methods reliant on domain experts or single LLM judges, POIROT decentralizes diagnostic evaluation by repurposing the system's own agents to collaboratively detect failures using their inherent epistemic diversity, mitigating s...
  • POIROT employs a structured N-dimensional hazard space representing potential failure sources across agents, software, hardware, and human components, enabling precise and interpretable fault attribution through an error vector mechanism.
  • The protocol operates in five phases: hazard space definition, agent self-assessment for failures, peer interrogation involving structured dialogue to validate observations, private voting by each agent on likely fault causes, and hazard-aware aggregation u...
  • Comprehensive evaluations on benchmarks such as CORTEX (a clinical exoskeleton therapy system) and TradingAgents (a twelve-agent financial trading system) demonstrate that POIROT consistently outperforms single-LLM evaluators, with gains amplifying as pro...
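
The private-voting phase listed above can be illustrated with a minimal sketch. The summary of the aggregation step is cut off, so the code below does not reproduce POIROT's hazard-aware rule: it substitutes a plain plurality count as a stand-in. The function name, agent names, and hazard labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def aggregate_votes(votes):
    """Return (hazard, share) for the hazard blamed by the most agents.

    `votes` maps an agent name to the hazard dimension it privately blames.
    ASSUMPTION: plain plurality voting, used here as a placeholder; this is
    not POIROT's hazard-aware aggregation, which the summary does not detail.
    """
    if not votes:
        raise ValueError("no votes cast")
    counts = Counter(votes.values())
    hazard, n = counts.most_common(1)[0]
    return hazard, n / len(votes)


# Hypothetical three-agent system; two agents blame the software dimension.
votes = {"planner": "software", "executor": "software", "monitor": "hardware"}
hazard, share = aggregate_votes(votes)
print(hazard, round(share, 2))
```

A weighted or hazard-specific rule would replace the `Counter` step; the surrounding structure (one private vote per agent, one attributed hazard out) would stay the same.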

Cross-Vendor Sola ISPM Benchmark: Evaluating Agentic AI for Federated Identity Security Reasoning

Semantic Scholar · Semantic Scholar scholarly work Semantic Scholar Benchmarks and Evaluation Trust and Identity

Eden Yavin, Gal Engelberg, Konstantin Koutsyi, Leon Goldberg, Gali Bar-On

Published 2026-06-01

Venue: Semantic Scholar

Open Source Record

Abstract

The rapid proliferation of multi-cloud and SaaS platforms has transformed Identity Security Posture Management (ISPM) into a fundamentally cross-vendor challenge: critical misconfigurations and privilege escalation paths increasingly span multiple identity providers, infrastructure layers, and authentication systems never designed to interoperate. Existing evaluations focus on isolated single-platform environments and provide no means to assess whether an AI agent can reason across these fragmented boundaries. To address this gap, we introduce the Cross-Vendor Sola ISPM Benchmark, a production-grade benchmark of 50 data-grounded tasks requiring multi-hop entity resolution and cross-system correlation across eight integrated enterprise platforms including AWS, Okta, Azure AD, and Google Workspace. We also contribute an evaluation framework measuring not only final answer correctness but also evidentiary grounding, structural join fidelity, retrieval quality, and SQL equivalence. We evaluate the Sola AI Agent across five context configurations - from no injected metadata to full schema, graph, and retrieval context - using three frontier LLMs. Results show that structured relational context improves answer correctness by approximately 34% in relative terms and reduces exploration queries by approximately 70% across all tested models, with the largest gains driven by cross-vendor graph topology. Our findings indicate that frontier LLMs possess substantial latent security reasoning capability, but reliable cross-vendor identity analysis is fundamentally constrained by the availability of explicit relational context for entity resolution and evidentiary grounding. Under full context, the best configuration achieves 78% answer correctness while reducing complete failure to 4%.

Bullet Summary

  • Identity Security Posture Management (ISPM) increasingly requires reasoning across multiple cloud and SaaS vendors due to misconfigurations and privilege escalation paths spanning disparate systems.
  • Current evaluations focus on isolated single-platform environments and lack mechanisms to assess AI agents' ability to reason across fragmented, cross-vendor identity systems.
  • Introduced the Cross-Vendor Sola ISPM Benchmark, consisting of 50 tasks requiring multi-hop entity resolution and correlation across eight enterprise platforms including AWS, Okta, Azure AD, and Google Workspace.
  • Developed an evaluation framework that assesses not only final answer correctness but also evidentiary grounding, structural join fidelity, retrieval quality, and SQL equivalence.
  • Evaluated the Sola AI Agent using three leading large language models (LLMs) across five context configurations ranging from no metadata to full schema, graph, and retrieval context.
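
The abstract reports its correctness gain in relative terms, which is easy to confuse with a gain in percentage points. The sketch below shows the distinction. The baseline and improved values are hypothetical numbers picked to give a 34% relative gain; they are not figures from the paper, and the function names are illustrative.

```python
def relative_gain(baseline, improved):
    """Relative improvement: the change expressed as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


def absolute_gain_points(baseline, improved):
    """Absolute improvement in percentage points."""
    return (improved - baseline) * 100


# HYPOTHETICAL values, not taken from the paper.
baseline, improved = 0.50, 0.67
print(f"relative gain: {relative_gain(baseline, improved):.0%}")
print(f"absolute gain: {absolute_gain_points(baseline, improved):.0f} points")
```

With these assumed values, a 34% relative gain corresponds to only 17 percentage points, which is why the two framings should not be read interchangeably.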