Research area drill-down

Memory Poisoning

Papers currently mapped into this multi-agent security subarea from the merged research feed.

Active feeds: arXiv, OpenAlex, Crossref, Semantic Scholar, DBLP

0 of 36 articles selected

Showing 36 of 135 matching articles

WebMCP Tool Surface Poisoning: Runtime Manipulation Attacks on LLM Agents

Merged record merged scholarly record arXiv Orchestration Risk Memory Poisoning Governance and Policy

Lin-Fa Lee, Yi-Yu Chang, Chia-Mu Yu, Kuo-Hui Yeh

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

WebMCP is a newly emerging protocol that enables websites to expose tools directly to AI agents, bypassing traditional user interfaces and introducing new security risks. The dynamic exposure of agent-accessible tools in WebMCP expands the attack surface of web sessions, especially when third-party scripts are involved. In this study, we identify a new potential threat, termed Mid-Session Tool Injection (MSTI), in which attackers leverage third-party scripts to inject malicious tools during an active session. To better characterize this threat, we classify MSTI based on the stage and target of manipulation, distinguishing between Tool Hijacking and Tool Framing. Tool Hijacking modifies the set of tools visible to the agent through mechanisms such as the AbortSignal API or race conditions during tool registration. In contrast, Tool Framing influences the agent's perception of tool roles through metadata fields such as tool name, description, readOnlyHint, and inputSchema. Our implementation demonstrates that both Tool Hijacking and Tool Framing can successfully disrupt the intended functionality of WebMCP. Based on these results, we outline potential mitigation directions and provide security design recommendations for WebMCP, including binding tool identity to its origin, ensuring lifecycle consistency, enforcing data boundaries for third-party tools, and maintaining traceable logs of tool registration and invocation. These findings indicate that MSTI arises from WebMCP's unique tool lifecycle and structured metadata, making the tool surface itself an emerging security concern.

Bullet Summary

  • WebMCP introduces a protocol for websites to directly expose tools to AI agents dynamically, expanding the attack surface during active sessions.
  • The study identifies a new security threat, Mid-Session Tool Injection (MSTI), where attackers use third-party scripts to inject malicious tools during WebMCP sessions.
  • MSTI attacks are classified into Tool Hijacking, which alters the legitimate tool set exposed to agents, and Tool Framing, which manipulates tool metadata affecting agent perception.
  • Experiments with state-of-the-art LLMs show MSTI attacks can stealthily cause agents to invoke malicious tools, leak data, or deviate workflows without detection.
  • Attack success depends on timing (early injection before first tool invocation) and on leveraging metadata fields such as description and readOnlyHint for semantic framing.

Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

arXiv preprint arXiv Memory Poisoning Trust and Identity Governance and Policy

Jiawen Zhang, Kejia Chen, Jiachen Ma, Yangfan Hu, Lipeng He, Yechao Zhang, Jian Liu, Xiaohu Yang

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks. In this paper, we study memory search as a trust boundary in personal AI agents. We evaluate representative agentic memory frameworks, including A-Mem, Mem0, and MemOS, together with OpenClaw, a real-world personal-agent environment with persistent state and tool-use capability. Our results show that long-term memory is not merely a utility layer, but a durable control channel that can reshape how agents interpret tasks and execute actions, leaving them highly susceptible to the aforementioned threats. To mitigate these vulnerabilities, we propose MemGate, a lightweight and deployable memory plug-in for trustworthy memory search, with only 9M parameters and a 35.1MB footprint. MemGate is inserted between the vector memory store and the backbone LLM, requiring no LLM modification, memory-database rewriting, or inference-time LLM judge. It applies a query-conditioned neural gate to candidate memory representations, turning raw similarity search into task-conditioned memory admission. Across multiple mainstream memory frameworks, real-world agent settings, and diverse LLM backbones, MemGate reduces memory-induced threats while preserving long-term memory utility.

Bullet Summary

  • Personal AI agents use long-term memory for persistent personalization, but existing retrieval methods based on semantic similarity can lead to trustworthiness issues such as cross-domain leakage, sycophancy, tool-call drift, and memory-induced jailbreaks.
  • Memory retrieval acts as a critical trust boundary influencing how agents interpret tasks and execute actions, highlighting the need for mechanisms that ensure contextually appropriate memory admission.
  • The paper introduces MemGate, a lightweight neural gating plug-in placed between the vector memory store and the backbone LLM, which applies a query-conditioned gating mask to filter memory embeddings based on task relevance without modifying the LLM or mem...
  • MemGate transforms raw similarity-based retrieval into task-conditioned memory admission by attenuating inappropriate memory vector dimensions, thereby reducing the injection of unsafe or irrelevant memories.
  • MemGate's parameters are optimized via Direct Preference Optimization (DPO), balancing suppression of conflicting memories with preservation of beneficial ones, backed by theoretical guarantees limiting semantic drift.

Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

Merged record merged scholarly record arXiv Semantic Scholar Memory Poisoning Prompt Injection

Minseok Choi, Seungbin Yang, Dongjin Kim, Subin Kim, Jungmin Son, Yunseung Lee, Jaegul Choo, Youngjun Kwak

Published 2026-06-04

Venue: arXiv

Open Source Record

Abstract

Despite advances in safety alignment, large language models remain vulnerable to continuously evolving jailbreaks. Existing fine-tuned safety classifiers cannot adapt to these evolving attacks, while adaptive memory-based guardrails tend to over-refuse benign queries that resemble stored attacks. We propose Membrane, a self-evolving guardrail built on Contrastive Safety Memory (CSM): each cell pairs the conditions for blocking a harmful query with those for permitting a superficially similar benign request. Without retraining, Membrane evolves CSM by distilling each harmful interaction and its benign counterpart into a contrastive cell indexed by the underlying attack strategy, so that one cell generalizes across topical variants of the same mechanism. At inference, retrieved cells serve as grounding context for precise safety decisions. Across model-level safety on HarmBench and agent-level safety on AgentHarm, Membrane achieves the highest F1 on all six jailbreak attacks. Notably, benign refusal on AgentHarm stays at 7-14%, well below the 28-85% range of prior guards. Memory cells also retain 87-88% F1 under cross-attack transfer and remain stable under memory poisoning.

Bullet Summary

  • Membrane is a self-evolving guardrail designed to protect large language model (LLM) agents against continuously evolving jailbreak attacks by leveraging a Contrastive Safety Memory (CSM).
  • CSM stores paired memory cells that contain conditions for blocking harmful queries alongside safe counterparts of superficially similar benign requests, enabling precise safety decisions and reducing overblocking of benign inputs.
  • Membrane evolves without retraining by distilling pairs of harmful and benign interactions into contrastive cells indexed by the underlying attack strategy, allowing generalization across various topical attack variants.
  • A two-stage retrieval pipeline using vector similarity followed by an LLM-based Retrieval Critic improves the relevance and precision of retrieved memory cells at inference time, enhancing decision accuracy.
  • Evaluations on HarmBench (model-level) and AgentHarm (agent-level) benchmarks demonstrate that Membrane achieves state-of-the-art F1 scores across six jailbreak attack strategies, with significantly lower benign refusal rates compared to prior methods.

Channel Fracture: Architectural Blind Spots in Scheduled Cross-Agent Memory Injection for Multi-Agent Orchestration Systems

arXiv preprint arXiv Memory Poisoning Orchestration Risk Governance and Policy

Dexing Liu

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Multi-agent AI orchestration systems increasingly rely on persistent memory to maintain context across sessions, agents, and tasks. When one agent must inject knowledge into another agent's memory -- a common requirement in hierarchical team architectures -- the delivery mechanism must be architecturally sound. We report the discovery of a systematic failure mode we term channel fracture: a condition where scheduled (cron) agents in orchestration frameworks are silently unable to write to the target agent's persistent memory due to hardcoded memory isolation guards. Through experiments on a production Hermes Agent deployment with five specialized profiles, we tested three injection channels: (A) direct SQLite database writes, (B) target-agent self-writes via memory tools, and (C) cron-delegated writes. Channel C failed completely due to two architectural constraints: skip_memory=True hardcoded at the scheduler layer and dynamic registration of memory tools contingent on _memory_manager initialization, which is bypassed in cron execution contexts. We propose CADVP (Cross-Agent Delivery Verification Protocol) v1.1, a 13-dimension verification framework with a veto-level channel confirmation check (CC-0) that prevents false-positive delivery assurance. We articulate two design principles: the inverse verification principle and the channel matching principle.

Bullet Summary

  • Multi-agent AI orchestration systems depend on persistent cross-agent memory for maintaining context, but scheduled agents (e.g., cron jobs) can experience silent failures (channel fracture) when injecting memory due to architectural memory isolation guards.
  • Three injection channels were tested in a production Hermes Agent deployment: direct SQLite writes succeeded, target-agent self-writes succeeded conditionally, while cron-delegated writes failed completely due to hardcoded flags (skip_memory=True) and lack...
  • Channel fracture causes critical blind spots where writes appear successful but target agent memories remain empty, compromising multi-agent knowledge injection workflows.
  • The paper introduces CADVP (Cross-Agent Delivery Verification Protocol) v1.1, a 13-dimension verification framework with a veto-level channel confirmation check (CC-0) to prevent false-positive delivery assurances and verify channel availability and data in...
  • The Three-Gate Quality System extends CADVP with layered delivery verification: L1 self-verification, L2 evidence verification, and L3 cross-review by independent agents to ensure content correctness and completeness.

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Merged record merged scholarly record arXiv Memory Poisoning Benchmarks and Evaluation Prompt Injection

Pritam Dash, Tongyu Ge, Aditi Jain, Tanmay Shah, Zhiwei Shang

Published 2026-06-03

Venue: arXiv

Open Source Record

Abstract

Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, where a single adversarial memory write can exert long-term influence over agent behavior. We present a systematic study of memory poisoning in LLM-based agents. We identify four memory write channels and nine structural vulnerabilities in model capabilities, system prompt design, and agent system architecture that make these channels exploitable. Based on these vulnerabilities, we develop a taxonomy of six classes of memory poisoning attacks. Furthermore, we design MPBench -- a benchmark for evaluating memory poisoning attacks, and show that agents designed to write and retrieve memory more aggressively are more exploitable. We also show that existing prompt injection defenses fail to cover memory poisoning attacks. Our findings provide a foundation for understanding and mitigating memory poisoning attacks against AI agents.

Bullet Summary

  • The paper identifies memory poisoning as a critical risk in LLM-based AI agents where adversarial inputs maliciously alter persistent agent memory, influencing future behavior over long interactions.
  • Four memory write channels are discovered in agent systems: explicit instruction-executed writes, system prompt-driven writes, compaction-driven writes, and experience-to-procedure writes, each susceptible to poisoning.
  • Nine structural vulnerabilities are categorized across model capabilities, prompt designs, and system architecture, including issues like inability to distinguish instructions from data, underspecified memory write policies, and lack of write-path validation.
  • A taxonomy of six classes of memory poisoning attacks is developed, ranging from explicit command insertion to skill-procedure insertion, characterized by varying signal strengths and exploitation methods.
  • MPBench, a new benchmark comprising 3,240 test cases across six attack classes and seven domains, is introduced to systematically evaluate memory poisoning attacks and their influence on agent behavior.

Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions

Merged record merged scholarly record arXiv Agent-to-Agent Communication Memory Poisoning

Aliakbar Mehdizadeh, Martin Hilbert

Published 2026-06-02

Venue: arXiv

Open Source Record

Abstract

How much should an LLM agent remember, and how should multi-agent systems be connected when trying to reach consensus? We show these two design choices interact in a way that flips the sign of memory's effect on coordination. Across 432 simulation runs of a networked Naming Game on eight fixed 16-agent topologies, we vary memory depth and network structure. Longer memory slows the time to reach steady state in decentralized networks but accelerates it in centralized ones; the same parameter pushes the system in opposite directions depending on topology. Critically, "faster settling" in centralized networks means locking in to a fragmented plateau more quickly, not reaching system-wide consensus, which can be used to generate diverging opinions. We further document a memory-mediated speed-unity trade-off: centralized networks consistently preserve more competing conventions than decentralized networks, but their settling speed depends sharply on memory. At the agent level, within-network analyses show that high-betweenness bridges suffer a brokerage penalty while agents in locally clustered neighborhoods achieve higher coordination success. Finally, in search of analytically tractable generative mechanisms, we find that agents' choices are well captured by Fictitious Play, indicating belief-based rather than reward-based adaptation. The practical implication: memory depth and communication topology should be co-designed, not optimized in isolation.

Bullet Summary

  • The paper investigates the joint impact of memory depth and network topology on consensus formation among large language model (LLM) agents using a networked Naming Game framework.
  • Longer memory slows the time to reach steady state in decentralized networks but accelerates it in centralized networks, although faster settling in centralized networks leads to stable fragmentation rather than full consensus.
  • Centralized networks tend to preserve multiple competing conventions longer, demonstrating a trade-off between settling speed and unity of agreement.
  • Agents with high betweenness centrality, acting as bridges between clusters, suffer a coordination penalty intensified by longer memory, while agents in locally clustered neighborhoods achieve higher coordination success.
  • LLM agent behaviors align better with a belief-based Fictitious Play model than reward-based Reinforcement Learning, indicating coordination driven by partner beliefs rather than past rewards.

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

arXiv preprint arXiv Benchmarks and Evaluation Memory Poisoning

Hao Cheng, Changtao Miao, Tianle Song, Yin Wu, He Liu, Erjia Xiao, Junchi Chen, Xiaoyu Shi

Published 2026-06-01

Venue: arXiv

Open Source Record

Abstract

Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluations. Current agent security benchmarks often rely on manually curated tasks, provide limited coverage of emerging threats, and focus primarily on final outcomes rather than the execution processes that lead to unsafe behavior. We introduce SeClaw, a framework that combines specification-driven security task synthesis with execution-based security evaluation for Autonomous agents. Spec-driven security task synthesis enables scalable and controllable construction of security tasks from structured risk specifications, while SeClaw docker provides a standardized testbed for evaluating agent behavior under diverse safety-risk scenarios. The benchmark covers risks arising from resources, user tasks, environments, and intrinsic agent behaviors, and supports trajectory-aware assessment of unsafe actions beyond final responses. By bridging systematic task synthesis and reproducible security evaluation, SeClaw provides a practical foundation for measuring, diagnosing, and comparing security failures in autonomous LLM agents. The code is available at https://github.com/seclaw-eval/seclaw-eval.

Bullet Summary

  • The paper introduces SeClaw, a novel framework designed to evaluate security risks in autonomous large language model (LLM) agents by combining specification-driven security task synthesis with execution-based security evaluation.
  • SeClaw addresses limitations in existing benchmarks that rely on manually curated tasks by enabling scalable, continuous, and controllable construction of security tasks from structured risk specifications.
  • The framework simulates multi-turn agent-user interactions in isolated Docker environments, allowing reproducible and auditable benchmarking with detailed, trajectory-aware assessment of unsafe behaviors beyond just final agent outputs.
  • It targets comprehensive security risks affecting autonomous agents including resource risks (malicious tools/skills), task risks (such as jailbreak attacks), environment risks (adversarial inputs), and intrinsic agent behaviors.
  • SeClaw's two main stages are Security Task Synthesis—transforming abstract security risks into executable tasks through prototype synthesis, task instantiation, and validation—and Security Evaluation, which runs tasks in a sandbox to analyze agent behavior...

PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration

arXiv preprint arXiv Memory Poisoning Governance and Policy Agent-to-Agent Communication

Shuyu Zhang, Yaqi Shi, Lu Wang

Published 2026-05-28

Venue: arXiv

Open Source Record

Abstract

LLM multi-agent systems often coordinate through natural-language dialogue or loosely structured shared memory, making intermediate state difficult to validate, attribute, and audit. We introduce PatchBoard, a schema-grounded collaboration architecture that replaces inter-agent dialogue with validated JSON Patch mutations over a shared structured state. An Architect agent constructs a task-specific schema and workflow rules, while a deterministic kernel validates each proposed state mutation against schema constraints, role-specific write contracts, and runtime invariants before committing it transactionally. On 630 matched ALFWorld episodes, PatchBoard achieves an 84.6% success rate, compared with 30.8% for LangGraph and 61.6% for Flock, while reducing tokens per successful task to 45.5k, compared with 368.3k and 64.2k, respectively.

Bullet Summary

  • PatchBoard introduces a schema-grounded collaboration architecture for multi-agent systems that replaces natural language dialogue with validated JSON Patch mutations over a shared structured state to improve reliability and auditability.
  • An Architect agent defines a task-specific schema, worker roles with read/write contracts, and event-driven workflow rules; a deterministic kernel validates and atomically commits state mutations, ensuring schema adherence and role isolation.
  • On the ALFWorld benchmark with 630 episodes, PatchBoard achieves an 84.6% success rate, significantly outperforming LangGraph (30.8%) and Flock (61.6%), while drastically reducing token usage per successful task.
  • The deterministic kernel manages global state, schedules worker invocations, constructs budgeted role-specific views to limit context size and unauthorized data access, and maintains an audit log of all accepted and rejected state mutations for traceability.
  • PatchBoard's design integrates structured workflows, blackboard memory concepts, and structured output generation to deliver explicit and auditable state updates that prevent error propagation common in natural-language-based communication.

Security of OpenClaw Agents: Fundamentals, Attacks, and Countermeasures

arXiv preprint arXiv Memory Poisoning Trust and Identity Agent-to-Agent Communication

Yuntao Wang, Jianle Ba, Han Liu, Yanghe Pan, Jintao Wei, Zhou Su, Tom H. Luan, Linkang Du

Published 2026-05-25

Venue: arXiv

Open Source Record

Abstract

The rapid evolution of large language model (LLM)-driven autonomous agents has given rise to OpenClaw, a new class of open-source agent frameworks that operate as continuously running, skill-augmented systems with persistent memory, multi-channel interaction, and high degrees of autonomy. Such capabilities enable OpenClaw agents to autonomously execute complex, multi-step tasks and interact seamlessly with external applications, but simultaneously introduce a substantially enlarged attack surface. In particular, the combination of high-privilege operations and persistent memory exposes OpenClaw agents to various emerging threats, including skill poisoning, cognitive manipulation, multi-agent cascading failures, and supply-chain vulnerabilities. In this survey, we present a comprehensive study of the security landscape of OpenClaw agents. We first examine the general architecture and key characteristics that distinguish OpenClaw agents from traditional AI agent systems. We categorize existing security and privacy threats into a layered framework and analyze how vulnerabilities arise during agent reasoning, action execution, and external interaction. Representative defense mechanisms are also reviewed to draw the current defense landscape. Finally, several unresolved issues related to the reliability and trustworthiness of OpenClaw ecosystems are discussed.

Bullet Summary

  • OpenClaw agents represent a new generation of autonomous AI systems integrating large language models with persistent memory, multi-channel interactions, and modular skills to enable complex multi-step task execution and external application interactions.
  • The architecture of OpenClaw agents includes seven components organized into cognition, execution, and interaction layers, which collectively facilitate reasoning, action execution, and communications.
  • Unique security threats arise from OpenClaw agents' high-privilege operations and persistent memory, including skill poisoning, cognitive manipulation, cascading failures, and supply chain vulnerabilities in third-party skill repositories.
  • Memory and context poisoning attacks can subtly alter agent reasoning by injecting malicious content into long-term memory or retrieval databases, leading to persistent undesired behaviors.
  • Execution-layer threats include misuse and exploitation of tools, unauthorized external data transmission, unpinned dependencies, obfuscated malicious code, unexpected OS command executions, and cascading failures amplifying attack impacts.

OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences

arXiv preprint arXiv Memory Poisoning Orchestration Risk Governance and Policy

Kaixiang Wang, Jiong Lou, Zhaojiacheng Zhou, Jie Li

Published 2026-05-18

Venue: arXiv

Open Source Record

Abstract

Memory-augmented large language model (LLM) agents use iterative reflection and self-evolution to solve complex tasks, but these mechanisms introduce security risks. Existing agentic memory attacks require privileged access or explicit malicious content, making them detectable by advanced safety filters. This leaves a subtler attack surface underexplored: whether adversaries can induce agent to generate experiences that appear locally correct and semantically plausible yet induce harmful generalization during reflection. We find that reflective agents are vulnerable to such clean experiences, especially when paired with severe but plausible hypothetical consequences. Based on this observation, we introduce Obsessive Experience Poisoning (OEP), a low-privilege black-box attack requiring no direct control over the system prompt or memory database. OEP constructs adversarial clean edge-cases that combine locally correct solutions, non-transferable methods, and severe consequences, biasing reflection toward risk-averse rule formation. During memory consolidation, agents may over-trust self-generated reflections and distill localized experiences into high-priority but over-generalized rules, causing downstream failures. Evaluations across three domains show that OEP achieves ASR above 50\% with GPT-4o agents, and outperforms existing attacks under LLM auditing defense.

Bullet Summary

  • The paper identifies a subtle security vulnerability in memory-augmented large language model (LLM) agents that self-evolve through iterative reflection, where adversaries can inject locally correct but non-transferable experiences that cause harmful over-g...
  • Introduces Obsessive Experience Poisoning (OEP), a low-privilege black-box attack that constructs adversarial clean edge-case examples combined with severe hypothetical consequences (Adversarial Consequence Triplet) to bias agent reflection toward risk-aver...
  • OEP exploits psychological biases like loss aversion by attaching extreme negative consequences to ignoring adversarial methods, leading agents to prioritize flawed operational heuristics and causing downstream task failures.
  • The attack is implemented through user-level injections without requiring privileged access or malicious content, making it hard to detect by existing safety filters and LLM auditing defenses.
  • Extensive experiments across domains including mathematics, healthcare, and tool-use demonstrate OEP's effectiveness, achieving over 50% attack success rates against GPT-4o and GPT-5.4-based agents, outperforming prior memory attacks.

Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

Merged record merged scholarly record arXiv Memory Poisoning Benchmarks and Evaluation Prompt Injection

Ahmad Al-Tawaha, Shangding Gu, Peizhi Niu, Ruoxi Jia, Ming Jin

Published 2026-05-18

Venue: arXiv

Open Source Record

Abstract

Safety evaluations of memory-equipped LLM agents typically measure within-task safety: whether an agent completes a single scenario safely, often under adversarial conditions such as prompt injection or memory poisoning. In deployment, however, a single agent serves many independent tasks over a long horizon, and memory accumulated during earlier tasks can affect behavior on later, unrelated ones. Studying this regime requires evaluation along the temporal dimension across tasks: not whether an agent is safe at any single memory state, but how its safety profile changes as memory accumulates across many independent interactions. We call this failure mode temporal memory contamination. To isolate memory exposure from stream non-stationarity, we introduce a trigger-probe protocol that evaluates a fixed probe set against read-only memory snapshots at varying prefix lengths, together with a NullMemory counterfactual baseline for identifying memory-induced violations. We apply this protocol across three deployment scenarios spanning records, memos, forms, and email correspondence and eight memory architectures, and additionally on Claw-like AI agents, such as OpenClaw, using the platform's native memory mechanism. Memory-enabled agents consistently exceed the NullMemory baseline, and memory-induced violation rates show a robust upward trend with exposure length on both agent classes. Order-randomization experiments indicate that the effect is driven primarily by accumulated content rather than encounter order. Finally, a structural consequence of the event decomposition is that memory-induced risk is detectable from retrieval state before generation, which we confirm with a high-recall diagnostic monitor. Our results argue for treating memory safety as a longitudinal property that requires temporal evaluation, not a single-state property that can be captured by a snapshot.

Bullet Summary

  • The paper addresses longitudinal safety risks in memory-equipped large language model (LLM) agents, focusing on how accumulated memory across multiple independent tasks can induce safety violations over time, a phenomenon termed temporal memory contamination.
  • A novel trigger-probe evaluation protocol is introduced, which assesses agents’ safety by presenting fixed probe inputs against varying-length read-only memory snapshots, isolating memory-induced violations from other environmental changes using a NullMemor...
  • Experiments are conducted over diverse deployment scenarios—including office assistants handling records, memos, and email correspondence—and various memory architectures, as well as Claw-like AI agents, demonstrating that memory-induced violation rates con...
  • Findings reveal that the increased risk stems primarily from cumulative memory content rather than the order in which memory items are encountered, and memory architecture design—such as scope of retrieval and summarization level—significantly influences th...
  • A key contribution is establishing that memory safety must be treated as a longitudinal, temporal property requiring evaluation across time and memory states, rather than a single snapshot assessment.

Taming "Zombie'' Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

Merged record merged scholarly record arXiv Governance and Policy Memory Poisoning

Taolin Zhang, Pukun Zhao, Qizhou Chen, Jiuheng Wan, Chen Chen, Xiaofeng He, Chengyu Wang, Richang Hong

Published 2026-05-17

Venue: arXiv

Open Source Record

Abstract

Recent advancements in LLM-based multi-agent systems have demonstrated remarkable collaborative capabilities across complex tasks. To improve overall efficiency, existing methods often rely on aggressive graph evolution among agents (e.g., node or edge pruning), which risks prematurely discarding valuable agents due to transient issues such as hallucinations or temporary knowledge gaps. However, such hard pruning overlooks the potential for ``zombie'' agents to recover and contribute in subsequent discussion rounds. In this paper, we propose AgentRevive, a Markov state-aware framework for resilient multi-agent evolution. Our approach dynamically manages agent collaboration through soft state transitions, implemented via two key components: (1) State-Aware Policy Learning: Agent states are divided into ``Active'', ``Standby'', and ``Terminated'' states, selectively propagating messages based on agent memory. The policy employs a risk estimator to optimize agent state transitions by assessing hallucination risk, minimizing the influence of unreliable nodes while safeguarding valuable ones. (2) State-Aware Edge Optimization: Subgraph edges are pruned according to states learned from the policy, permanently removing ``Terminated'' nodes and retaining ``Standby'' nodes for subsequent rounds to assess their potential future contributions. Extensive experiments on general reasoning, domain-specific, and hallucination challenge tasks show that our method consistently outperforms strong baselines and significantly reduces token consumption through state-aware agent scheduling.

Bullet Summary

  • Introduces AgentRevive, a novel Markov state-aware framework for resilient multi-agent systems (MAS) that dynamically manages agent states as Active, Standby, or Terminated to handle transient failures and hallucinations without permanently pruning potentia...
  • Develops State-Aware Policy Learning using a risk estimator based on hallucination risk (measured by KL divergence) to optimize agent state transitions, minimizing unreliable agents' influence while preserving useful ones.
  • Implements State-Aware Edge Optimization that prunes subgraph edges by removing Terminated nodes completely while retaining Standby nodes for potential reactivation, leading to efficient communication graphs.
  • Employs a multi-step collaboration protocol on directed acyclic graphs with temporal and spatial edges, enabling flexible multi-round agent communications and prompt construction with dynamic message passing.
  • Evaluates AgentRevive on diverse benchmarks including general reasoning (MMLU), domain-specific tasks (GSM8K), and hallucination challenges (TruthfulQA), showing consistent improvements in accuracy (+2.33%) and substantial token consumption reduction (~15%)...

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

arXiv preprint arXiv Agent-to-Agent Communication Orchestration Risk Memory Poisoning

Sajjad Khan

Published 2026-05-16

Venue: arXiv

Open Source Record

Abstract

Concurrent LLM agents sharing mutable natural-language state produce Structural Race Conditions (SRCs): write-write and cross-shard stale-read conflicts that silently corrupt agent output. Existing multi-agent frameworks (LangGraph, CrewAI, AutoGen) provide no write-ownership semantics over shared state. We present S-Bus, an HTTP middleware whose central mechanism is a server-side DeliveryLog: a per-agent log of HTTP GET operations that automatically reconstructs each agent's read set at commit time without agent SDK changes under HTTP/1.1. The consistency property the DeliveryLog provides -- Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable projection of the read set -- prevents structural race conditions when agents collaborate via shared shards. Three contributions: (C1) The DeliveryLog mechanism for automatic HTTP-traffic-based read-set reconstruction, with three-tier mechanised evidence: ReadSetSoundness and ORICommitSafety machine-checked in TLAPS (modulo one retained typing axiom); exhaustive TLC at N=3 (20,763,484 distinct states, zero violations); Dafny discharges 9 inductive soundness lemmas. (C2) Empirical structural-conflict prevention parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI on shared-shard contention sweeps with 427,308 active HTTP-409 conflicts: zero Type-I corruptions across all three backends. (C3) ORI's operating envelope is topology-conditional: semantically neutral in dedicated-shard workloads; harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. Source code: https://github.com/sajjadanwar0/sbus

Bullet Summary

  • Concurrent large language model (LLM) agents sharing mutable natural-language state often face Structural Race Conditions (SRCs), such as write-write conflicts and stale reads, which corrupt outputs silently.
  • Existing multi-agent frameworks lack write-ownership semantics and explicit concurrency controls, leading to undetected state conflicts during collaboration.
  • S-Bus introduces DeliveryLog, an HTTP middleware mechanism that automatically reconstructs each agent's read set from logged HTTP GET operations at commit time without needing client-side SDK changes.
  • DeliveryLog enforces a novel consistency property, Observable-Read Isolation (ORI), a form of partial causal consistency ensuring structural race condition prevention under dedicated-shard workloads.
  • S-Bus and ORI are formally verified with mechanized proofs using TLA+, TLAPS, and Dafny, proving read-set soundness and commit safety, supported by exhaustive state model checking.

State Contamination in Memory-Augmented LLM Agents

arXiv preprint arXiv Memory Poisoning Governance and Policy

Yian Wang, Agam Goyal, Yuen Chen, Hari Sundaram

Published 2026-05-16

Venue: arXiv

Open Source Record

Abstract

LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory-augmented agents should be treated as a state-control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.

Bullet Summary

  • The paper identifies 'memory laundering' as a failure mode in memory-augmented LLM agents, where toxic content is compressed into seemingly benign memory summaries that still influence future toxic behavior.
  • A novel metric, the sub-threshold propagation gap (SPG), is introduced to quantify hidden toxic influence that remains below standard detection thresholds but affects downstream generations.
  • Toxicity propagates through two main channels: overt toxicity from raw transcript reuse and hidden sub-threshold toxicity from compressed memory states.
  • Mitigation requires sanitizing toxic content before memory summarization; sanitizing only the final summary often fails to remove latent toxic influence.
  • The study proposes a three-pathway defense framework targeting transcript sanitization/gating, memory rewriting/gating, and parameter fine-tuning to reduce toxicity propagation.

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

arXiv preprint arXiv Memory Poisoning Benchmarks and Evaluation

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Published 2026-05-15

Venue: arXiv

Open Source Record

Abstract

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.

Bullet Summary

  • FORGE is a novel, staged, population-based protocol that enables large language model (LLM) agents to self-improve decision-making by evolving prompt-injected natural-language memory without gradient updates.
  • The method integrates an inner Reflexion loop where a dedicated reflection agent converts failed task trajectories into reusable textual knowledge artifacts—rules, examples, or mixed—which are injected into the agent's memory to guide retries.
  • An outer loop operates over a population of agent instances, broadcasting the best-performing agent's memory (champion) to all other agents and freezing converged agents through a graduation criterion to ensure stability and compute efficiency.
  • Evaluations were conducted on CybORG CAGE-2, a challenging stochastic partially observable Markov decision process (POMDP) simulating network defense against a baseline attacker, across multiple LLM families and memory types.
  • FORGE significantly outperforms both zero-shot baselines and isolated Reflexion baselines, improving average returns by 1.7 to 7.7 times over zero-shot and 29 to 72% over Reflexion, while drastically reducing catastrophic failure rates to nearly 1%.

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Semantic Scholar · Proceedings of the ACM Conference on AI and Agentic Systems scholarly work Semantic Scholar Memory Poisoning Agent-to-Agent Communication Benchmarks and Evaluation

I. Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Published 2026-05-15

Venue: Proceedings of the ACM Conference on AI and Agentic Systems

DOI: 10.1145/3786335.3813155

Open Source Record

Abstract

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.

Bullet Summary

  • Addresses whether large language model (LLM) agents can enhance decision-making through self-generated memory without requiring gradient updates.
  • Introduces FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves natural-language memory injected via prompts for hierarchical ReAct agents.
  • FORGE employs an inner Reflection loop that converts failed task trajectories into reusable knowledge in the form of textual heuristics (Rules), few-shot demonstrations (Examples), or a mixture of both, without using a stronger or distilled model.
  • An outer loop broadcasts the best-performing agent's memory across a population between stages and graduates converged agents to freeze learning and conserve compute.
  • Evaluated in the CybORG CAGE-2 environment, a stochastic network-defense POMDP scenario against a B-line attacker, testing four LLM families with heavily negative zero-shot baseline performance.

LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

arXiv preprint arXiv Governance and Policy Memory Poisoning

Minbeom Kim, Lesly Miculicich, Bhavana Dalvi Mishra, Mihir Parmar, Phillip Wallis, Bharath Chandrasekhar, Kyomin Jung, Tomas Pfister

Published 2026-05-14

Venue: arXiv

Open Source Record

Abstract

As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency--performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.

Bullet Summary

  • LiSA addresses the challenge of adapting AI safety guardrails in deployment environments where feedback is sparse, noisy, and cannot support repeated fine-tuning of base models.
  • The method transforms sparse failure reports into reusable broad policy abstractions and selectively induces conflict-aware local policies to refine decisions in ambiguous, mixed-label regions.
  • Confidence gating based on a Beta posterior lower bound conservatively filters broad policy inductions, ensuring adaptation is driven by robust accumulated evidence rather than transient empirical accuracy.
  • LiSA operates in alternating online-offline phases, accumulating user-reported corrections during deployment and periodically refreshing a structured policy memory away from critical inference paths.
  • In benchmarks simulating real-world deployment scenarios (PrivacyLens+, ConFaide+, AgentHarm), LiSA outperforms strong baselines by maintaining safety under sparse and noisy feedback, including up to 20% label noise.

MemLineage: Lineage-Guided Enforcement for LLM Agent Memory

arXiv preprint arXiv Memory Poisoning Trust and Identity

Ciyan Ouyang, Rui Hou

Published 2026-05-14

Venue: arXiv

Open Source Record

Abstract

We introduce MemLineage, a defense for LLM agent memory that attaches both cryptographic provenance and LLM-mediated derivation lineage to every entry. Recent and concurrent work shows that untrusted content can be written into persistent agent state and re-enter later sessions as an instruction; the remaining systems question is how to preserve useful memory recall while preventing such state from justifying sensitive actions. MemLineage treats this as a chain-of-custody problem rather than a filtering problem. It is a six-module design around an RFC-6962 Merkle log over per-principal Ed25519-signed entries: a weighted derivation DAG records which retrieved entries influenced each new memory, and a max-of-strong-edges propagation rule makes Untrusted-Path Persistence hold for any chain whose attribution edges remain above threshold. The sensitive-action gate then refuses dispatches whose active justification descends from an external ancestor, while still allowing benign recall. We evaluate three defense cells against three memory-poisoning workloads on a deterministic mechanism-isolation harness; MemLineage is the only configuration in that harness that drives all three columns to zero ASR, while sub-millisecond per-operation overhead keeps it well below the noise floor of any LLM call. A Codex-backed AgentDojo bridge further separates strong-model behavior from defense-layer behavior: under an intentionally vulnerable tool-output profile, no-defense and signature-only baselines fail on all six banking pairs, while all MemLineage rows reduce strict AgentDojo ASR to zero. The core deterministic artifacts are byte-equal CI-verified; hosted-model AgentDojo and live-model sweeps are recorded as auditable logs rather than byte-pinned artifacts.

Bullet Summary

  • MemLineage addresses the problem of persistent memory poisoning in multi-agent LLM systems by attaching cryptographic provenance and LLM-mediated derivation lineage to each memory entry, treating memory integrity as a chain-of-custody problem instead of sim...
  • The system employs a six-module design including an RFC-6962 Merkle log of Ed25519-signed entries and a weighted derivation directed acyclic graph (DAG) to track influence between entries, enabling detection of untrusted ancestry and lineage propagation.
  • MemLineage enforces an Untrusted-Path Persistence property with a max-of-strong-edges trust propagation rule, preventing sensitive actions when justified by untrusted memory lineage, while allowing benign recall to maintain utility.
  • Evaluations against three representative memory-poisoning workloads show that MemLineage uniquely achieves zero Attack Success Rate (ASR), outperforming signature-only baselines which fail against laundering and retrieval-based attacks.
  • The framework includes formal trust propagation rules, multiple attribution algorithms (Coarse, LmSelfEval, AttnAttr), a sensitive-action gate, and an authority-repair mechanism that can selectively rewrite attacker-injected parameters to recover benign uti...

MemLineage: Lineage-Guided Enforcement for LLM Agent Memory

Semantic Scholar · Semantic Scholar scholarly work Semantic Scholar Memory Poisoning Trust and Identity

Ciyan Ouyang, Rui Hou

Published 2026-05-14

Venue: Semantic Scholar

Open Source Record

Abstract

We introduce MemLineage, a defense for LLM agent memory that attaches both cryptographic provenance and LLM-mediated derivation lineage to every entry. Recent and concurrent work shows that untrusted content can be written into persistent agent state and re-enter later sessions as an instruction; the remaining systems question is how to preserve useful memory recall while preventing such state from justifying sensitive actions. MemLineage treats this as a chain-of-custody problem rather than a filtering problem. It is a six-module design around an RFC-6962 Merkle log over per-principal Ed25519-signed entries: a weighted derivation DAG records which retrieved entries influenced each new memory, and a max-of-strong-edges propagation rule makes Untrusted-Path Persistence hold for any chain whose attribution edges remain above threshold. The sensitive-action gate then refuses dispatches whose active justification descends from an external ancestor, while still allowing benign recall. We evaluate three defense cells against three memory-poisoning workloads on a deterministic mechanism-isolation harness; MemLineage is the only configuration in that harness that drives all three columns to zero ASR, while sub-millisecond per-operation overhead keeps it well below the noise floor of any LLM call. A Codex-backed AgentDojo bridge further separates strong-model behavior from defense-layer behavior: under an intentionally vulnerable tool-output profile, no-defense and signature-only baselines fail on all six banking pairs, while all MemLineage rows reduce strict AgentDojo ASR to zero. The core deterministic artifacts are byte-equal CI-verified; hosted-model AgentDojo and live-model sweeps are recorded as auditable logs rather than byte-pinned artifacts.

Bullet Summary

  • Introduces MemLineage, a defense mechanism for LLM agent memory that integrates cryptographic provenance and derivation lineage into every memory entry to prevent malicious state poisoning.
  • Addresses the challenge of preserving useful memory recall while preventing untrusted persistent state from enabling sensitive actions in LLM agents.
  • Models the problem as a chain-of-custody issue rather than a filtering problem, using a six-module design centered around an RFC-6962 Merkle log with per-principal Ed25519-signed entries.
  • Employs a weighted derivation directed acyclic graph (DAG) to track which previous entries influence new memory, enabling propagation rules that detect untrusted data paths.
  • Implements a sensitive-action gate that blocks dispatches justified by external untrusted ancestors but allows benign memory recall to proceed.

Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw

arXiv preprint arXiv Memory Poisoning Benchmarks and Evaluation Orchestration Risk

Hongwei Yao, Yiming Liu, Yiling He, Bingrun Yang

Published 2026-05-11

Venue: arXiv

Open Source Record

Abstract

Agentic language-model systems increasingly rely on mutable execution contexts, including files, memory, tools, skills, and auxiliary artifacts, creating security risks beyond explicit user prompts. This paper presents DeepTrap, an automated framework for discovering contextual vulnerabilities in OpenClaw. DeepTrap formulates adversarial context manipulation as a black-box trajectory-level optimization problem that balances risk realization, benign-task preservation, and stealth. It combines risk-conditioned evaluation, multi-objective trajectory scoring, reward-guided beam search, and reflection-based deep probing to identify high-value compromised contexts. We construct a 42-case benchmark spanning six vulnerability classes and seven operational scenarios, and evaluate nine target models using attack and utility grading scores. Results show that contextual compromise can induce substantial unsafe behavior while preserving user-facing task completion, demonstrating that final-response evaluation is insufficient. The findings highlight the need for execution-centric security evaluation of agentic AI systems. Our code is released at: https://github.com/ZJUICSR/DeepTrap

Bullet Summary

  • Agentic language-model systems like OpenClaw rely on mutable execution contexts (files, memory, tools), exposing new security risks beyond user prompts.
  • DeepTrap is an automated framework designed to discover contextual vulnerabilities by formulating adversarial context manipulation as a black-box, multi-objective optimization problem balancing risk realization, task preservation, and stealth.
  • The framework utilizes reward-guided beam search, multi-objective trajectory-level scoring, and reflection-based deep probing to efficiently generate and evaluate adversarial context modifications that compromise agent safety.
  • A 42-case benchmark spanning six vulnerability classes (harness hijacking, obfuscated coding, unauthorized operations, supply-chain compromise, tool abuse, data exfiltration) and seven operational scenarios was constructed to evaluate the security of agenti...
  • Experimental evaluation of nine target models shows that adversarial context compromises can induce significant unsafe behaviors while preserving user-facing task completion, revealing that evaluating only final responses is insufficient.

Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents

arXiv preprint arXiv Memory Poisoning Trust and Identity

Chunxiao Wang

Published 2026-05-11

Venue: arXiv

Open Source Record

Abstract

Production LLM coding agents drift over long sessions: they forget user-specified constraints, slip into mistakes the user already flagged, and confabulate prior agreements. White-box approaches such as persona vectors require model weights and so cannot be applied to closed APIs (Claude, GPT-4) that most users actually interact with. We present Nautilus Compass, a black-box persona drift detector and agent memory layer for production coding agents. The method operates entirely at the prompt-text layer: cosine similarity between user prompts and behavioral anchor texts, aggregated by a weighted top-k mean using BGE-m3 embeddings. Compass is, to our knowledge, the only public agent memory layer (among Mem0, Letta, Cognee, Zep, MemOS, smrti verified May 2026) that does not call an LLM at index time to extract facts or build a graph; raw conversation text is embedded directly. The system ships as a Claude Code plugin, an MCP 2024-11-05 A2A server (Cursor, Cline, Hermes), a CLI, and a REST API on one daemon, with a Merkle-chained audit log for tamper-evident anchor updates. On a held-out test set built from real Claude Code session traces and labeled by an independent LLM judge, Compass reaches ROC AUC 0.83 for drift detection. The embedded retrieval pipeline scores 56.6% on LongMemEval-S v0.8 and 44.4% on EverMemBench-Dynamic (n=500), topping the four published EverMemBench Table 4 baselines. LongMemEval-S 56.6% is ~30 points below recent white-box leaders (90+%); we treat that as the architectural ceiling of the no-extraction design. End-to-end reproduction cost is $3.50 (~14x cheaper than GPT-4o-judged stacks). A paired cross-vendor behavior A/B accompanies these numbers as preliminary system-level evidence. Code, anchors, frozen test data, and audit-log tooling are MIT-licensed at github.com/chunxiaoxx/nautilus-compass.

Bullet Summary

  • Nautilus Compass addresses the challenge of persona drift in production large language model (LLM) coding agents by detecting behavioral inconsistencies during long sessions without access to model internals, enabling deployment on closed API models like Cl...
  • The method operates entirely in a black-box manner at the prompt-text layer by computing cosine similarities between user prompts and curated positive and negative 'behavioral anchors' using BGE-m3 sentence embeddings, with a weighted top-k mean aggregation...
  • Nautilus Compass significantly improves drift detection accuracy, achieving a ROC AUC of 0.83 on a held-out test set derived from real Claude Code session traces, outperforming existing black-box memory systems in benchmarks such as LongMemEval-S and EverMe...
  • Key innovations include authoring task-shaped anchors matching user prompt grammatical structure to boost detection from near-random to strong performance, iterative refinement with hard false-positive examples, and adoption of weighted top-k mean scoring f...
  • The system is open-source with reproducible code, anchors, frozen evaluation data, and audit-log tooling under an MIT license, facilitating transparency, extensibility, and community-driven improvement in black-box LLM agent memory and persona management.

ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?

Semantic Scholar · Semantic Scholar scholarly work Semantic Scholar Benchmarks and Evaluation Memory Poisoning

Zhun Wang, Nico Schiller, Hongwei Li, Srijiith Sesha Narayana, Milad Nasr, Nicholas Carlini, Xiangyu Qi, Eric Wallace

Published 2026-05-11

Venue: Semantic Scholar

Open Source Record

Abstract

AI agents are rapidly gaining capabilities that could significantly reshape cybersecurity, making rigorous evaluation urgent. A critical capability is exploitation: turning a vulnerability, which is not yet an attack, into a concrete security impact, such as unauthorized file access or code execution. Exploitation is a particularly challenging task because it requires low-level program reasoning (e.g., about memory layout), runtime adaptation, and sustained progress over long horizons. Meanwhile, it is inherently dual-use, supporting defensive workflows while lowering the barrier for offense. Despite its importance and diagnostic value, exploitation remains under-evaluated. To address this gap, we introduce ExploitGym, a large-scale, diverse, realistic benchmark on the exploitation capabilities of AI agents. Given a program input that triggers a vulnerability, ExploitGym tasks agents with progressively extending it into a working exploit. The benchmark comprises 898 instances sourced from real-world vulnerabilities across three domains, including userspace programs, Google's V8 JavaScript engine, and the Linux kernel. We vary the security protections applied to each instance, isolating their impact on agent performance. All configurations are packaged in reproducible containerized environments. Our evaluation shows that while exploitation remains challenging, frontier models can successfully exploit a non-trivial fraction of vulnerabilities. For example, the strongest configurations are Anthropic's latest model Claude Mythos Preview and OpenAI's GPT-5.5, which produce working exploits for 157 and 120 instances, respectively. Notably, even with widely used defenses enabled, models retain non-trivial success rates. These results establish ExploitGym as an effective testbed for exploitation and highlight the growing cybersecurity risks posed by increasingly capable AI agents.

Bullet Summary

  • ExploitGym addresses the under-evaluated task of exploitation, where AI agents turn known vulnerabilities into concrete attacks like unauthorized access or code execution.
  • The benchmark includes 898 real-world vulnerability instances from userspace programs, Google's V8 JavaScript engine, and the Linux kernel, providing a diverse and realistic testing ground.
  • ExploitGym requires AI agents to incrementally extend a triggering input into a functional exploit, demanding sophisticated low-level program reasoning, runtime adaptation, and long-term planning.
  • The benchmark tests the impact of different security protections on exploitation success, with each scenario provided in reproducible containerized environments for rigorous evaluation.
  • Evaluations reveal that advanced AI models, such as Anthropic's Claude Mythos Preview and OpenAI's GPT-5.5, can successfully exploit a significant subset of vulnerabilities, demonstrating real risks.

Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

Merged record merged scholarly record arXiv Memory Poisoning Trust and Identity Agent-to-Agent Communication

Ben Kereopa-Yorke, Guillermo Diaz, Holly Wright, Reagan Johnston, Ron F. Del Rosario, Timothy Lynar

Published 2026-05-10

Venue: arXiv

Open Source Record

Abstract

We define Oracle Poisoning, an attack class in which an adversary corrupts a structured knowledge graph that AI agents query at runtime via tool-use protocols, causing incorrect conclusions through correct reasoning. Unlike prompt injection, Oracle Poisoning manipulates the data agents reason over, not their instructions. We demonstrate six attack scenarios against a production 42-million-node code knowledge graph, providing the first empirical demonstration of knowledge graph poisoning against a production-scale agentic system, distinct from CTI embedding poisoning. Primary evaluation uses real SDK tool-use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results. The result is unambiguous: every tested model trusts poisoned data at 100% at moderate attacker sophistication(L2), with 269 valid trials (of 270) accepting fabricated security claims under directed queries. Under open-ended prompts, trust drops to 3-55%, confirming prompt framing as a confound; we report both conditions. An attacker sophistication gradient reveals discrete break points, a minimum skill at which trust flips from 0% to 100%, reframing the attack as a question not of whether but of how much. A controlled delivery-mode comparison shows that inline evaluation produces false negatives: GPT-5.1 shows 0% trust inline but 100% under both simulated and real agentic tool-use, demonstrating that delivery mode is a first-order confound. We evaluate five defences; read-only access control eliminates the direct mutation vector, while the remaining four are partial and model-dependent. Analysis of four additional platforms suggests the attack may generalise across the knowledge-graph ecosystem.

Bullet Summary

  • Oracle Poisoning is a novel attack class where adversaries corrupt structured knowledge graphs queried by AI agents at runtime, causing incorrect conclusions through accurate reasoning over falsified data.
  • This attack differs fundamentally from prompt injection and retrieval-augmented generation (RAG) poisoning by targeting the knowledge graph data itself rather than AI model instructions or retrieval texts.
  • Empirical evaluation involving nine large language models (LLMs) from three major providers shows universal 100% trust in poisoned graph data under real SDK tool-use scenarios at moderate attacker sophistication (L2), demonstrating AI agents' blind trust in...
  • Six attack scenarios illustrate the severity of Oracle Poisoning across applications like supply chain attacks, security evasion, and code generation, exploiting AI agents' implicit trust in graph query results in production-scale knowledge graphs with tens...
  • Delivery mode critically influences susceptibility assessments; inline evaluation underestimates vulnerability compared to real agentic tool-use pipelines, as shown by GPT-5.1 trusting poisoned data only under agentic API calls.

Portable Agent Memory: A Protocol for Cryptographically-Verified Memory Transfer Across Heterogeneous AI Agents

arXiv preprint arXiv Prompt Injection Memory Poisoning Trust and Identity

Santhosh Kumar Ravindran

Published 2026-05-10

Venue: arXiv

Open Source Record

Abstract

We present Portable Agent Memory, an open protocol and reference implementation for transferring persistent memory state across heterogeneous AI agents. Modern AI agents accumulate rich context -- episodic events,semantic knowledge, procedural skills, working state, and identity preferences -- but this context remains locked within vendor-specific runtimes. Portable Agent Memory addresses this through: (1) a five-component structured memory model with content-addressable entries linked by a Merkle-DAG provenance graph providing tamper-evidence; (2) capability-based access control enabling selective, scoped disclosure of memory segments; (3) an injection-resistant rehydration protocol that adapts recalled content to heterogeneous target models while mitigating indirect prompt injection; and (4) a JSON-first serialization format with optional CBOR compaction for efficient transport. We provide a Python SDK with 54 passing tests, agent skills for multiple platforms, and demonstrate cross-model memory transfer between GPT-4, Claude, Gemini, and Llama architectures. The protocol is open-source under Apache 2.0.

Bullet Summary

  • Introduces Portable Agent Memory (PAM), an open protocol enabling secure, portable transfer of persistent memory across heterogeneous AI agents to overcome vendor lock-in and session amnesia.
  • Defines a comprehensive five-component memory model capturing episodic, semantic, procedural, working, and identity memory components essential for agent context preservation.
  • Utilizes a Merkle-DAG provenance graph with content-addressable entries (using BLAKE3 hashes) to ensure cryptographic tamper-evidence and integrity verification of memory data.
  • Implements capability-based access control with cryptographically signed, scoped tokens to allow fine-grained, selective, and revocable memory sharing in multi-agent environments.
  • Develops an injection-resistant re-hydration protocol that verifies memory integrity, filters by access tokens, ranks relevance, compresses data, and frames context to prevent prompt injections during cross-model memory integration.

Position: AI Security Policy Should Target Systems, Not Models

arXiv preprint arXiv Prompt Injection Memory Poisoning Agent-to-Agent Communication

Michael A. Riegler, Inga Strümke

Published 2026-05-10

Venue: arXiv

Open Source Record

Abstract

We present swarm-attack, an open-source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both safety bypass of frontier models and software vulnerability discovery, i.e., the capability class that motivated restricted release of Anthropic's Mythos Preview, are achievable at effectively zero cost using commodity hardware and openly available models. We report two experiments. In the first, five instances of a 1.2 billion parameter model conducted 225 jailbreak attacks each against GPT-4o and Claude Sonnet~4. Against GPT-4o, the swarm achieved an Effective Harm Rate of 45.8%, producing 49 critical-severity breaches; against Claude Sonnet-4, the Effective Harm Rate was 0% despite a 40% technical success rate. In the second experiment, the same models performed combined source code analysis and binary fuzzing against a vulnerable C application with 9 planted CWEs. With a hand-crafted exploit seed corpus, regex pattern detection, and AddressSanitizer-based crash classification, the pipeline recovers 9 of 9 vulnerabilities (100% recall) in approximately four minutes on a consumer MacBook. With those scaffold components disabled, the same model recovers 0 of 9 by crash verification and 2 of 9 by citation. The capability class that motivated restricted release of Anthropic's Mythos Preview is therefore reproducible at effectively zero cost; the important enabler is the system scaffold itself, which compensates for the limited reasoning capacity of small individual models.

Bullet Summary

  • AI security policies should prioritize assessing system-level capabilities rather than imposing access restrictions on individual frontier language models, as offensive capabilities mainly emerge from multi-agent frameworks surrounding these models.
  • The paper introduces swarm-attack, an open-source framework where multiple lightweight LLM agents collaborate via shared memory and evolutionary optimization, demonstrating that safety bypass and software vulnerability discovery can be performed effectively...
  • Experiment one showed that five instances of a 1.2 billion parameter model coordinated to carry out 225 jailbreak attempts each against GPT-4o, achieving a 45.8% Effective Harm Rate and causing 49 critical security breaches, while the same approach had no h...
  • In the second experiment, the same multi-agent system, combined with regex pattern detection and AddressSanitizer instrumentation, successfully identified all nine planted vulnerabilities in a vulnerable C application within minutes on a consumer-grade MacB...
  • Results underline that offensive AI threats depend more on system design, coordination strategies, and architectural scaffolds than on the size or inherent capability of any single large language model.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

arXiv preprint arXiv Memory Poisoning Benchmarks and Evaluation

Yang Luo, Zifeng Kang, Tiantian Ji, Xinran Liu, Yong Liu, Shuyu Li, Lingyun Peng

Published 2026-05-09

Venue: arXiv

Open Source Record

Abstract

Graph-based agent memory is increasingly used in LLM agents to support structured long-term recall and multi-hop reasoning, but it also creates a new poisoning surface: an attacker can inject a crafted relation into graph memory so that it is later retrieved and influences agent behavior. Existing agent-memory poisoning attacks mainly target flat textual records and are ineffective in graph-based memory because malicious relations often fail to be extracted, merged into the target anchor neighborhood, or retrieved for the victim query. We present SHADOWMERGE, a poisoning attack against graph-based agent memory that exploits relation-channel conflicts. Its key insight is that a poisoned relation can share the same query-activated anchor and canonicalized relation channel as benign evidence while carrying a conflicting value. To realize this, we design AIR, a pipeline that converts the conflict into an ordinary interaction that can be extracted, merged, and retrieved by the graph-memory system. We evaluate SHADOWMERGE on Mem0 and three public real-world datasets: PubMedQA, WebShop, and ToolEmu. SHADOWMERGE achieves 93.8% average attack success rate, improving the best baseline by 50.3 absolute points, while having negligible impact on unrelated benign tasks. Mechanism studies show that SHADOWMERGE overcomes the three key limitations of existing agent-memory poisoning attacks, and defense analysis shows that representative input-side defenses are insufficient to mitigate it. We have responsibly disclosed our findings to affected graph-memory vendors and open sourced SHADOWMERGE.

Bullet Summary

  • Utility Agent reasoning GPT-5.5 reference 0.913 +0.000 0.960 Agent reasoning DeepSeek-V4 Pro 0.967 +0.054 0.889 Agent reasoning Gemini-3.1 Pro 0.953 +0.040 0.939 Graph extraction GPT-5.5 reference 0.913 +0.000 0.960 Graph extraction DeepSeek-V4 Pro 0.860 -0...
  • Despite these advantages, graph-based agent memory also introduces a new attack surface in which an attacker can craft a poisoned relation that is extracted into graph memory and later retrieved to influence subsequent agent behavior.
  • This read–write loop lets an agent carry experience across sessions, but it also turns memory construction into a security-relevant component of the agent pipeline.
  • However, this attack surface remains unexplored in existing security research on agent-memory poisoning, as existing attacks mainly target flat textual records and are ineffective in graph-based agent memory because malicious relations often fail to be extr...
  • However, graph-based agent memory also introduces a new attack surface.

When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

arXiv preprint arXiv Trust and Identity Memory Poisoning Benchmarks and Evaluation

Strick Sheng, Ziyue Wang, Liyi Zhou

Published 2026-05-09

Venue: arXiv

Open Source Record

Abstract

Large language model agents increasingly operate through environment-facing scaffolds that expose files, web pages, APIs, and logs. These observations influence tool use, state tracking, and action sequencing, yet their reliability and authority are often uncertain. Environmental grounding is therefore a systems-level problem involving context admission, evidence provenance, freshness checking, verification policy, action gating, and model reasoning. Existing agent benchmarks mainly evaluate task capability or specific attacks such as prompt injection and memory poisoning, but they under-specify a fundamental reliability question: whether agents remain grounded in the true environment state when observations are stale, incorrect, or malicious. We introduce EnvTrustBench, an agentic framework for benchmarking this failure mode. We define an evidence-grounding defect (EGD) as a behavioral failure in which an agent treats an environment-facing claim as sufficient evidence for action without resolving it against available current evidence, leading to a task-incorrect false path under the true environment state. Given a task scenario, EnvTrustBench generates the workspace, environment, agent-facing objective, and validation oracle, executes the evaluated agent, records its action-observation trajectory and final state, and applies the oracle to produce a verdict. Using 6 LLM backbones and 5 widely used scaffolds, we evaluate 55 generated cases across 11 task scenarios, with each scenario expanded through five feedback-guided generation iterations. Results show that EGDs consistently emerge across operational workflows, highlighting environmental grounding as a core agent reliability problem with important security implications.

Bullet Summary

  • Large language model (LLM) agents rely on environmental evidence from files, APIs, logs, etc., which can be stale, incorrect, or malicious, posing reliability and security risks.
  • An Evidence-Grounding Defect (EGD) is identified as a critical failure mode where an agent overtrusts environmental claims without verifying them against current evidence, leading to incorrect actions.
  • EnvTrustBench is introduced as an extensible agentic benchmarking framework that generates task scenarios, controlled environments, and validation oracles to systematically detect and evaluate EGDs in LLM agents.
  • The framework models agent runtime state and assumes a threat model where environmental claims can be manipulated adversarially, while workspace and tools remain trusted.
  • Extensive evaluation using 6 LLM backbones, 5 scaffolds, and 11 task scenarios (3,850 trials) reveals a high Environmental Misgrounding Rate (EMR) averaging 83.3%, showing EGDs are widespread and a significant reliability problem.

Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification

arXiv preprint arXiv Trust and Identity Governance and Policy Memory Poisoning

Sarah Wilson, Diem Linh Dang, Usman Ali Moazzam, Shan Ye, Gail Kaiser

Published 2026-05-08

Venue: arXiv

Open Source Record

Abstract

Autonomous AI agents are increasingly deployed in open social environments, yet the relationship between their configuration specifications and their emergent social behavior remains poorly understood. We present a controlled, multi-factor empirical study in which thirteen OpenClaw agents are deployed on Moltbook -- a Reddit-like social network built for AI agents -- across three systematically varied independent variables: (1) personality specification via SOUL.md, (2) underlying LLM model backbone, and (3) operational rules and memory configuration via AGENTS.md. A default control agent provides a behavioral baseline. Over a one-week observation window spanning approximately 400 autonomous sessions per agent, we collect behavioral, linguistic, and social metrics to assess how configuration layers predict emergent social behavior. We find that personality specification is the dominant behavioral lever, producing a massive spread in response length across agents, while model backbone and operational rules drive more moderate but still meaningful effects on rhetorical style and topic engagement breadth. Our findings contribute empirical evidence to the emerging literature on deployed multi-agent social systems and offer practical guidance for designing agents intended for collaborative or monitoring tasks in real social environments.

Bullet Summary

  • The paper investigates how different configuration layers—personality specification, LLM model backbone, and operational rules/memory—affect the emergent social behavior of autonomous AI agents deployed in a realistic social network environment.
  • Experiments were conducted using thirteen OpenClaw agents on Moltbook, a Reddit-like AI social platform, across around 400 autonomous sessions per agent over one week, collecting behavioral, linguistic, and social engagement data.
  • Personality specification via SOUL.md exerts the strongest influence on agent behavior, notably impacting response length, verbosity, and rhetorical style, aligning with distinct personality profiles (Big Five/OCEAN and MBTI).
  • Model backbone choice significantly affects rhetorical style, contradiction rates, and topic engagement breadth; however, higher model tier does not consistently predict more verbose or complex behavior, highlighting provider and model-specific effects.
  • Operational rules and memory persistence modulate agent autonomy and topic exploration, with agents having full autonomy showing more inquisitive behavior and those without memory surprisingly exploring broader topics despite limited autonomy.

When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks

arXiv preprint arXiv Memory Poisoning Trust and Identity Governance and Policy

Ziwen Cai, Yihe Zhang, Xiali Hei

Published 2026-05-08

Venue: arXiv

Open Source Record

Abstract

Since the official release of ChatGPT in 2022, large language models (LLMs) have rapidly evolved from chatbot-style interfaces into agentic systems that can delegate work through tools and newly spawned subagents. While these capabilities improve automation and scalability, they also pose new security risks in multi-agent networks. Existing research has studied how individual LLM-based agents can be compromised through prompt injection, jailbreaking, poisoned retrieval data, or malicious extensions. Less is known about what happens after one agent is compromised inside a multi-agent network. In particular, inherited memory from parent agents can carry malicious instructions, outdated states, or unintended behavioral rules into newly created subagents, allowing a local compromise to spread across agent boundaries. In this paper, we model contemporary multi-agent networks through the lens of subagent inheritance. Our analysis shows that current frameworks can violate trust boundaries through insecure memory inheritance, weak resource control, stale post-spawn state, and improper termination authority. We demonstrate these risks in real agent frameworks and propose defenses based on explicit security invariants. Our findings show that inheritance is not merely an implementation detail, but a central component influencing the security of multi-agent systems.

Bullet Summary

  • Multi-agent networks leveraging large language models (LLMs) now feature agents that can spawn subagents, which inherit memory and capabilities, introducing complex security vulnerabilities.
  • Compromised parent agents can propagate malicious instructions, outdated states, or unintended behaviors to subagents through insecure memory inheritance, facilitating attack spread across agent boundaries.
  • The paper models multi-agent networks focusing on subagent inheritance, identifying key vulnerabilities including unauthorized sibling termination, stale post-spawn states, weak resource access control, and improper agent termination authority.
  • A formal role-based multi-agent network model is proposed, defining agent roles, capabilities, memory inheritance modes, lifespan modes, and interaction modes to systematize security policy enforcement.
  • Targeted defenses include an Agent Capability Registry to enforce role-scoped tool access and termination permissions, role-scoped memory projection to limit inherited context, and synchronization protocols to manage memory state divergence among agents.

Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents

arXiv preprint arXiv Memory Poisoning Prompt Injection Benchmarks and Evaluation

Jun Wen Leong

Published 2026-05-08

Venue: arXiv

Open Source Record

Abstract

Persistent memory attacks against LLM agents achieve high attack success rates against open-source models. In these attacks, malicious instructions injected via RAG-retrieved documents are stored in persistent memory and executed in later sessions. However, no systematic evaluation of defense effectiveness against this attack class exists. We evaluate six defenses across four architectural layers against delayed-trigger attacks on nine open-source models (5,040 runs, N=40 per condition). Four defenses fail at approximately baseline attack success rate: input-level filtering (Minimizer, Sanitizer) and retrieval-level filtering (RAG Sanitizer, RAG LLM Judge) achieve 88-89% ASR, statistically indistinguishable from the undefended baseline of 88.6%. Prompt Hardening partially fails at 77.8% ASR, with the reduction driven by two models at 0%: one genuine defense effect and one model-level refusal independent of the defense. The architectural explanation holds: input-level defenses cannot observe RAG-injected content, and retrieval-level classifiers are defeated by compliance-framed semantic masking. One defense, tool-gating at the memory layer (Memory Sandbox), reduces ASR to 0% for eight of nine models by removing the recall capability the attack requires. The exception inverts the defense entirely: a reasoning model that achieves 0% ASR under no defense via execution refusal inverts to 100% ASR under Memory Sandbox, because removing explicit recall forces the model onto the RAG pathway where its refusal mechanism does not activate. Memory Sandbox imposes zero utility cost in the absence of attack (BTCR = 100% across all conditions). These results provide the first systematic characterization of why each defense class fails against persistent memory attacks, enabling informed defense investment decisions.

Bullet Summary

  • Persistent memory attacks exploit malicious instructions embedded via RAG-retrieved documents, stored in stateful LLM agent memory, enabling delayed execution and evading traditional input and retrieval level defenses.
  • A systematic evaluation was conducted across six defenses at four architectural layers against delayed-trigger attacks on nine open-source LLMs, involving 5,040 experimental runs with rigorous statistical methodology.
  • Input-level defenses (Minimizer, Sanitizer) and retrieval-level defenses (RAG Sanitizer, RAG LLM Judge) failed to reduce attack success rates, maintaining around 88-89% success, due to architectural blind spots and semantic masking bypass.
  • Prompt Hardening reduced attack success rates moderately to 77.8%, but this was uneven and partially driven by two models exhibiting complete defense via refusal or genuine robustness, indicating limited general efficacy.
  • The Memory Sandbox defense at the memory layer, implementing tool-gating to remove recall capability, effectively eliminated attack success (0%) in eight of nine models, without incurring utility costs when attacks were absent.

What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook

arXiv preprint arXiv Trust and Identity Memory Poisoning Governance and Policy

Junyu Huo, Ziqi Mao, Zihao Wan, Gouri Ginde

Published 2026-05-08

Venue: arXiv

Open Source Record

Abstract

AI agents are increasingly framed as software-engineering teammates, yet most research studies them inside human-centered workflows. Little is known about the software-engineering discourse autonomous AI agents produce when they interact primarily with one another. This paper examines what autonomous AI agents discuss in MoltBook, an AI-agents-only social network, how that discourse is organized, and how it differs from human developer discourse. We combine human open coding of a 500-post sample, a concentration-plus-check topic-analysis pipeline over 4,707 English-filtered MoltBook technology posts, and a matched-instrument comparison against 5,211 GitHub Discussions posts. MoltBook technology discourse spans 12 recurring themes and is led by Security and Trust (27.4%). At the community level, activity is highly concentrated: the largest submolt contains 63.5% of posts and the Gini coefficient is 0.88, yet a stability-aware BERTopic pipeline still yields 32 non-outlier sub-topics. Compared with the GitHub Discussions baseline, MoltBook discourse contains fewer concrete, context-rich cues such as code-formatted artifacts, environment details, runtime failures, and reproduction steps; social mimicry appears only in a limited way, while idealization is mainly reflected through lower hedging. Overall, AI-only technical discourse is coherent but selective. It repeatedly returns to concerns such as security and trust, memory and context management, tooling and APIs, debugging and error handling, workflow automation, and infrastructure/ops, while omitting much of the concrete runtime and project-local detail common in human developer discourse. This may be because MoltBook contains fewer environment-specific failures, reproduction steps, and other concrete grounding cues.

Bullet Summary

  • The paper investigates software engineering discourse generated exclusively by autonomous AI agents on MoltBook, an AI-agents-only social network, addressing a gap since prior research mostly focuses on human-centered workflows involving AI.
  • The study employs a mixed-methods approach including human open coding of a sample of 500 MoltBook posts, a topic modeling pipeline on 4,707 English-filtered posts, and a comparative analysis against 5,211 human posts from GitHub Discussions to understand t...
  • MoltBook AI discourse revolves around 12 recurring themes, with Security and Trust constituting the largest share (27.4%), reflecting AI agents' prioritization of protection and trustworthiness in software engineering discussions.
  • Activity on MoltBook is highly concentrated; the largest subcommunity holds 63.5% of posts and the Gini coefficient of 0.88 indicates skewed participation, yet topic modeling reveals 32 non-outlier subtopics, showing topical diversity within concentrated ac...
  • Compared to human developers' GitHub Discussions, AI agent discourse on MoltBook contains fewer concrete, context-rich cues such as code snippets, environment details, runtime failures, and reproduction steps, indicating less situational grounding.

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

arXiv preprint arXiv Memory Poisoning Trust and Identity Agent-to-Agent Communication

Jiayuan Liu, Tianqin Li, Shiyi Du, Xin Luo, Haoxuan Zeng, Emanuel Tewolde, Tai Sing Lee, Tonghan Wang

Published 2026-05-08

Venue: arXiv

Open Source Record

Abstract

Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model--game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.

Bullet Summary

  • Expanding the accessible memory (history length) of large language model (LLM) agents in multi-agent repeated social dilemmas often degrades cooperation, a phenomenon termed the 'memory curse.'
  • Experiments involving 7 LLMs, 4 social dilemma games, and 500 interaction rounds revealed that increased memory reduced cooperation in 18 out of 28 model-game scenarios, although some models were immune to this effect.
  • Lexical and behavioral analyses linked the memory curse primarily to eroded forward-looking cooperative intent rather than heightened paranoia or suspicion.
  • Fine-tuning a memory-cursed LLM on forward-looking reasoning traces improved cooperation and generalized zero-shot across different games, confirming a causal role for strategic cognition.
  • Memory sanitization—replacing real historical interactions with synthetic cooperative records—restored cooperation significantly, demonstrating that the content of memory, not just its length, causes cooperation breakdown.

Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs

arXiv preprint arXiv Agent-to-Agent Communication Orchestration Risk Memory Poisoning

Hanlin Cai, Kai Li, Houtianfu Wang, Haofan Dong, Yichen Li, Falko Dressler, Ozgur B. Akan

Published 2026-05-08

Venue: arXiv

Open Source Record

Abstract

Federated fine-tuning (FFT) has emerged as a privacy-preserving paradigm for collaboratively adapting large language models (LLMs). Built upon federated learning, FFT enables distributed agents to jointly refine a shared pretrained LLM by aggregating local LLM updates without sharing local raw data. However, FFT-based LLMs remain vulnerable to model manipulation threats, in which adversarial participants upload manipulated LLM updates that corrupt the aggregation process and degrade the performance of the global LLM. In this paper, we propose an Augmented Model maniPulation (AugMP) strategy against FFT-based LLMs. Specifically, we design a novel graph representation learning framework that captures feature correlations among benign LLM updates to guide the generation of malicious updates. To enhance manipulation effectiveness and stealthiness, we develop an iterative manipulation algorithm based on an augmented Lagrangian dual formulation. Through this formulation, malicious updates are optimized to embed adversarial objectives while preserving benign-like parameter characteristics. Experimental results across multiple LLM backbones demonstrate that the AugMP strategy achieves the strongest manipulation performance among all competing baselines, reducing the global LLM accuracy by up to 26% and degrading the average accuracy of local LLM agents by up to 22%. Meanwhile, AugMP maintains high statistical and geometric consistency with benign updates, enabling it to evade conventional distance- and similarity-based defense methods.

Bullet Summary

  • Federated fine-tuning (FFT) enables collaborative adaptation of large language models (LLMs) by aggregating local model updates while preserving data privacy, but remains vulnerable to adversarial manipulation where malicious agents degrade global model per...
  • The paper introduces Augmented Model maniPulation (AugMP), a novel attack leveraging graph representation learning (GRL) to model correlations among benign LLM updates, allowing the generation of malicious updates that retain benign-like statistical and geo...
  • AugMP employs a variational graph autoencoder and graph spectral transformation to learn feature and topological relationships among benign updates, expanding the manipulation space while preserving stealthiness.
  • An augmented Lagrangian dual optimization method with quadratic penalties is formulated to iteratively craft malicious updates that embed adversarial objectives while maintaining similarity constraints, enhancing attack effectiveness and stealthiness.
  • Experimental evaluation on multiple LLM backbones (DistilBERT, Pythia, Qwen2.5) and datasets (AG News, Yahoo! Answers) shows AugMP significantly reduces global and local LLM accuracy by up to 26% and 22%, outperforming existing attacks like ALIE and RMP.

Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs

Semantic Scholar · Semantic Scholar scholarly work Semantic Scholar Orchestration Risk Memory Poisoning Benchmarks and Evaluation

Hanlin Cai, Kai Li, Houtianfu Wang, Haofan Dong, Yichen Li, Falko Dressler, Ozgur B. Akan

Published 2026-05-08

Venue: Semantic Scholar

Open Source Record

Abstract

Federated fine-tuning (FFT) has emerged as a privacy-preserving paradigm for collaboratively adapting large language models (LLMs). Built upon federated learning, FFT enables distributed agents to jointly refine a shared pretrained LLM by aggregating local LLM updates without sharing local raw data. However, FFT-based LLMs remain vulnerable to model manipulation threats, in which adversarial participants upload manipulated LLM updates that corrupt the aggregation process and degrade the performance of the global LLM. In this paper, we propose an Augmented Model maniPulation (AugMP) strategy against FFT-based LLMs. Specifically, we design a novel graph representation learning framework that captures feature correlations among benign LLM updates to guide the generation of malicious updates. To enhance manipulation effectiveness and stealthiness, we develop an iterative manipulation algorithm based on an augmented Lagrangian dual formulation. Through this formulation, malicious updates are optimized to embed adversarial objectives while preserving benign-like parameter characteristics. Experimental results across multiple LLM backbones demonstrate that the AugMP strategy achieves the strongest manipulation performance among all competing baselines, reducing the global LLM accuracy by up to 26% and degrading the average accuracy of local LLM agents by up to 22%. Meanwhile, AugMP maintains high statistical and geometric consistency with benign updates, enabling it to evade conventional distance- and similarity-based defense methods.

Bullet Summary

  • Federated fine-tuning (FFT) allows multiple distributed agents to collaboratively adapt a shared large language model (LLM) without sharing raw data, preserving privacy.
  • FFT-based LLMs are vulnerable to model manipulation attacks where adversarial participants upload malicious updates to degrade the global model's performance.
  • The paper introduces Augmented Model maniPulation (AugMP), a novel attack strategy leveraging a graph representation learning framework to capture correlations among benign LLM updates.
  • AugMP uses this graph-based representation to guide the generation of malicious model updates that closely mimic benign update characteristics.
  • An iterative manipulation algorithm based on an augmented Lagrangian dual formulation is developed to optimize malicious updates for both effectiveness and stealthiness.

Towards Security-Auditable LLM Agents: A Unified Graph Representation

Merged record merged scholarly record arXiv Memory Poisoning Agent-to-Agent Communication

Chaofan Li, Lyuye Zhang, Jintao Zhai, Siyue Feng, Xichun Yang, Huahao Wang, Shihan Dou, Yu Ji

Published 2026-05-07

Venue: arXiv

Open Source Record

Abstract

LLM-based agentic systems are rapidly evolving to perform complex autonomous tasks through dynamic tool invocation, stateful memory management, and multi-agent collaboration. However, this semantics-driven execution paradigm creates a severe semantic gap between low-level physical events and high-level execution intent, making post-hoc security auditing fundamentally difficult. Existing representation mechanisms, including static SBOMs and runtime logs, provide only fragmented evidence and fail to capture cognitive-state evolution, capability bindings, persistent memory contamination, and cascading risk propagation across interacting agents. To bridge this gap, we propose Agent-BOM, a unified structural representation for agent security auditing. Agent-BOM models an agentic system as a hierarchical attributed directed graph that separates static capability bases, such as models, tools, and long-term memory, from dynamic runtime semantic states, such as goals, reasoning trajectories, and actions. These layers are connected through semantic edges and security attributes, transforming fragmented execution traces into queryable audit paths. Building on Agent-BOM, we develop a graph-query-based paradigm for path-level risk assessment and instantiate it with the OWASP Agentic Top 10. We further implement an auditing plugin in the OpenClaw environment to construct Agent-BOM from live executions. Evaluation on representative real-world agentic attack scenarios shows that Agent-BOM can reconstruct stealthy attack chains, including cross-session memory poisoning and tool misuse, capability supply-chain hijacking and unexpected code execution, multi-agent ecosystem hijacking, and privilege and trust abuse. These results demonstrate that Agent-BOM provides a unified and auditable foundation for root-cause analysis and security adjudication in complex agentic ecosystems.

Bullet Summary

  • LLM-based multi-agent systems autonomously perform complex tasks but create a semantic gap between low-level events and high-level intent, challenging post-hoc security auditing.
  • Existing tools such as SBOMs and runtime logs offer fragmented data insufficient for capturing cognitive state evolution, capability bindings, memory contamination, and risk propagation across agents.
  • Agent-BOM is introduced as a unified hierarchical attributed directed graph representation separating static capabilities and dynamic runtime semantic states, linked via semantic edges and security attributes for comprehensive auditing.
  • A graph-query-based auditing paradigm is developed using Agent-BOM for path-level risk assessment, instantiated with the OWASP Agentic Top 10 security risks, supporting entry localization, backward/forward tracing, and adjudication.
  • Agent-BOM is implemented within the OpenClaw environment, enabling real-time construction of audit graphs from live multi-agent executions.

STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

arXiv preprint arXiv Memory Poisoning Benchmarks and Evaluation

Hanxiang Chao, Yihan Bai, Rui Sheng, Tianle Li, Yushi Sun

Published 2026-05-07

Venue: arXiv

Open Source Record

Abstract

Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it, with even the best evaluated model achieving only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user's query, and they struggle to recognize when a change in one aspect of the user's state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.

Bullet Summary

  • Introduces STALE, a benchmark designed to evaluate Large Language Model agents' ability to detect and resolve implicit conflicts in long-term memory, where new observations invalidate prior beliefs without explicit negation.
  • Identifies and categorizes implicit conflicts into Type I (co-referential attribute updates) and Type II (propagated conflicts through attribute dependencies), highlighting the challenge in latent belief revision.
  • Proposes a three-dimensional evaluation framework assessing State Resolution (detecting outdated beliefs), Premise Resistance (rejecting queries based on stale assumptions), and Implicit Policy Adaptation (applying updated beliefs in behavior).
  • Conducts extensive experiments benchmarking various state-of-the-art closed and open-source LLMs and memory frameworks, revealing poor performance with the best model achieving only 55.2% overall accuracy, demonstrating a significant gap in handling implici...
  • Finds that models can recognize outdated information but often fail to adapt responses or reject queries that presuppose stale information, especially in the more complex propagated (Type II) conflicts.
Load more articles