Evaluating Current Large Language Model Sandboxing Methods Against Latent Vulnerabilities from Adversarial Multimodal Prompts

1. Executive Summary

This report evaluates the efficacy of current sandboxing methodologies for Large Language Models (LLMs) and identifies latent vulnerabilities that become exploitable through adversarial multimodal prompts. The analysis reveals that existing sandbox solutions primarily concentrate on mitigating risks associated with LLM-generated code, often inadequately addressing threats embedded within or delivered via complex multimodal inputs. For instance, standard evaluation suites like SandboxEval, with its 51 test cases focused on code execution in Linux environments, do not currently cover threats such as image-in-image steganography or other multimodal attack vectors.1 The advent of Multimodal Large Language Models (MLLMs) introduces novel attack vectors where adversarial prompts, combining modalities such as text, image, and audio, can circumvent traditional input validation mechanisms and exploit the intricate processing pipelines (e.g., fusion layers, Optical Character Recognition (OCR) filters) inherent to these advanced models.

Key findings indicate that latent vulnerabilities manifest in several critical areas. These include the compromise of an MLLM’s perceptual capabilities leading to the misuse of tools and resources operating within sandboxed environments, the establishment of covert channels for data exfiltration that bypass conventional monitoring, and the circumvention of input/output controls initially designed for unimodal threats. LLM-based agents, particularly those interacting with Graphical User Interfaces (GUIs) or external systems through multimodal inputs, represent a significantly vulnerable interface where agent action validators may be bypassed.

A substantial gap exists between the capabilities of current sandboxing technologies and their ability to defend against sophisticated multimodal attacks. This necessitates targeted changes in components like MLLM fusion layers, input sanitization pipelines (including OCR and audio transcription filters), and agent action validation logic. High-level recommendations include the urgent integration of OCR and steganalysis modules into MLLM ingestion pipelines, the pursuit of more robust MLLM architectural designs resilient to cross-modal manipulation, and the implementation of enhanced monitoring and anomaly detection systems capable of interpreting cross-modal threats.

2. Introduction: The Evolving Threat Landscape of LLM Sandboxing and Multimodal Inputs

The rapid proliferation of Large Language Models (LLMs) has introduced a dual challenge to the cybersecurity landscape. On one hand, LLMs are increasingly utilized as powerful tools for complex tasks such as code generation, requiring robust sandboxing mechanisms—isolated environments for executing untrusted code or containing untrusted processes—to ensure the safe execution of their potentially untrusted outputs.1 On the other hand, LLMs themselves are susceptible to a variety of attacks, including prompt injection and data poisoning, which can manipulate their behavior or compromise their outputs.1 This inherent duality means that as LLMs are integrated into systems that execute code or interact with external environments, the risk of system compromise escalates.1 Sandboxes serve as a critical, albeit often reactive, defense against these amplified risks. The more powerful and autonomous an LLM becomes, the more sophisticated its sandboxing requirements are, yet this increased complexity simultaneously broadens the potential attack surface, especially with the introduction of new input modalities.

The emergence of Multimodal Large Language Models (MLLMs), capable of processing and integrating information from diverse modalities such as text, images, audio, and video, marks a significant evolution.4 Models like GPT-4 with Vision (GPT-4V) and Gemini, which gained prominence around 2023, exemplify this shift.4 This multimodal capability enhances user intent understanding, enables richer interactions, and expands applicability across various real-world scenarios.6 However, this advancement concurrently introduces a new and potent threat vector: multimodal inputs can serve as a novel attack surface, potentially bypassing security measures designed primarily for text-only LLMs.7 While sandboxing methods for code execution have matured over several years, their adaptation to these newer multimodal threats is still in its nascent stages.

This shift towards MLLMs fundamentally alters the security assumptions that underpin many existing LLM sandboxing strategies. Traditional sandboxes are often positioned at the execution layer, designed to contain malicious code generated by an LLM.1 Multimodal attacks, however, can originate at the MLLM’s perception layer—the components responsible for initial processing and interpretation of sensory inputs (e.g., visual encoders for images, speech-to-text systems for audio).4 If an MLLM’s core reasoning is compromised by an adversarial multimodal prompt, its subsequent actions—even if technically executed within a sandbox—may already be dictated by the adversary, subverting the sandbox’s intended purpose. This report aims to evaluate current sandboxing methods in the context of these emerging multimodal threats, identifying latent vulnerabilities and exploring potential mitigation pathways.

3. Current State of LLM Sandboxing: Architectures and Mechanisms

The primary purpose of LLM sandboxes is to create isolated environments for the execution of untrusted code generated by LLMs. This isolation is critical for preventing harm to the host system or the underlying assessment infrastructure, particularly in scenarios involving LLM evaluation frameworks and LLM-based agents that autonomously generate and execute code.1

Common Sandboxing Architectures and Technologies

Several architectures and technologies are commonly employed to achieve this isolation:

  • Containers (e.g., Docker with LXC): Linux containers (LXC), often managed via Docker, provide OS-level virtualization. They achieve isolation with relatively low performance overhead compared to full virtualization, making them a popular choice.9 However, misconfigurations relevant to LLM security can occur; for instance, if an LLM is compromised to generate code that exploits a container escape vulnerability due to overly permissive settings (e.g., privileged mode), the isolation is undermined.1
  • User-Mode Kernels (e.g., gVisor): User-mode kernels like gVisor intercept system calls made by applications running within the sandbox, emulating the kernel’s behavior in user space. This approach offers stronger isolation than standard containers by reducing the attack surface exposed by the host kernel.9 gVisor is often considered a good compromise, balancing enhanced security with acceptable performance for running untrusted code.9
  • Virtual Machines (VMs) (e.g., Firecracker, Kata Containers): VMs utilize hardware virtualization to create fully isolated guest environments, each with its own kernel. Lightweight hypervisors like Firecracker (used by AWS Lambda) or technologies like Kata Containers (which run container workloads in lightweight VMs) provide robust isolation boundaries.9 While offering strong security, VMs typically incur a higher performance overhead and can introduce complexity in orchestration and resource management.9
  • Specialized Isolation Frameworks (e.g., ISOLATEGPT): Recognizing the unique challenges of LLM-based systems, researchers have proposed specialized architectures like ISOLATEGPT. This framework aims to secure the execution of LLM applications (“apps”) by isolating their execution environments and mediating all interactions with the system or other apps through well-defined interfaces, contingent on user permissions.8 The goal is to reduce the attack surface by design, though such frameworks may introduce performance overhead (under 30% for most tested queries in ISOLATEGPT).8

Evaluation of Sandbox Effectiveness: The Role of SandboxEval

To systematically assess the effectiveness of sandboxing solutions, tools like SandboxEval have been developed. SandboxEval is a test suite featuring manually crafted test cases that simulate real-world safety scenarios specifically for LLM assessment environments dealing with untrusted code execution.1 It evaluates a range of vulnerabilities, including sensitive information exposure, filesystem manipulation, unauthorized external communication, and other potentially dangerous operations that could occur during the execution of LLM-generated code.1 The suite comprises 51 distinct properties associated with malicious code execution scenarios.1 Its design is particularly tailored for use during the “scoring” or “measurement” phase of LLM assessment, where the output of an LLM (often code) is executed and evaluated.1

While SandboxEval provides a valuable resource for the research community in evaluating sandboxing for code execution, its current focus appears to be primarily on vulnerabilities arising from the execution of code within a Linux system environment.2 The test cases are designed to probe the security of a Linux system by simulating malicious code execution.2 This orientation means that SandboxEval, in its present form, may not comprehensively cover vulnerabilities that are introduced through the interpretation of adversarial multimodal inputs if these inputs lead to malicious behavior before or independent of explicit, traditional code generation and execution. For example, if a multimodal prompt causes an MLLM to directly exfiltrate data via a legitimate API call it has access to—without generating new, overtly malicious code to do so—SandboxEval’s current scope might not detect this pathway if it’s primarily searching for conventional code execution vulnerabilities. (No specific information on Windows/Azure sandbox gaps for MLLMs was found in the provided materials, so this aspect remains focused on Linux as per SandboxEval’s scope).

Production Considerations for Sandboxes

Beyond the core isolation technology, several production considerations are vital for the secure and stable operation of LLM sandboxes. These include robust authentication and authorization mechanisms for accessing sandboxed resources, comprehensive audit logging of activities within the sandbox, strict resource limits (CPU, memory, I/O, network bandwidth) to prevent denial-of-service or abuse, egress filtering to control outbound network connections, mechanisms for managing code dependencies, and effective error handling.9 While crucial for operational security and stability, these measures may not inherently prevent sophisticated multimodal attacks that subvert the LLM’s internal logic before any sandboxed execution even occurs.

The following table provides a comparative overview of current LLM sandboxing techniques, highlighting their primary mechanisms, strengths, assumed threats, and potential latent gaps when confronted with adversarial multimodal prompts.

Table 1: Comparison of Current LLM Sandboxing Techniques and Their Limitations Against Multimodal Threats

TechniquePrimary Isolation MechanismKey StrengthsAssumed Primary Threat ModelIdentified Latent Gaps for Multimodal Adversarial Prompts
Containers (Docker with default LXC)OS-level virtualization (shared kernel)Low performance overhead, wide adoption, large ecosystem.Malicious code execution by LLM-generated programs.<ul><li>Does not typically inspect or sanitize complex multimodal data streams.</li><li>Relies on the MLLM to not misinterpret hostile instructions embedded in images/audio that could lead to abuse of legitimate container resources (e.g., network, file system within allowed paths).</li><li>Susceptible to kernel exploits if not properly hardened.</li><li>Misconfigurations (e.g., privileged containers 1) can undermine LLM-specific security.</li></ul>
User-Mode Kernels (Docker with gVisor)Interception and emulation of system calls in user space.Stronger isolation than standard containers by limiting direct kernel exposure, good balance of security and performance.Malicious code execution attempting kernel-level exploits or unauthorized syscalls.<ul><li>Similar to standard containers regarding input: if MLLM logic is subverted by multimodal input, gVisor protects the host kernel but may not prevent the MLLM from performing malicious actions within its gVisor-sandboxed environment if those actions use allowed (emulated) syscalls.</li><li>The initial compromise via multimodal input is outside gVisor’s direct control.</li></ul>
Virtual Machines (e.g., Firecracker, Kata Containers)Hardware-assisted virtualization (separate guest kernel).Strongest isolation boundary, dedicated kernel per sandbox.Malicious code execution attempting to compromise host or other VMs.<ul><li>Highest protection against sandbox escape via code execution.</li><li>However, if the MLLM itself is compromised by a multimodal prompt to misuse a tool within the VM (e.g., exfiltrate data via an allowed network connection), the VM’s isolation alone doesn’t prevent this internal misuse.</li><li>Orchestration complexity can also introduce vulnerabilities.9</li></ul>
Specialized Isolation Frameworks (e.g., ISOLATEGPT)Application-level execution isolation with mediated interfaces and user permissions for interactions.Designed specifically for LLM app ecosystems, aims to reduce attack surface by controlling inter-app and app-system communication.8Untrustworthy third-party LLM applications performing unauthorized actions or data exfiltration through defined interfaces.<ul><li>Assumes interfaces can robustly handle natural language (and by extension, multimodal) interactions without being subverted.</li><li>Adversarial multimodal prompts could manipulate the core LLM’s decision-making, causing it to request malicious actions through seemingly legitimate interface calls, potentially bypassing the intent of permission models if the LLM is deceived about the context or consequence of an action.</li></ul>
General Sandboxing Test Suites (e.g., SandboxEval)N/A (Evaluation Framework)Provides a standardized way to test for common code execution vulnerabilities (sensitive info exposure, filesystem manipulation, external comms).1Malicious code generated by LLMs and executed in a test/scoring environment.<ul><li>Primarily focused on vulnerabilities from code execution in a Linux environment.2</li><li>May not adequately cover attacks where multimodal inputs compromise the MLLM’s logic prior to code generation, or lead to misuse of legitimate functionalities without generating overtly malicious code.</li><li>Does not explicitly address threats from non-textual modalities embedded in inputs (none of its 51 test cases cover, for example, image steganography 3).</li></ul>

This comparative analysis underscores that while current sandboxing methods offer varying degrees of protection against threats like malicious code execution, they often operate under assumptions that may not hold true when facing adversarial multimodal inputs. These inputs can exploit the MLLM’s processing pipeline before the sandbox’s primary defenses are even engaged, or trick the MLLM into misusing sandboxed resources in subtle ways.

4. Adversarial Multimodal Prompts: A New Frontier of Attack

The advent of MLLMs, capable of processing and integrating information from diverse sources like text, images, and audio, has unlocked new functionalities but also exposed a new frontier for adversarial attacks. Understanding how MLLMs process these inputs is key to recognizing their vulnerabilities. (Acronyms like IPI, VLP, OCR, FMM, LSB will be defined on first use or their use minimized).

Understanding Multimodal LLMs (MLLMs) and Input Processing

MLLMs typically consist of a core Large Language Model, specialized encoders for non-textual modalities (e.g., vision encoders like CLIP or ResNet for images), and modules to align or fuse these different data streams.4 Vision-Language Pretraining (VLP) is a common technique, aiming to align image and text embeddings in a shared semantic space, often using contrastive learning on large image-text datasets.4 For example, CLIP (Contrastive Language-Image Pre-training) learns to associate images with their textual descriptions.4 However, such models can exhibit biases or preferences; CLIP, for instance, has been noted to sometimes prioritize textual data it extracts from an image (e.g., via Optical Character Recognition – OCR) over the purely visual signals, a characteristic that can be exploited.7

The architecture of MLLMs, designed to intricately fuse and interpret these varied data types, inherently creates complex interdependencies between modalities. These interdependencies, such as the alignment mechanisms between visual and textual features or the weighting given to different modalities during fusion, can become exploitable seams for adversaries. If an MLLM architecture inherently gives more weight to textual features it identifies within an image over the global visual context, an attacker can embed overriding textual commands directly into an image. The system perceives an “image input,” but the MLLM internally processes parts of it as “textual commands,” potentially bypassing image-specific sanitization filters and leading to command execution. The fusion layers where different modal representations are combined are also critical points of vulnerability.10

Taxonomy of Adversarial Multimodal Attacks

Adversarial multimodal attacks generally involve introducing subtle perturbations or maliciously crafted inputs across one or more modalities. The goal is to cause the MLLM to misclassify information, generate harmful or unintended outputs, or bypass its safety alignments.10 These attacks can be broadly categorized (information on “in the wild” vs. research lab demonstrations is not consistently available across all attack types in the provided materials, so this prioritization is based on documented exploit mechanisms):

  • Visual Prompt Injection / Image-based Attacks:
  • Text-in-Image (OCR Exploitation): Malicious textual instructions are embedded directly into images. MLLMs equipped with OCR capabilities can extract and process this text, potentially executing embedded commands.7 For instance, hidden text within images has been shown to trick models like GPT-4o.12 This is a well-documented and practical attack.
  • Image Hijacking: Subtle, often imperceptible, visual perturbations are applied to images to induce misclassification, disseminate misinformation, or trigger jailbreak behaviors in the MLLM.10 This differs from “Verbose Images” as its primary goal is to alter the semantic interpretation or safety compliance of the MLLM, rather than just the length of its output.
  • Verbose Images: Imperceptible perturbations designed to cause the MLLM to generate excessively lengthy or verbose outputs, potentially leading to resource exhaustion or denial-of-service.10 This attack focuses on output characteristics rather than semantic manipulation.
  • Steganography: Malicious queries or instructions are concealed within the image data itself, for example, using Least Significant Bit (LSB) encoding, as demonstrated by the Implicit Jailbreak Attack (IJA) framework 14 and StegoAttack.10 The MLLM is then prompted to extract and act upon this hidden information.
  • Audio-based Attacks:
  • Voice-based Jailbreaks: Adversarial instructions are delivered via audio input. Techniques like the “Flanking Attack” utilize benign, narrative-driven audio prompts to create a fictional or humanized context, making the MLLM more susceptible to executing a disallowed prompt embedded within the narrative.16
  • Contextual Audio Deception: Frameworks like BadRobot demonstrate crafting audio inputs to achieve contextual jailbreaks, induce safety misalignments, or promote conceptual deception.10
  • Adversarial Audio Perturbations: Subtle, often inaudible, perturbations are added to audio signals to cause mis-transcription or inject hidden commands that influence the MLLM’s behavior.10
  • Video-based Attacks:
  • Temporal Coherence Attacks: Perturbations are injected into keyframes of a video, guided by optical flow (e.g., Flow Matching Mask – FMM-Attack), to disrupt the MLLM’s understanding of temporal dynamics.10
  • Embedded Typographic Attacks: Misleading text, such as altered traffic signs or captions, is embedded directly into video frames to deceive the MLLM’s interpretation of the scene.10
  • UI Manipulation Attacks (primarily targeting MLLM agents):
  • Image Forgery for UI Elements/Apps: Attackers create visual elements (buttons, icons) or entire fake app interfaces that mimic legitimate ones. Vision-based MLLM agents may misidentify these forged elements, leading to unintended interactions.19
  • Viewtree Interference: The underlying structural information of a UI (view hierarchy) is manipulated, for instance, by overlaying windows or floating components. This can cause the agent to misinterpret the UI layout and interact with hidden malicious elements.19
  • Prompt Injection via Display: Malicious prompts are embedded directly into the text content displayed on a UI. When the MLLM agent parses the screen content for context, these injected instructions can influence its reasoning and subsequent actions.19
  • Transparent Overlays / Pop-up Interference: Invisible UI components are overlaid on legitimate ones to hijack user clicks or agent interactions, or deceptive pop-ups are used to mislead the agent.19
  • Indirect Prompt Injection (IPI) via Multimodal Payloads:
  • Malicious instructions are hidden within seemingly benign files like documents (e.g., Word files with hidden text readable by the MLLM), images, or even embedded in web pages that an LLM agent processes.12
  • These can manifest as “zero-click” exploits, where the MLLM executes the harmful instructions upon processing the file, without requiring further specific user interaction beyond providing the file.12 This is a well-documented and practical attack.
  • Jailbreak Attacks: These aim to bypass the MLLM’s safety protocols. Multimodal jailbreaks can exploit inconsistencies between modalities or employ techniques like the “Universal Master Key,” which combines adversarial image prefixes with specific text suffixes to overcome safety alignments.10
  • Data Integrity Attacks (Backdoors/Poisoning):
  • These attacks compromise the MLLM during its training or fine-tuning phases. Multimodal backdoors can be created using specific visual or textual triggers (e.g., AnyDoor, VLTrojan).10 Clean-label poisoning attacks like Shadowcast can subtly alter image-text pairs in the training data to manipulate model behavior without obvious data corruption.10

A clear trend in multimodal attacks is the move towards stealth and the exploitation of the expanding agentic capabilities of MLLMs. Techniques such as steganography 10 and indirect prompt injection via innocuous-looking files 12 are designed to bypass surface-level detection filters. These methods become particularly potent when targeting MLLMs that possess privileged access to system resources or are equipped with tool-use capabilities, as the compromised MLLM can then be directed to perform harmful actions. MLLM agents that interact with a wide array of inputs like files, web content, and UI elements offer a broader canvas for these stealthy injection techniques.6

How Modality Interdependencies are Exploited

Adversaries specifically target the ways MLLMs integrate and reconcile information from different modalities:

  • Creating Contradictions: Presenting conflicting information across modalities (e.g., an image depicting a safe scenario while an accompanying audio track contains alarming instructions) can confuse the MLLM or cause it to default to a more easily manipulated modality.10
  • Exploiting Fusion Layers: The mechanisms that combine representations from different modalities can be targeted. Manipulating one modality (e.g., adding subtle noise to an image) might disproportionately affect how another modality (e.g., a related text prompt) is interpreted or weighted during the fusion process, leading to an overall misinterpretation.10
  • Cross-Modal Triggers for Dormant Payloads: One modality can be used to activate a malicious payload embedded in another. For example, a specific phrase in a text prompt could trigger a backdoor that was previously embedded into the MLLM’s visual processing pathways during a data poisoning attack.10

The following table provides a taxonomy of these adversarial multimodal prompt attacks, detailing their mechanisms and potential impact on sandboxed environments.

Table 2: Taxonomy of Adversarial Multimodal Prompt Attacks and Exploitation Mechanisms

Attack CategorySpecific TechniqueModalities InvolvedCore Exploitation PrincipleExample Source(s)Potential Impact on Sandbox
Visual Prompt InjectionText-in-Image (OCR Abuse)Image, Text (extracted via OCR)MLLM prioritizes or misinterprets OCR-extracted text from image as direct instruction, bypassing visual semantic understanding.7LLM executes commands embedded as text in image, potentially leading to misuse of sandboxed file/network access, or unauthorized code execution.
Visual Prompt InjectionSteganography in ImagesImage, Text (hidden data)Malicious instructions or code hidden within image data (e.g., LSBs), extracted and processed by MLLM.10Covert command injection, extraction of malicious code for execution within sandbox, data exfiltration preparation.
Audio-based JailbreakFlanking Attack / Narrative FramingAudio (primary), Text (implied for MLLM processing & response)Humanizing interaction with narrative/fictional context to lower MLLM’s safety defenses and coax compliance with harmful requests.16MLLM generates harmful content, unsafe code for execution in a sandbox, or reveals sensitive information.
UI Manipulation (Agent Attack)Image Forgery for UI ElementsVisual (Agent Perception of UI), System Interaction (Agent Action)Agent’s vision module misidentifies a forged UI element (e.g., button, icon) as legitimate, leading to unintended action.19Agent clicks malicious button/link leading to data exfiltration, malware download, or unauthorized state change within its operational (possibly sandboxed) environment.
UI Manipulation (Agent Attack)Prompt Injection via DisplayVisual (Agent Perception of UI Text), Text (Embedded Prompt), System InteractionMalicious instructions embedded in UI text are parsed by the agent’s LLM core as part of screen context, overriding intended behavior.19Agent deviates from intended task, executes harmful actions, or leaks data based on prompts hidden in the UI it’s interacting with.
Indirect Prompt InjectionHidden Text in DocumentDocument (e.g., Word, PDF), Text (hidden instructions)MLLM processes hidden text (e.g., white text, metadata, specially formatted) within a document as valid instructions.12Agent executes malicious script extracted from document, exfiltrates data from session, or performs unauthorized actions using sandboxed tools.
Cross-Modal InconsistencyConflicting Multimodal InputsImage, Text, Audio (any combination)Exploiting MLLM’s inability to resolve contradictory information across modalities, leading to unpredictable or exploitable behavior.10MLLM may default to a more vulnerable processing path, ignore safety instructions, or execute unintended actions in sandbox.
Data Integrity (Backdoor)Multimodal TriggersVisual, Text (during training & inference)Specific, innocuous-looking multimodal input at inference time triggers a hidden malicious behavior embedded during training.10Sandbox may be bypassed entirely if backdoor grants privileged access, or MLLM may execute harmful operations within sandbox when triggered.

These attack vectors demonstrate the sophisticated ways adversaries can target MLLMs by exploiting the very mechanisms designed for richer, more intuitive human-computer interaction. The challenge for sandboxing is to account for these nuanced, often stealthy, attacks that target the MLLM’s core before or during its interaction with the sandboxed environment.

5. Latent Vulnerabilities: When Multimodal Prompts Meet LLM Sandboxes

Latent vulnerabilities in the context of LLM sandboxing refer to weaknesses in either the sandbox design, the MLLM architecture, or their interaction, which are not immediately apparent when considering traditional unimodal threats or straightforward code execution risks. These vulnerabilities become particularly exploitable through the nuanced and complex interactions introduced by adversarial multimodal prompts. (Risk ranking is not provided as the source material does not offer a consistent severity/likelihood assessment across all identified vulnerabilities.)

A. Bypassing Input Sanitization and I/O Controls

Current sandboxing systems may incorporate filters for known malicious code patterns or suspicious textual prompts. However, they are often ill-equipped to detect or neutralize sophisticated multimodal payloads, such as instructions hidden using image steganography 10 or embedded within audio signals that are perceptually benign but carry adversarial information.16 The underlying vulnerability here is an implicit assumption that input channels are adequately monitored for known, typically unimodal, threat signatures. Multimodal channels, however, offer novel avenues for smuggling commands or data past these defenses. The “input” to a sandboxed MLLM system is not merely the user’s typed text but the entirety of the data processed across all its sensory modalities. If a sandbox only scrutinizes the textual component of a prompt or the code ultimately generated by the LLM, it remains oblivious to malicious instructions embedded within an image that the MLLM interprets and acts upon directly. For example, an MLLM might process an image containing steganographically encoded instructions.10 Upon decoding these instructions, the MLLM might be directed to make an API call to an attacker’s server using a legitimate, sandboxed tool. The sandbox might only register this as a “tool use” event, failing to recognize the malicious origin of the instruction within the image data itself. The input sanitization process fails because it is not designed to look in the right “place” (e.g., image pixel data) or for the right “type” of threat (e.g., steganographically encoded commands).

B. Exploiting Agentic LLM Interactions with GUIs and External Tools

LLM-powered agents, which are increasingly common 36, frequently rely on multimodal perception, particularly vision, for understanding and interacting with GUIs.28 This reliance creates a vulnerability: UI manipulation attacks, such as Image Forgery (mimicking legitimate UI components), Viewtree Interference (manipulating the UI’s structural hierarchy), or Prompt Injection via Display (embedding malicious prompts into UI text), can deceive the agent’s perceptual systems.28 This deception can cause the agent to misuse its sandboxed tools or perform unintended, potentially harmful, actions within its operational environment. For instance, an agent tricked by a visually convincing but forged UI button 19 might inadvertently use a sandboxed browser or another tool to navigate to a malicious website, submit sensitive data to an attacker-controlled form, or download malware. This reveals a causal chain: an adversarial multimodal input (e.g., a forged UI element) compromises the MLLM agent’s perception, leading to an erroneous decision by the agent, which then results in the misuse of a sandboxed tool or resource, potentially violating sandbox policies or leading to a full compromise. The sandbox provides a contained environment for the agent’s tools (e.g., a browser). However, the agent’s decision to use that tool in a specific manner originates from the MLLM core. If this core is fed manipulated visual data about the UI, it may instruct the tool to perform an action that, while technically permitted by the sandbox’s rules (e.g., “visit a URL”), is malicious in its intent (e.g., “visit a phishing URL designed to look like a legitimate one”). The sandbox’s static rules regarding tool usage may be too coarse-grained to detect or prevent such contextually malicious actions.

C. Data Exfiltration Through Covert Multimodal Channels

MLLMs are not only consumers of multimodal data but also generators of it. This capability can be exploited for data exfiltration in ways that traditional Data Loss Prevention (DLP) systems or network monitoring within a sandbox might fail to detect. An MLLM, once compromised by an initial multimodal prompt, could be instructed to encode sensitive data into the pixels of a seemingly innocuous image it generates, or into subtle, difficult-to-detect audio artifacts. While direct empirical evidence of MLLMs generating steganographic output for exfiltration is still an emerging research area, the principles of input steganography 10 and the demonstrated ability of agents to exfiltrate data via compromised actions triggered by hidden instructions in input images/documents (as shown by Trend Micro’s Pandora PoC 12) strongly suggest this possibility. Some research explores steganography for audiovisual media where messages are concealed beyond spatial and temporal domains by deconstructing content into cover text, embedding messages linguistically, and then reconstructing audiovisual content.46 Sandboxes typically monitor explicit data flows, such as network connections to unauthorized destinations or file writes to restricted locations. They generally do not perform deep content analysis or forensic examination of generated multimodal files for steganographically hidden data. The capacity of MLLMs to generate multimodal content implies that sandboxes must evolve to not only control input modalities but also to meticulously scrutinize output across all modalities for signs of covert exfiltration. This significantly increases the complexity of sandbox monitoring and DLP strategies.

D. Compromise of Sandbox Integrity via Manipulated Multimodal Logic Bombs

Adversarial multimodal inputs could be used to plant “logic bombs” within the MLLM’s state or its understanding of the environment. In such a scenario, a specific, often innocuous-seeming, multimodal input encountered at a later time could trigger a pre-programmed malicious action or reveal a hidden instruction. Current sandboxes generally lack the sophisticated temporal context tracking or deep semantic understanding of multimodal interactions required to detect such multi-stage attacks, where the initial payload delivery and the subsequent trigger are separated across different inputs or interaction phases. This concept is related to backdoor attacks that employ multimodal triggers, where, for example, a visual cue combined with a specific textual phrase might activate a hidden malicious function.10 This vulnerability shares thematic similarities with LV-007 (Misuse of Orchestrated Tools/Plugins via Multimodal Hijacking) in that both involve a sequence of events or interactions leading to compromise, but LV-005 focuses on the delayed execution based on a trigger within the MLLM’s state, while LV-007 focuses on the misuse of external components.

E. Inadequacy in Handling “Excessive Agency” Fueled by Multimodal Inputs

The OWASP Top 10 for LLM Applications identifies “Excessive Agency” (LLM06) as a significant risk, where LLM-based systems granted extensive autonomy can perform damaging actions.47 Multimodal inputs can significantly amplify this risk. MLLMs gain a much richer, more nuanced understanding of their environment and tasks through multimodal perception, but this enriched understanding is also more susceptible to manipulation. If an MLLM’s perception or reasoning is subverted by adversarial multimodal inputs, its subsequent autonomous actions can be far more damaging or subtle than those of a unimodal agent. Sandboxes might effectively limit which tools an agent can use or what system resources it can access, but they may struggle to sufficiently constrain how or why these tools and resources are used if the agent’s intent is compromised by deceptive multimodal information. The sandbox needs to dynamically assess the trustworthiness of the MLLM’s perceived context, a capability that is largely absent in current designs.

F. Exploitation of Orchestration and Plugin Vulnerabilities through Multimodal Triggers

LLM systems often employ orchestrators like LangChain or LlamaIndex, and utilize various plugins to extend their capabilities, such as accessing external data sources, executing code, or interacting with web browsers.36 While these components enhance functionality, they also expand the attack surface. A hijacked prompt, potentially delivered or triggered through a multimodal channel, could instruct the LLM to misuse these tools in ways that compromise the sandbox or the host system, especially if the access granted to these tools is not meticulously sandboxed and monitored.36 It is crucial to distinguish between the sandboxing of the LLM itself and the sandboxing of its tools and plugins. A multimodal attack might not directly breach the LLM’s primary sandbox but could deceive the LLM into using a poorly sandboxed or overly permissive tool in a malicious manner. For example, the LLM might operate within a secure container but be granted access to a “web browsing tool.” If this tool itself has vulnerabilities or is not sufficiently restricted, a multimodal prompt (e.g., an image containing “instructions” to use the web tool to download a malicious file and then execute it using another available tool) could cause the LLM to orchestrate this attack sequence using a series of seemingly legitimate tool calls. The vulnerability, in this case, lies at the interface between the LLM and its external tools, and in the potentially inadequate sandboxing of those tools, particularly when the LLM’s reasoning has been subverted by a multimodal attack. This vulnerability is related to LV-005 (Multimodal Logic Bombs) as both can involve a sequence of operations, but LV-007 specifically focuses on the exploitation of external tool/plugin interfaces.

The following table summarizes these identified latent vulnerabilities.

Table 3: Identified Latent Vulnerabilities in LLM Sandboxes Exploitable by Multimodal Prompts

Vulnerability IDLatent Vulnerability DescriptionExploiting Multimodal Technique(s)Affected Sandbox Principle/ComponentConsequenceIllustrative Source(s)
LV-001Input Sanitization Bypass for Embedded Commands/DataSteganography in images/audio, Text-in-Image (OCR abuse), Hidden text in documents, Adversarial audio perturbations.Input filtering mechanisms, MLLM’s perceptual processing pipelines (vision, audio, document parsers).Unauthorized command execution, data exfiltration, extraction of malicious code for execution, overriding LLM instructions.10
LV-002Agent Perception Deception via UI ManipulationImage Forgery for UI elements/apps, Viewtree Interference, Prompt Injection via Display, Transparent Overlays.Agent’s GUI perception module (visual understanding), Decision-making logic, Action execution validation.Agent performs unintended actions (e.g., clicks malicious link, enters data into fake form, grants permissions) within its operational environment, potentially leading to data exfiltration or malware execution via sandboxed tools.28
LV-003Covert Data Exfiltration via Generated Multimodal ContentAdversarially guided image/audio generation (post-compromise by any multimodal attack), Steganographic encoding in output files.Output content filters, Data Loss Prevention (DLP) systems, Egress traffic analysis.Sensitive data encoded into seemingly benign generated images, audio files, or other multimodal outputs, bypassing typical data exfiltration detection mechanisms.Implied by 12 (exfiltration focus) & steganography principles.10
LV-004Resource Exhaustion via Multimodal Input Manipulation“Verbose Images” attack, Specially crafted complex multimodal inputs causing excessive MLLM processing or tool invocation.Resource management (CPU, memory, GPU limits), Rate limiting components of sandbox/MLLM serving infrastructure, Input complexity analysis.Denial of service for the MLLM or its sandboxed environment, excessive operational costs.10 (Verbose Images)
LV-005Exploitation of Multimodal Logic Bombs / Delayed TriggersInitial embedding of hidden instructions/triggers via one multimodal input, subsequent activation by a different (possibly innocuous) multimodal input.Temporal context tracking, Cross-interaction analysis, State monitoring within the sandbox.Delayed execution of malicious commands, sandbox policy circumvention at a later stage, making attribution difficult.Related to 10 (multimodal backdoors like AnyDoor).
LV-006Amplification of “Excessive Agency” through Manipulated PerceptionAny multimodal attack that deceives the MLLM’s understanding of its environment or task, leading to misuse of its granted autonomy.Agent permission models, Action validation logic, Human-in-the-loop oversight mechanisms.MLLM agent performs highly damaging actions based on flawed, adversarially influenced understanding, even if individual tool uses are technically within sandbox policy.47 (Excessive Agency concept).
LV-007Misuse of Orchestrated Tools/Plugins via Multimodal HijackingMultimodal prompt injection leading to malicious commands for orchestrators (e.g., LangChain) or plugins.Sandboxing of individual tools/plugins, Interface security between LLM and tools, Monitoring of tool API calls.LLM orchestrates malicious actions using legitimate but poorly sandboxed or overly permissive tools, potentially leading to sandbox escape or compromise of connected systems..36

These vulnerabilities highlight a critical theme: the “trusted computing base” for an MLLM system must encompass not only the execution environment but also the MLLM’s perceptual and reasoning faculties. If these core cognitive functions can be subverted by adversarial multimodal inputs, the sandbox risks becoming merely a container for an already compromised actor, rather than an effective protection against an untrusted one. Traditional sandboxing often assumes the code or process inside it is the primary untrusted entity. For LLMs, the model itself is often implicitly trusted, and its outputs (like generated code) are what get sandboxed. Multimodal attacks challenge this by potentially turning the LLM itself into an untrusted actor before it generates any output for sandboxing, or by tricking it into misusing its legitimate, sandboxed tools.

6. Case Studies: Illustrative Scenarios of Multimodal Exploitation

To further illuminate how these latent vulnerabilities can be exploited, the following case studies describe plausible scenarios based on documented attacks and MLLM capabilities. (A server-side MLLM example is implicitly covered by the Pandora PoC style attack, as such agents often run on servers).

A. Indirect Prompt Injection via Hidden Instructions in Documents/Images (Pandora PoC Style)

  • Scenario: An LLM-powered agent, such as a data analysis assistant integrated with a service like ChatGPT’s Data Analyst feature or a custom enterprise agent built on GPT-4o, is provided with a document (e.g., a Microsoft Word file, PDF) or an image for analysis or processing.12 The agent has capabilities to read files, interpret content, and potentially execute code (e.g., Python within its sandboxed Jupyter kernel) or interact with other services.
  • Multimodal Exploit: The uploaded document or image contains hidden malicious instructions. In a Word document, this could be text formatted as white-on-white, hidden via formatting options (like CTRL+SHIFT+H in Word), or embedded in metadata fields that the MLLM might parse.12 In an image, instructions could be steganographically encoded into pixel data 10 or embedded as barely visible text that the MLLM’s OCR component can extract.12 These instructions are crafted to be interpreted by the LLM as commands.
  • Sandbox Interaction & Exploitation: The agent processes the file within its sandboxed environment. The MLLM core extracts and acts upon the hidden instructions. These instructions might command the agent to:
  1. Exfiltrate sensitive data it has access to from the current session (e.g., previous conversation history, contents of other uploaded files) by encoding it and sending it to an attacker-controlled URL using a sandboxed networking tool.12
  2. Utilize its sandboxed code interpreter (e.g., Python execution environment) to run a malicious script that was also embedded or is constructed based on the hidden instructions.12 This script could attempt to probe the sandbox environment, exfiltrate data, or establish persistence if the sandbox has weaknesses.
  • Latent Vulnerability Exposed: This scenario primarily exposes LV-001 (Input Sanitization Bypass) and LV-007 (Misuse of Orchestrated Tools). The sandbox’s input validation mechanisms fail to detect the malicious payload hidden within the multimodal file. The agent’s legitimate, sandboxed tools (file access, code interpreter, network access) are then abused based on the compromised logic of the LLM, which now acts on the attacker’s instructions. The sandbox implicitly trusts the LLM’s intent in directing these tools. The “zero-click” nature of such exploits, where the malicious action is triggered simply by the agent processing the file, is particularly dangerous.12
  • Supporting Evidence: Research by Trend Micro using their Pandora Proof-of-Concept AI agent demonstrated data exfiltration and unauthorized code execution triggered by hidden instructions in documents and images processed by MLLMs like GPT-4o.12 The IJA framework details steganographic embedding of harmful queries in images.14
  • Potential Mitigations: Comprehensive input validation including OCR and steganalysis for uploaded files.15 Stricter sandboxing of code execution environments with fine-grained permission controls for file and network access.36

B. UI Manipulation Attacks Against Mobile LLM Agents Leading to Sandbox Misuse

  • Scenario: A mobile LLM agent, which could be a system-level AI assistant developed by an OEM with elevated privileges or a third-party universal agent utilizing Android’s accessibility services, is tasked with performing an action that involves interacting with a third-party application’s UI on a smartphone.28
  • Multimodal Exploit: The agent navigates to a malicious application or a compromised webpage that presents a deceptive UI. This UI might:
  1. Display forged UI elements (e.g., a button labeled “Confirm Purchase” that actually initiates data transfer to an attacker) that visually mimic legitimate components.19
  2. Use transparent overlays to hijack tap gestures intended for legitimate underlying elements.19
  3. Inject adversarial prompts directly into visible UI text fields or labels, which the agent’s LLM core parses as part of its screen understanding process.19
  • Sandbox Interaction & Exploitation: The mobile LLM agent, relying on its visual perception capabilities (an MLLM function) to understand the screen, is deceived by the manipulated UI.
  1. It might “click” a malicious button, believing it to be benign.
  2. It could input sensitive information (e.g., credentials, personal data) into a fake input field.
  3. Its task execution logic could be hijacked by prompts injected into the display. Even if the agent’s low-level actions (like simulating taps or text input) are technically “sandboxed” or restricted by OS permissions (e.g., via accessibility services or system APIs), the consequence of these actions is dictated by the malicious UI and the agent’s compromised interpretation. For example, clicking a button might open a sandboxed browser instance, but if the URL is attacker-controlled due to UI deception, the sandbox for the browser doesn’t prevent navigation to a harmful site.
  • Latent Vulnerability Exposed: This highlights LV-002 (Agent Perception Deception). The agent’s visual perception module becomes an attackable component. The sandbox or OS-level restrictions may not validate the semantic integrity or trustworthiness of the external UI environment with which the agent interacts. The agent, operating with potentially high privileges, acts as a confused deputy.
  • Supporting Evidence: The AgentScan framework’s evaluation of nine widely deployed mobile LLM agents found that all were vulnerable to targeted attacks, with UI manipulation attacks (like Transparent Overlay and Pop-up Interference) being universally effective, leading to behavioral deviation, privacy leakage, or full execution hijacking.28
  • Potential Mitigations: Enhanced UI context validation by the agent, potentially cross-referencing visual information with structural UI data (view hierarchy) if available and not tampered with.19 Stricter permission models for agent actions triggered by UI interactions, possibly requiring user confirmation for sensitive operations.47

C. Resource Exhaustion via Adversarially Crafted Multimodal Inputs

  • Scenario: An MLLM service, possibly with sandboxed execution for certain tasks (e.g., code generation, complex data processing), is exposed to user-provided multimodal inputs.
  • Multimodal Exploit: An attacker submits an image that has been adversarially crafted using a technique like the “Verbose Images” attack.10 This involves applying imperceptible perturbations to an image that, when processed by the MLLM’s vision and language components, causes it to generate an unusually long, complex, or computationally intensive response. Alternatively, a complex combination of visual and textual prompts could be designed to trigger extensive, recursive tool use or deep reasoning chains.
  • Sandbox Interaction & Exploitation:
  1. The MLLM attempts to process the adversarial image, leading to excessive CPU/GPU consumption during the visual encoding or cross-modal fusion stages.
  2. The MLLM generates an extremely verbose textual output, consuming significant memory and processing time for token generation and potentially overwhelming downstream components or logging systems.
  3. If the MLLM is prompted to generate code based on the malicious multimodal input, it might produce overly complex or inefficient code that, when executed in the sandbox, consumes disproportionate resources (CPU, memory, execution time).
  • Latent Vulnerability Exposed: This scenario points to LV-004 (Resource Exhaustion). While sandboxes often have resource limits, these might be calibrated for typical workloads. Adversarial multimodal inputs can be specifically designed to maximize resource consumption in the MLLM’s internal processing before or during sandboxed execution in a way that standard input validation might not catch. The vulnerability lies in the sandbox’s inability to predict or mitigate resource spikes caused by the MLLM’s internal reaction to certain types of adversarial multimodal data.
  • Supporting Evidence: The concept of “Verbose Images” causing MLLMs to produce excessively verbose outputs, increasing energy consumption and latency, is documented.10 OWASP LLM Top 10 also lists “Unbounded Consumption” as a risk, which can be mitigated by rate limiting and dynamic resource allocation 47, but adversarial inputs could seek to bypass naive limits. Trend Micro also recommends resource limitation for sandboxes to prevent abuse or exhaustion.22
  • Potential Mitigations: Stricter and more dynamic resource quotas for MLLM processing and sandboxed execution, potentially informed by input complexity analysis.47 Input filtering to detect characteristics of known resource exhaustion attacks (e.g., unusual image properties).22

These case studies illustrate that the interaction between multimodal attacks and LLM systems can lead to a variety of security failures, even when traditional sandboxing is in place. The core issue is that the MLLM’s perceptual and reasoning capabilities can be subverted, turning the MLLM itself into an unwitting accomplice or a compromised actor within its own sandboxed environment. This fundamentally challenges the traditional model where the sandbox is primarily designed to contain a known-untrusted piece of code.

7. Mitigation Strategies and Defensive Postures

Addressing the latent vulnerabilities exposed by multimodal attacks requires a multi-layered, defense-in-depth strategy. This strategy must encompass the design of sandboxes themselves, the architecture and training of MLLMs, enhanced monitoring capabilities, specific security measures for LLM agents and their tools, and robust supply chain security practices. The operational complexity (latency, cost) of these mitigations should be considered, prioritizing deployment for high-risk agents or sensitive data interactions.

A. Designing “Multimodal-Aware” Sandboxes

Traditional sandboxes, often focused on unimodal (textual or code) threats, need to evolve into “multimodal-aware” systems:

  • Comprehensive Input Validation & Sanitization:
  • Mechanisms must be developed to detect and neutralize threats embedded in various modalities, not just text. This includes integrating OCR and subsequent textual analysis for images, performing steganalysis on images to detect hidden data 15, filtering audio for adversarial noise, and transcribing audio to analyze for spoken commands.36 All inputs, regardless of their modality, should be treated as untrusted.52
  • Context Isolation: Data sources should be segregated to prevent untrusted inputs (e.g., a user-uploaded image) from contaminating or influencing the processing of privileged or sensitive information streams (e.g., internal system prompts or data from trusted databases).13
  • Cross-Modal Consistency Checks: Implement logic to detect significant contradictions or suspicious alignments between information presented across different input modalities. For example, if an image depicts one instruction and an accompanying audio track conveys a conflicting one, this could be flagged as a potential attack.10
  • Enhanced Resource Limitation & Egress Filtering: While standard practice, the enforcement of resource limits (CPU, memory, I/O, network) and egress filtering needs to be more dynamic and potentially informed by the nature and complexity of the multimodal inputs being processed.47 For instance, processing a dense video input might legitimately require more resources than a simple text prompt, but thresholds should be in place to detect anomalous consumption indicative of an attack like “Verbose Images”.10

B. Robust MLLM Architectures and Training

The MLLM itself is a critical line of defense:

  • Adversarial Training for Multimodal Threats: MLLMs should be explicitly trained using datasets that include examples of multimodal adversarial attacks. This can improve their inherent robustness against such manipulations.11 For instance, adversarial finetuning (AF) can be used to improve resistance to Indirect Prompt Injection (IPI) attacks, though adaptive attacks can still bypass some AF defenses.13 Techniques like ProEAT, which uses adversarial training, aim to enhance jailbreak robustness, with some studies showing significant reductions in attack success rates against specific jailbreaks.14
  • Secure Fusion Mechanisms: Research and development are needed for modality fusion mechanisms within MLLMs that are less susceptible to being dominated or manipulated by a single adversarial modality. This involves careful design of how information from different senses is weighted and integrated.
  • Improved and Verified Encoders: Vision, audio, and other modal encoders should be designed and rigorously tested to be less prone to deception (e.g., vision encoders being less easily fooled by textual content embedded in images or subtle visual perturbations 7).

C. Enhanced Monitoring, Anomaly Detection, and Threat Modeling

Proactive detection and response are crucial:

  • Monitoring LLM Agent Behavior: For MLLM-powered agents, it is essential to track their interactions with GUIs, external tools, and data sources across all relevant modalities. Anomalous patterns, deviations from expected workflows, or suspicious sequences of actions should be flagged.22 Tools like LLM Guard can provide detailed interaction logs.38
  • Output Filtering and Monitoring: All outputs generated by the MLLM, including multimodal content (images, audio), should be analyzed for anomalies or signs of embedded malicious content or covert data channels.52
  • Comprehensive Threat Modeling: Organizations deploying MLLMs must engage in proactive threat modeling that specifically considers risks arising from multimodal inputs and complex agentic behaviors.52 The NDSS paper on LLM-based threat modeling suggests leveraging LLMs themselves to assist in this complex process.52

D. Securing LLM Agents and Their Tool Use

Given that agents are prime targets, specific safeguards are needed:

  • Principle of Least Privilege (PoLP) for Agents: MLLM agents should operate with the minimum necessary permissions and access to tools, data, and system resources, especially when processing untrusted multimodal inputs.36
  • Rigorous Sandboxing of External Tools: Any external tools or plugins utilized by the LLM agent must be robustly sandboxed themselves. The interface between the LLM and these tools is a critical security boundary.36
  • Human-in-the-Loop (HITL) for High-Risk Actions: For actions deemed critical or irreversible, especially if triggered by complex multimodal interpretations, mandatory human approval should be implemented.47 The UX for presenting suspicious multimodal flags to human reviewers should be clear, concise, and provide actionable context to avoid overwhelming them and leading to “alert fatigue”.61
  • Input-Level Defenses for IPI: Techniques such as instructional prevention (warning the model about external commands), data prompt isolation (using delimiters), and sandwich prevention (repeating user commands after tool output) can help mitigate IPI attacks.13

E. Supply Chain Security for MLLMs

The security of an MLLM begins with its creation:

  • Thorough vetting of data sources used for pre-training and fine-tuning, continuous vulnerability scanning of model components and dependencies, and maintaining Software Bills of Materials (SBOMs) for MLLMs are essential.36 This is particularly critical as poisoned multimodal training data can create inherent, hard-to-detect vulnerabilities.36
  • Implementing strict sandboxing during the model development lifecycle (training, fine-tuning) can limit the model’s exposure to unverified or potentially malicious data sources.47

F. Specific Frameworks and Architectural Approaches

  • ISOLATEGPT: The architectural principles of frameworks like ISOLATEGPT, which focus on isolating LLM application execution and mediating all interactions through well-defined, permissioned interfaces 8, could be evaluated and potentially extended to enhance resilience against adversarial multimodal inputs.
  • AI Gateways: These can act as centralized policy enforcement points for all LLM interactions. An AI Gateway can validate inputs (potentially across modalities if equipped with such capabilities), filter responses, manage access to Retrieval Augmented Generation (RAG) data sources from approved locations only, and log interactions for security auditing.52

The diverse nature of multimodal attacks, targeting various stages from data ingestion to agentic action, necessitates a defense-in-depth approach. A single-point solution, such as a simple network firewall around the sandbox, will prove insufficient. Effective mitigation requires a holistic strategy that integrates security measures into the MLLM’s internal architecture, its input/output processing pipelines, the sandboxing environment itself, and the protocols governing agent interactions with the external world and its tools. An AI Gateway 52 represents an attempt to centralize some of these controls, but the intrinsic robustness of the MLLM remains paramount.

8. Future Research Directions and Unresolved Challenges

The rapid evolution of MLLMs and the corresponding adversarial techniques present a dynamic and challenging security landscape. Several key areas require further research and development to ensure the safe deployment and operation of these powerful AI systems. A rough 1-3 year R&D agenda should prioritize developing standardized benchmarks and red-teaming tools, followed by research into intrinsically robust MLLM architectures and formal verification methods.

  • Standardized Benchmarks for Multimodal Sandbox Security:
    There is a pressing need for comprehensive benchmarks, analogous to SandboxEval for code execution 1, but specifically designed to test the resilience of MLLM sandboxes against a wide array of sophisticated multimodal adversarial attacks. A “Multimodal SandboxEval v2” could initially include test cases covering:
    * Text-in-Image Attacks: At least 3 test cases involving OCR abuse with varying levels of obfuscation.
    * Image Steganography: At least 3 test cases using different LSB encoding techniques and payload complexities.
    * Audio-based Jailbreaks: At least 2 test cases employing narrative framing or flanking attacks.
    * UI Manipulation (Agent-focused): At least 2 test cases involving forged UI elements or prompt injection via display.
    Such benchmarks should include diverse attack vectors like steganography, UI manipulation, and cross-modal inconsistency exploits. Furthermore, robust evaluation metrics are required to consistently assess security and robustness against these complex threats, particularly those targeting cross-modal interdependencies and employing stealthy techniques.11
  • Proactive Threat Modeling and Red Teaming for Emerging Multimodal Vectors:
    As MLLMs continue to integrate more modalities (e.g., haptic feedback, sensor data) and exhibit increasingly complex agentic capabilities, new attack surfaces will inevitably emerge. Systematic and continuous investigation of these novel attack vectors is crucial.28 This includes the development of automated red-teaming tools specifically designed to generate and test multimodal adversarial attacks, helping to uncover vulnerabilities before they can be exploited in the wild. Collaboration with steganography experts and HCI practitioners will be vital here.
  • Developing Intrinsically Robust MLLM Architectures:
    A significant research thrust should focus on designing MLLM architectures that are inherently more resistant to adversarial manipulation. This could involve novel approaches to modality fusion, attention mechanisms that are less easily deceived, or training methodologies that instill a deeper, more robust understanding of cross-modal relationships. The pursuit of provably safe AI systems, while ambitious, remains a long-term goal.39
  • Formal Verification of MLLM Sandboxes and Interaction Protocols:
    Exploring the application of formal methods to verify the security properties of MLLM sandboxes and the interaction protocols between MLLMs, their tools, and their environment could provide stronger assurances of safety. This would involve formally specifying desired security properties (e.g., “no data exfiltration via generated images”) and attempting to prove that a given sandbox/MLLM system design adheres to them, even in the face of defined classes of multimodal threats. Hardware-level sandbox researchers could contribute significantly to this area.64
  • Understanding and Mitigating Low-Level Processing Vulnerabilities in Multimodal Contexts:
    The concept of “glitch tokens”—malformed or anomalous token sequences—has been shown to disrupt LLM behavior.28 Further research is needed to understand how such vulnerabilities manifest in the context of multimodal tokenization and processing. Can adversarial multimodal inputs be crafted to induce such glitches, and could these be exploited to bypass security mechanisms or compromise sandboxed MLLMs?
  • Ethical Implications and Responsible Development of Agentic MLLMs:
    The increasing power and autonomy of MLLM agents, coupled with their vulnerability to compromise via multimodal attacks, raise significant ethical concerns. Research must continue to address the risks of these agents being weaponized for malicious activities, such as sophisticated social engineering, disinformation campaigns, or autonomous cyberattacks.52 Frameworks for responsible development and deployment are essential.

A fundamental, unresolved challenge lies in the inherent complexity and often “black box” nature of many large MLLMs. This opacity, combined with the rapidly evolving landscape of multimodal attack techniques, makes it exceedingly difficult to design future-proof sandboxes and defenses. Security measures are frequently reactive, developed in response to known attacks. New methods for encoding malicious information or influencing MLLM processing across different modalities are constantly being discovered (e.g., the diverse and evolving families of jailbreaks and prompt injection techniques 1). A sandbox designed to counter today’s known multimodal threats might be rendered ineffective by an attack leveraging a novel multimodal encoding scheme or a newly discovered vulnerability in the MLLM’s perceptual pipeline tomorrow. This underscores the critical need for more fundamental research into MLLM interpretability, explainability, and the development of defenses that are robust against unforeseen attack variations.

9. Conclusion: Securing the Next Generation of AI Interaction

The integration of multimodal capabilities into Large Language Models has undeniably ushered in a new era of AI interaction, offering unprecedented richness and utility. However, this advancement brings with it a sophisticated and subtle class of threats that current LLM sandboxing paradigms are largely ill-equipped to handle. This report has detailed how adversarial multimodal attacks can exploit latent vulnerabilities within these systems, moving beyond traditional code-execution risks to target the very perceptual and reasoning faculties of MLLMs.

The analysis reveals critical weaknesses: input sanitization mechanisms often overlook malicious payloads embedded in non-textual modalities; agentic MLLMs can be deceived by manipulated UIs or multimodal inputs, leading to the misuse of sandboxed tools; and covert channels for data exfiltration can be established through generated multimodal content. Furthermore, the complex interplay between MLLMs, their external tools, and orchestrators introduces additional vulnerabilities when the MLLM’s core logic is subverted by a multimodal attack.

A fundamental paradigm shift is required in how we approach the security of these advanced AI systems. The focus must expand from merely sandboxing LLM outputs (such as generated code) to comprehensively securing the entire MLLM processing pipeline—from initial multimodal input ingestion through perception and reasoning to action and output generation—as well as the agentic interactions these models undertake. The sophistication of attacks, demonstrated by techniques like steganography within images 10, intricate UI manipulations against mobile agents 28, and indirect prompt injections via seemingly benign documents 12, demands an equally sophisticated and proactive defensive posture. As a low-hanging fruit, even deploying robust OCR-based text filters in existing MLLM ingestion pipelines could block a significant portion of text-in-image attacks, which have shown high success rates in research.78

Moving forward, the development of trustworthy MLLM systems necessitates a concerted effort. We urge standards bodies (e.g., IEEE, ISO/IEC JTC 1/SC 42) to form working groups dedicated to multimodal AI security and sandboxing standards.70 We encourage MLLM vendors and developers to release comprehensive Software Bills of Materials (SBOMs), including details about training datasets and model components, to enhance transparency and supply chain security.47 Without such measures, the immense potential of multimodal AI risks being undermined by a new generation of adversarial exploits that operate at the subtle intersection of perception, reasoning, and action.

Works cited

  1. arxiv.org, accessed on May 28, 2025, https://arxiv.org/html/2504.00018v1
  2. SandboxEval: Towards Securing Test Environment for Untrusted Code – arXiv, accessed on May 28, 2025, https://arxiv.org/pdf/2504.00018?
  3. [2504.00018] SandboxEval: Towards Securing Test Environment for Untrusted Code – arXiv, accessed on May 28, 2025, https://arxiv.org/abs/2504.00018
  4. arxiv.org, accessed on May 28, 2025, https://arxiv.org/html/2409.14993
  5. Unleashing the Other Side of Language Models: Exploring Adversarial Attacks on ChatGPT, accessed on May 28, 2025, https://cryptographycaffe.sandboxaq.com/posts/adversarial-chatgpt/
  6. LLM-Powered AI Agent Systems and Their Applications in Industry – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2505.16120v1
  7. From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2502.00735v1
  8. www.ndss-symposium.org, accessed on May 28, 2025, https://www.ndss-symposium.org/wp-content/uploads/2025-1131-paper.pdf
  9. Code Sandboxes for LLMs and AI Agents | Amir’s Blog, accessed on May 28, 2025, https://amirmalik.net/2025/03/07/code-sandboxes-for-llm-ai-agents
  10. arxiv.org, accessed on May 28, 2025, https://arxiv.org/abs/2503.13962
  11. [Literature Review] Survey of Adversarial Robustness in Multimodal Large Language Models – Moonlight | AI Colleague for Research Papers, accessed on May 28, 2025, https://www.themoonlight.io/en/review/survey-of-adversarial-robustness-in-multimodal-large-language-models
  12. Unveiling AI Agent Vulnerabilities Part III: Data Exfiltration | Trend …, accessed on May 28, 2025, https://www.trendmicro.com/vinfo/us/security/news/threat-landscape/unveiling-ai-agent-vulnerabilities-part-iii-data-exfiltration
  13. OWASP Top 10 LLM, Updated 2025: Examples & Mitigation Strategies – Oligo Security, accessed on May 28, 2025, https://www.oligo.security/academy/owasp-top-10-llm-updated-2025-examples-and-mitigation-strategies
  14. Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models – arXiv, accessed on May 28, 2025, https://www.arxiv.org/pdf/2505.16446
  15. arxiv.org, accessed on May 28, 2025, https://arxiv.org/abs/2505.16765
  16. [2502.00735] `Do as I say not as I do’: A Semi-Automated Approach for Jailbreak Prompt Attack against Multimodal LLMs – arXiv, accessed on May 28, 2025, https://arxiv.org/abs/2502.00735
  17. ‘Do as I say not as I do’: A Semi-Automated Approach for Jailbreak Prompt Attack against Multimodal LLMs – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2502.00735
  18. Overview: “OWASP Top 10 for LLM Applications 2025: A Comprehensive Guide”, accessed on May 28, 2025, https://dev.to/foxgem/overview-owasp-top-10-for-llm-applications-2025-a-comprehensive-guide-8pk
  19. From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2505.12981v1
  20. The Obvious Invisible Threat: LLM-Powered GUI Agents’ Vulnerability to Fine-Print Injections – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2504.11281v1
  21. Unveiling AI Agent Vulnerabilities Part I: Introduction to AI Agent Vulnerabilities | Trend Micro (US), accessed on May 28, 2025, https://www.trendmicro.com/vinfo/us/security/news/threat-landscape/unveiling-ai-agent-vulnerabilities-part-i-introduction-to-ai-agent-vulnerabilities
  22. Unveiling AI Agent Vulnerabilities Part II: Code Execution | Trend Micro (AU), accessed on May 28, 2025, https://www.trendmicro.com/vinfo/au/security/news/cybercrime-and-digital-threats/unveiling-ai-agent-vulnerabilities-code-execution
  23. aclanthology.org, accessed on May 28, 2025, https://aclanthology.org/2025.findings-naacl.395.pdf
  24. Simon Willison on prompt-injection, accessed on May 28, 2025, https://simonwillison.net/tags/prompt-injection/
  25. Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2412.16555v2
  26. A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?, accessed on May 28, 2025, https://arxiv.org/html/2505.10924v2
  27. Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models – arXiv, accessed on May 28, 2025, https://arxiv.org/abs/2412.16555
  28. arxiv.org, accessed on May 28, 2025, https://arxiv.org/html/2505.12981v2
  29. LLM Agents: How They Work and Where They Go Wrong – Holistic AI, accessed on May 28, 2025, https://www.holisticai.com/blog/llm-agents-use-cases-risks
  30. CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2503.17332v3
  31. From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents – arXiv, accessed on May 28, 2025, https://arxiv.org/abs/2505.12981
  32. From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents – arXiv, accessed on May 28, 2025, https://www.arxiv.org/pdf/2505.12981
  33. Progent: Programmable Privilege Control for LLM Agents – arXiv, accessed on May 28, 2025, https://arxiv.org/pdf/2504.11703
  34. Forewarned is Forearmed: A Survey on Large Language Model-based Agents in Autonomous Cyberattacks – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2505.12786v1
  35. Adversarial Misuse of Generative AI | Google Cloud Blog, accessed on May 28, 2025, https://cloud.google.com/blog/topics/threat-intelligence/adversarial-misuse-generative-ai
  36. Uncovering Hidden Risks: Security in Large Language Model (LLM …, accessed on May 28, 2025, https://aiasiapacific.org/2025/04/16/uncovering-hidden-risks-security-in-large-language-model-llm-supply-chain/
  37. Unveiling AI Agent Vulnerabilities Part IV: Database Access Vulnerabilities – Trend Micro, accessed on May 28, 2025, https://www.trendmicro.com/vinfo/us/security/news/vulnerabilities-and-exploits/unveiling-ai-agent-vulnerabilities-part-iv-database-access-vulnerabilities
  38. Best LLM Security Tools & Open-Source Frameworks in 2025 – Deepchecks, accessed on May 28, 2025, https://www.deepchecks.com/top-llm-security-tools-frameworks/
  39. A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment, accessed on May 28, 2025, https://arxiv.org/html/2504.15585v2
  40. The Hidden Dangers of Browsing AI Agents – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2505.13076v1
  41. Assessing and Enhancing the Robustness of LLM-based Multi-Agent Systems Through Chaos Engineering – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2505.03096
  42. Unveiling AI Agent Vulnerabilities Part II: Code Execution | Trend …, accessed on May 28, 2025, https://www.trendmicro.com/vinfo/us/security/news/cybercrime-and-digital-threats/unveiling-ai-agent-vulnerabilities-code-execution
  43. The Hidden Dangers of Browsing AI Agents – arXiv, accessed on May 28, 2025, https://arxiv.org/pdf/2505.13076
  44. accessed on January 1, 1970, https://arxiv.org/pdf/2505.12981.pdf
  45. Artificial Intelligence – arXiv, accessed on May 28, 2025, https://www.arxiv.org/list/cs.AI/recent?skip=863&show=250
  46. Steganography Beyond Space-Time with Chain of Multimodal AI – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2502.18547v2
  47. Secure Vibe Coding Guide | Become a Citizen Developer | CSA, accessed on May 28, 2025, https://cloudsecurityalliance.org/blog/2025/04/09/secure-vibe-coding-guide
  48. OWASP Top 10 for LLM Applications 2025 – WorldTech IT, accessed on May 28, 2025, https://wtit.com/blog/2025/04/17/owasp-top-10-for-llm-applications-2025/
  49. How to Use Large Language Models (LLMs) with Enterprise and Sensitive Data, accessed on May 28, 2025, https://www.startupsoft.com/llm-sensitive-data-best-practices-guide/
  50. Computer Science – arXiv, accessed on May 28, 2025, http://www.arxiv.org/list/cs/new?skip=825&show=2000
  51. [2504.08977] Robust Steganography from Large Language Models – arXiv, accessed on May 28, 2025, https://arxiv.org/abs/2504.08977
  52. Mitigating Indirect Prompt Injection Attacks on LLMs | Solo.io, accessed on May 28, 2025, https://www.solo.io/blog/mitigating-indirect-prompt-injection-attacks-on-llms
  53. arxiv.org, accessed on May 28, 2025, https://arxiv.org/abs/2405.09090
  54. LLM04: Data and Model Poisoning – GitHub, accessed on May 28, 2025, https://github.com/OWASP/www-project-top-10-for-large-language-model-applications/blob/main/2_0_vulns/LLM04_DataModelPoisoning.md
  55. arXiv:2503.00061v1 [cs.CR] 27 Feb 2025, accessed on May 28, 2025, https://arxiv.org/pdf/2503.00061
  56. Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2503.00061
  57. Rapid Response: Mitigating LLM Jailbreaks With A Few Examples – OpenReview, accessed on May 28, 2025, https://openreview.net/forum?id=V892sBHUbN
  58. Facilitating Threat Modeling by Leveraging Large Language Models – NDSS Symposium, accessed on May 28, 2025, https://www.ndss-symposium.org/ndss-paper/auto-draft-539/
  59. arxiv.org, accessed on May 28, 2025, https://arxiv.org/html/2504.15585v1
  60. A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment, accessed on May 28, 2025, https://www.researchgate.net/publication/391019297_A_Comprehensive_Survey_in_LLM-Agent_Full_Stack_Safety_Data_Training_and_Deployment
  61. (PDF) Assessing the Effectiveness of an LLM-Based Permission …, accessed on May 28, 2025, https://www.researchgate.net/publication/389364401_Assessing_the_Effectiveness_of_an_LLM-Based_Permission_Model_for_Android
  62. SBOM Insights on LLMs, Compliance Attestations and Security Mental Models – Anchore, accessed on May 28, 2025, https://anchore.com/blog/sbom-insights-on-llms-compliance-attestations-and-security-mental-models-anchore-learning-week-day-4/
  63. Arxiv今日论文| 2025-04-02 – 闲记算法, accessed on May 28, 2025, http://lonepatient.top/2025/04/02/arxiv_papers_2025-04-02
  64. Safety of Multimodal Large Language Models on Images and Text – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2402.00357v2
  65. NDSS Symposium 2024 Program, accessed on May 28, 2025, https://www.ndss-symposium.org/ndss-program/symposium-2024/
  66. On the Feasibility of Using MultiModal LLMs to Execute AR Social Engineering Attacks, accessed on May 28, 2025, https://www.researchgate.net/publication/390959770_On_the_Feasibility_of_Using_MultiModal_LLMs_to_Execute_AR_Social_Engineering_Attacks
  67. Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis, accessed on May 28, 2025, https://arxiv.org/html/2502.20383v1
  68. Real AI Agents with Fake Memories: Fatal Context Manipulation Attacks on Web3 Agents – Rivista AI, accessed on May 28, 2025, https://www.rivista.ai/wp-content/uploads/2025/05/2503.16248v2.pdf
  69. Dissecting Adversarial Robustness of Multimodal LM Agents – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2406.12814v2
  70. Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2504.19956v1
  71. [2503.16585] Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions – arXiv, accessed on May 28, 2025, https://arxiv.org/abs/2503.16585
  72. [2505.01177] LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures – arXiv, accessed on May 28, 2025, https://arxiv.org/abs/2505.01177
  73. Multimodal LLMs for Phishing Detection – Global Anti-Scam Alliance, accessed on May 28, 2025, https://www.gasa.org/post/multimodal-llms-for-phishing-detection
  74. Adversarial Robustness for Visual Grounding of Multimodal Large Language Models – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2405.09981v1
  75. Automatically Generating Rules of Malicious Software Packages via Large Language Model – arXiv, accessed on May 28, 2025, https://arxiv.org/pdf/2504.17198
  76. accessed on January 1, 1970, https://arxiv.org/pdf/2412.16555.pdf
  77. Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs – arXiv, accessed on May 28, 2025, https://arxiv.org/html/2505.04806v1
  78. Mind Mapping Prompt Injection: Visual Prompt Injection Attacks in …, accessed on May 28, 2025, https://www.mdpi.com/2079-9292/14/10/1907
  79. From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs, accessed on May 28, 2025, https://www.researchgate.net/publication/388657639_From_Compliance_to_Exploitation_Jailbreak_Prompt_Attacks_on_Multimodal_LLMs
  80. Arxiv今日论文| 2025-05-15 – 闲记算法, accessed on May 28, 2025, http://lonepatient.top/2025/05/15/arxiv_papers_2025-05-15.html
  81. SPEAKERS – World Summit AI Canada, accessed on May 28, 2025, https://americas.worldsummit.ai/speakers/


Posted

in

by

Tags:

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *