6.566 Spring 2026 Lecture 17: AI agent security (Anish Athalye)

Nickolai Zeldovich

5 chapters6 takeaways14 key terms5 questions

Overview

This lecture introduces the emerging field of AI agent security, focusing on the vulnerabilities and potential defenses of autonomous AI systems. It begins by defining AI agents and their interaction with environments, highlighting their high privilege and the inherent untrusted nature of external data sources. The discussion then delves into the foundational concepts of Large Language Models (LLMs) as the core of these agents, explaining their probabilistic next-token prediction mechanism and how they are extended to conversational interfaces and tool use. Finally, the lecture pivots to security, outlining key goals like integrity and confidentiality, common attack vectors such as prompt injection and data exfiltration, and introduces the Camel paper's "dual LLM" approach as a principled defense strategy by decoupling planning and execution.

How was this?

Save this permanently with flashcards, quizzes, and AI chat

Chapters

AI agents are systems that perceive their environment, make decisions, and take autonomous actions to achieve user-defined goals.
Agents often operate with high, ambient privilege, capable of performing actions similar to the user.
The environment an agent interacts with can be untrusted, especially when involving external data sources like the internet.
Even without adversarial attacks, AI agents can exhibit unexpected or erroneous behavior due to their probabilistic nature.

Understanding the agent's operational context and its interaction with potentially untrusted external data is crucial for identifying potential security vulnerabilities.

An agent like Cursor or Claude Code, running on a user's machine, has access to local files and can execute commands, similar to the user, and can also access the internet, which is an untrusted environment.

Large Language Models (LLMs) are the foundation of modern AI agents, operating as probabilistic next-token prediction models.
LLMs can generate text by sampling from probability distributions over possible next characters or tokens, given a preceding sequence.
Conversational interfaces are built by framing LLM interactions as dialogues, using special tokens to denote user and assistant turns.
Multi-turn conversations are managed by maintaining a history of messages, which the LLM uses to inform its subsequent responses.

This section explains the core technology powering AI agents, illustrating how basic text generation evolves into interactive chat systems, which is essential for understanding how agents process information and respond.

Given the prompt 'User: What is the capital of France? Assistant:', a small LLM can correctly predict 'Paris' as the next token, demonstrating its ability to answer factual questions within a conversational context.

Agents can be enhanced with tool use, allowing them to interact with external functions or APIs to perform actions beyond text generation.
The 'React' pattern enables agents to iteratively decide which tool to use, execute it, and then use the result to inform the next decision, creating a loop of perception, action, and reflection.
More advanced 'Code Act' patterns allow LLMs to generate executable code (e.g., Python) that orchestrates tool calls, potentially improving efficiency and reducing latency by avoiding intermediate LLM processing.
These capabilities allow agents to perform more complex, multi-step tasks, exhibiting a degree of autonomy in achieving user-defined goals.

Understanding how agents utilize tools and operate autonomously is key to appreciating their power and complexity, which in turn highlights the increased attack surface and the need for robust security measures.

An agent can use a 'geocode' tool to find a zip code for a city and then use a 'get weather' tool with that zip code to provide a weather report, demonstrating chained tool execution.

Key security goals for AI agents include integrity (faithfully executing user intent, also known as alignment) and confidentiality (preventing unauthorized data leakage).
Additional safety goals may involve preventing harmful outputs, avoiding assistance in forbidden activities, and protecting third parties from harm.
Common attacks include data exfiltration (leaking user data), prompt injection (manipulating agent behavior through crafted inputs), and jailbreaking (bypassing safety mechanisms).
Indirect prompt injection occurs when an agent retrieves and processes untrusted external data that contains malicious instructions.

This section defines what security means in the context of AI agents and introduces the types of threats they face, providing motivation for the subsequent discussion on defenses.

A prompt injection attack could trick an agent into revealing sensitive user information by embedding malicious instructions within a webpage it is asked to summarize.

Current defenses include safety training, careful system prompt design, guardrails (e.g., safety classifiers), user confirmation for tool calls, and sandboxing environments.
The Camel paper proposes a principled system-level defense based on decoupling planning and execution using a 'dual LLM' pattern.
This pattern employs a 'privileged LLM' to generate Python code based solely on trusted user input, and a 'quarantined LLM' that can be called by this code but has no access to tools or external environments.
This unidirectional data flow aims to prevent untrusted data from influencing control and data flows and to stop private data from leaking over unauthorized channels.

The Camel paper's approach offers a more robust, system-level defense strategy against certain classes of attacks by fundamentally restructuring how agents process information and interact with their environment.

A privileged LLM generates Python code to fetch a webpage and then calls a quarantined LLM to summarize it. The quarantined LLM, unable to execute arbitrary code or access tools, cannot be exploited by prompt injection within the fetched webpage.

Key takeaways

1AI agents, while powerful, operate with significant privileges in potentially untrusted environments, making them vulnerable to security threats.
2The core of AI agents, LLMs, are probabilistic models that can be extended for conversation and tool use, increasing their capabilities but also their attack surface.
3Prompt injection is a significant threat where malicious instructions embedded in external data can hijack an agent's intended behavior.
4Security goals for agents include ensuring they faithfully execute user intent (integrity/alignment) and do not leak private data (confidentiality).
5The Camel paper's dual LLM architecture provides a principled defense by separating trusted planning (privileged LLM generating code) from execution (quarantined LLM processing data), creating a unidirectional data flow.
6While current defenses offer layered security, achieving perfect guarantees against all attacks remains challenging due to the non-deterministic nature of LLMs.

Key terms

AI AgentLarge Language Model (LLM)Next-token predictionPrompt InjectionData ExfiltrationJailbreakingIntegrityConfidentialityAlignmentDual LLMPrivileged LLMQuarantined LLMReact PatternCode Act Pattern

Test your understanding

1How does an AI agent's ability to perceive its environment and take autonomous actions contribute to its security risks?
2Explain the difference between direct and indirect prompt injection attacks on AI agents.
3What are the primary security goals (integrity and confidentiality) for AI agents, and why are they difficult to achieve?
4How does the dual LLM architecture proposed in the Camel paper aim to mitigate prompt injection vulnerabilities?
5Why is the unidirectional data flow in the Camel architecture considered a key defense mechanism against untrusted data influencing control flow?