
6.566 Spring 2026 Lecture 17: AI agent security (Anish Athalye)
Nickolai Zeldovich
Overview
This lecture introduces the emerging field of AI agent security, focusing on the vulnerabilities and potential defenses of autonomous AI systems. It begins by defining AI agents and their interaction with environments, highlighting their high privilege and the inherent untrusted nature of external data sources. The discussion then delves into the foundational concepts of Large Language Models (LLMs) as the core of these agents, explaining their probabilistic next-token prediction mechanism and how they are extended to conversational interfaces and tool use. Finally, the lecture pivots to security, outlining key goals like integrity and confidentiality, common attack vectors such as prompt injection and data exfiltration, and introduces the Camel paper's "dual LLM" approach as a principled defense strategy by decoupling planning and execution.
Save this permanently with flashcards, quizzes, and AI chat
Chapters
- AI agents are systems that perceive their environment, make decisions, and take autonomous actions to achieve user-defined goals.
- Agents often operate with high, ambient privilege, capable of performing actions similar to the user.
- The environment an agent interacts with can be untrusted, especially when involving external data sources like the internet.
- Even without adversarial attacks, AI agents can exhibit unexpected or erroneous behavior due to their probabilistic nature.
- Large Language Models (LLMs) are the foundation of modern AI agents, operating as probabilistic next-token prediction models.
- LLMs can generate text by sampling from probability distributions over possible next characters or tokens, given a preceding sequence.
- Conversational interfaces are built by framing LLM interactions as dialogues, using special tokens to denote user and assistant turns.
- Multi-turn conversations are managed by maintaining a history of messages, which the LLM uses to inform its subsequent responses.
- Agents can be enhanced with tool use, allowing them to interact with external functions or APIs to perform actions beyond text generation.
- The 'React' pattern enables agents to iteratively decide which tool to use, execute it, and then use the result to inform the next decision, creating a loop of perception, action, and reflection.
- More advanced 'Code Act' patterns allow LLMs to generate executable code (e.g., Python) that orchestrates tool calls, potentially improving efficiency and reducing latency by avoiding intermediate LLM processing.
- These capabilities allow agents to perform more complex, multi-step tasks, exhibiting a degree of autonomy in achieving user-defined goals.
- Key security goals for AI agents include integrity (faithfully executing user intent, also known as alignment) and confidentiality (preventing unauthorized data leakage).
- Additional safety goals may involve preventing harmful outputs, avoiding assistance in forbidden activities, and protecting third parties from harm.
- Common attacks include data exfiltration (leaking user data), prompt injection (manipulating agent behavior through crafted inputs), and jailbreaking (bypassing safety mechanisms).
- Indirect prompt injection occurs when an agent retrieves and processes untrusted external data that contains malicious instructions.
- Current defenses include safety training, careful system prompt design, guardrails (e.g., safety classifiers), user confirmation for tool calls, and sandboxing environments.
- The Camel paper proposes a principled system-level defense based on decoupling planning and execution using a 'dual LLM' pattern.
- This pattern employs a 'privileged LLM' to generate Python code based solely on trusted user input, and a 'quarantined LLM' that can be called by this code but has no access to tools or external environments.
- This unidirectional data flow aims to prevent untrusted data from influencing control and data flows and to stop private data from leaking over unauthorized channels.
Key takeaways
- AI agents, while powerful, operate with significant privileges in potentially untrusted environments, making them vulnerable to security threats.
- The core of AI agents, LLMs, are probabilistic models that can be extended for conversation and tool use, increasing their capabilities but also their attack surface.
- Prompt injection is a significant threat where malicious instructions embedded in external data can hijack an agent's intended behavior.
- Security goals for agents include ensuring they faithfully execute user intent (integrity/alignment) and do not leak private data (confidentiality).
- The Camel paper's dual LLM architecture provides a principled defense by separating trusted planning (privileged LLM generating code) from execution (quarantined LLM processing data), creating a unidirectional data flow.
- While current defenses offer layered security, achieving perfect guarantees against all attacks remains challenging due to the non-deterministic nature of LLMs.
Key terms
Test your understanding
- How does an AI agent's ability to perceive its environment and take autonomous actions contribute to its security risks?
- Explain the difference between direct and indirect prompt injection attacks on AI agents.
- What are the primary security goals (integrity and confidentiality) for AI agents, and why are they difficult to achieve?
- How does the dual LLM architecture proposed in the Camel paper aim to mitigate prompt injection vulnerabilities?
- Why is the unidirectional data flow in the Camel architecture considered a key defense mechanism against untrusted data influencing control flow?