Cookie Consent
Hi, this website uses essential cookies to ensure its proper operation and tracking cookies to understand how you interact with it. The latter will be set only after consent.
Read our Privacy Policy
Back

Day Zero: Building a Superhuman AI Red Teamer From Scratch

This series explores the challenges of AI red teaming, why traditional security approaches fall short, and what it takes to build an AI red teamer that surpasses human experts.

Mateo Rojas-Carulla
March 10, 2025
Last updated: 
March 26, 2025
Learn how to protect against the most common LLM vulnerabilities

Download this guide to delve into the most common LLM security risks and ways to mitigate them.

In-context learning

As users increasingly rely on Large Language Models (LLMs) to accomplish their daily tasks, their concerns about the potential leakage of private data by these models have surged.

[Provide the input text here]

[Provide the input text here]

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat. Aenean faucibus nibh et justo cursus id rutrum lorem imperdiet. Nunc ut sem vitae risus tristique posuere.

Lorem ipsum dolor sit amet, Q: I had 10 cookies. I ate 2 of them, and then I gave 5 of them to my friend. My grandma gave me another 2boxes of cookies, with 2 cookies inside each box. How many cookies do I have now?

Title italic

A: At the beginning there was 10 cookies, then 2 of them were eaten, so 8 cookies were left. Then 5 cookieswere given toa friend, so 3 cookies were left. 3 cookies + 2 boxes of 2 cookies (4 cookies) = 7 cookies. Youhave 7 cookies.

English to French Translation:

Q: A bartender had 20 pints. One customer has broken one pint, another has broken 5 pints. A bartender boughtthree boxes, 4 pints in each. How many pints does bartender have now?

Lorem ipsum dolor sit amet, line first
line second
line third

Lorem ipsum dolor sit amet, Q: I had 10 cookies. I ate 2 of them, and then I gave 5 of them to my friend. My grandma gave me another 2boxes of cookies, with 2 cookies inside each box. How many cookies do I have now?

Title italic Title italicTitle italicTitle italicTitle italicTitle italicTitle italic

A: At the beginning there was 10 cookies, then 2 of them were eaten, so 8 cookies were left. Then 5 cookieswere given toa friend, so 3 cookies were left. 3 cookies + 2 boxes of 2 cookies (4 cookies) = 7 cookies. Youhave 7 cookies.

English to French Translation:

Q: A bartender had 20 pints. One customer has broken one pint, another has broken 5 pints. A bartender boughtthree boxes, 4 pints in each. How many pints does bartender have now?

To secure AI, we must first learn how to break it–red teaming is the key to understanding how AI systems based on Large Language Models (LLMs) fail when faced with real-world threats. This blog series sets the foundation for building a superhuman AI red teaming agent.

Red teaming LLMs is difficult, and human crowdsourcing remains the gold standard. Gandalf, which will make an appearance later in the series, has been a fantastic way to see the power of millions of AI red teamers in action.

But humans are limited by cost, speed, and scalability. Our goal isn’t just automation. It’s to surpass human ingenuity, crafting a system that can probe, break, and stress-test AI applications better than any human ever could–to benefit security and trust in AI for all of us.

As part of the series, we will:

  • Explain why LLMs introduce entirely new security challenges.
  • Explore the difficulties of building an effective automated red teamer, drawing insights from millions of attacks on Gandalf.
  • Develop a red teaming benchmark to rigorously assess red teamer capabilities.
  • Release the benchmark publicly.

To start, we need to define the challenge. This first post will set the stage, explaining what red teaming means in the context of LLMs and this series, and why attacking LLM applications is fundamentally different from attacking traditional software.

Hide table of contents
Show table of contents

The novel vulnerabilities of AI applications

AI applications that leverage LLMs are not exempt from classic security issues. A misconfigured vector database can leak sensitive data. Networking vulnerabilities can expose internal APIs to unintended access. The traditional cybersecurity playbook still applies.

However, LLMs introduce an entirely new attack surface because they operate under a fundamentally different paradigm, one where data itself can function as an executable. Let’s illustrate that with two examples exploiting real-world LLM systems.

Example 1: Adversarial Search Engine Optimization (SEO) attacks 

Adversarial Search Engine Optimization (SEO) attacks 

The exploit from this paper illustrates the LLM vulnerability and is summarized in the diagram above. The team created a website containing information about a fake camera.

Within the site, they embedded hidden prompt injections designed to manipulate an LLM into believing that the fake camera was superior to well-known alternatives. They then showed that, when asking Bing’s search LLM to choose the best camera among some alternatives, the LLM often surfaced the fake camera, and even ranked it above real, more prominent cameras!

How does this vulnerability work?

  • The LLM retrieves the website in question as candidate data to answer the user query. Note that this data is owned by a third party and the model owner has no control over it!
  • The LLM processes the retrieved data and, while doing that, it encounters malicious instructions left for the LLM. This makes the data become executable: the LLM just executed a piece of malware.
  • As a result, the LLM surfaces false information to an unsuspecting user. This is just an example – the attacker could have targeted some different behavior instead.

Here is an example response from Bing LLM (Invis OptiPix is a fake camera on the author’s website):

Example response from Bing LLM (Invis OptiPix is a fake camera on the author’s website)

Example 2: Attacking the Gemini assistant

Attacking the Gemini assistant

We red-teamed Gemini’s workspace assistant. In the example above, the attacker embeds an attack on the LLM in plain white letters in an email, invisible to the human.

  • The user asks the assistant to summarize the email.
  • The LLM reads the white text and interprets the instructions to add a phishing link to the message. These instructions are now “code” executed by the assistant. Once again, notice that the assistant has no control over third-party data such as emails.
  • The assistant executes the attacker’s commands and serves the malicious link.

So, how do we define this vulnerability?

Let’s have a stab at a definition:

LLMs are vulnerable because they don't strictly separate developer instructions from external inputs, allowing attackers to manipulate their behavior via data.

This creates a profound security challenge: attackers don’t need direct access to a system to exploit it. Instead, they manipulate the data fed into the LLM, turning the model itself into an attack vector.

The internet is already an adversarial environment, fueling attacks like adversarial SEO (as seen in the fake camera example) and broader data poisoning techniques. This will only get worse. For a deep dive on prompt attacks, Simon Willson’s blog has great content. 

Later in the series, we’ll thoroughly define the categories of possible attacks, which will be crucial for evaluating a red teaming agent’s effectiveness across the board. We believe it is essential to be exhaustive on the categories of attacks that are possible, and the outcomes the attacker may be trying to achieve by attacking the LLM. 

In this series, we will focus exclusively on the LLM vulnerability defined above. We don’t consider classical vulnerabilities unimportant. However, since the pentesting literature has covered them extensively, we focus on novel vulnerabilities where effective red teaming is still a major challenge.

Why this challenge is different from traditional cybersecurity

Traditional cybersecurity has long focused on protecting structured systems, where vulnerabilities typically stem from flawed logic, misconfigurations, or code execution bugs. Security research has developed well-established techniques for identifying and mitigating these issues.

LLM vulnerabilities are unique because they arise from the very capabilities that make these models powerful. LLMs excel at interpreting and acting on diverse inputs—multilingual text, code, JSON, and more—enabling transformative applications with a large impact in productivity. However, this flexibility makes it difficult to patch vulnerabilities without degrading model performance, often requiring significant compute to retrain. This security-utility trade-off is explored in depth in our paper, Gandalf the Red

To make things worse, attackers don’t need direct access to the model or its components; they can exploit weaknesses simply by manipulating third-party data.

Addressing these LLM vulnerabilities is fundamentally an AI challenge because AI is the only method we have to both identify these issues in data and generate the high-quality attacks needed for effective red teaming. No other approach matches AI’s ability to detect these subtle patterns and create sophisticated adversarial content.

To better convey the difficulty of solving these challenges, it helps to look at analogies from other AI-driven fields. Identifying whether an attack is present in text or images is similar to building autonomous cars—but with the added complexity of adversarial behavior. It’s akin to building a self-driving car that must not only operate reliably across extreme conditions and rare edge cases but also withstand deliberate attempts by attackers to make it fail.

Where do we start?

Now that we have a shared understanding of the vulnerabilities in AI systems, we can begin tackling the challenge head-on. Our goal is to build an automated red teaming agent capable of outpacing human experts in identifying and exploiting weaknesses in AI applications.

In our next post, we’ll break down the core obstacles: why human ingenuity has been the gold standard, what makes automating red teaming uniquely hard, and what it will take to build an AI agent that can challenge—and ultimately exceed—human capabilities.

Ready to see what it takes to outmatch human red teamers? 

Stay with us.

Lakera LLM Security Playbook
Learn how to protect against the most common LLM vulnerabilities

Download this guide to delve into the most common LLM security risks and ways to mitigate them.

Unlock Free AI Security Guide.

Discover risks and solutions with the Lakera LLM Security Playbook.

Download Free

Explore Prompt Injection Attacks.

Learn LLM security, attack strategies, and protection tools. Includes bonus datasets.

Unlock Free Guide

Learn AI Security Basics.

Join our 10-lesson course on core concepts and issues in AI security.

Enroll Now

Evaluate LLM Security Solutions.

Use our checklist to evaluate and select the best LLM security tools for your enterprise.

Download Free

Uncover LLM Vulnerabilities.

Explore real-world LLM exploits, case studies, and mitigation strategies with Lakera.

Download Free

The CISO's Guide to AI Security

Get Lakera's AI Security Guide for an overview of threats and protection strategies.

Download Free

Explore AI Regulations.

Compare the EU AI Act and the White House’s AI Bill of Rights.

Download Free
Mateo Rojas-Carulla

GenAI Security Preparedness
Report 2024

Get the first-of-its-kind report on how organizations are preparing for GenAI-specific threats.

Free Download
Read LLM Security Playbook

Learn about the most common LLM threats and how to prevent them.

Download

Explore AI Regulations.

Compare the EU AI Act and the White House’s AI Bill of Rights.

Understand AI Security Basics.

Get Lakera's AI Security Guide for an overview of threats and protection strategies.

Uncover LLM Vulnerabilities.

Explore real-world LLM exploits, case studies, and mitigation strategies with Lakera.

Optimize LLM Security Solutions.

Use our checklist to evaluate and select the best LLM security tools for your enterprise.

Master Prompt Injection Attacks.

Discover risks and solutions with the Lakera LLM Security Playbook.

Unlock Free AI Security Guide.

Discover risks and solutions with the Lakera LLM Security Playbook.

You might be interested
10
min read
Research

RAG Under Attack: How the LLM Vulnerability Affects Real Systems

In part one, we showed how LLMs can be tricked into executing data. This time, we look at how that plays out in real-world RAG systems—where poisoned context can lead to phishing, data leaks, and guardrail bypasses, even in internal apps.
Peter Dienes
March 26, 2025
10
min read
Research

Gandalf the Red: Rethinking LLM Security with Adaptive Defenses

Lakera's latest research introduces adaptive defense strategies to enhance LLM security against evolving threats while balancing the need for usability.
Lakera Team
March 26, 2025
Activate
untouchable mode.
Get started for free.

Lakera Guard protects your LLM applications from cybersecurity risks with a single line of code. Get started in minutes. Become stronger every day.

Join our Slack Community.

Several people are typing about AI/ML security. 
Come join us and 1000+ others in a chat that’s thoroughly SFW.