Day Zero: Building a Superhuman AI Red Teamer From Scratch
This series explores the challenges of AI red teaming, why traditional security approaches fall short, and what it takes to build an AI red teamer that surpasses human experts.

This series explores the challenges of AI red teaming, why traditional security approaches fall short, and what it takes to build an AI red teamer that surpasses human experts.
Download this guide to delve into the most common LLM security risks and ways to mitigate them.
In-context learning
As users increasingly rely on Large Language Models (LLMs) to accomplish their daily tasks, their concerns about the potential leakage of private data by these models have surged.
[Provide the input text here]
[Provide the input text here]
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat. Aenean faucibus nibh et justo cursus id rutrum lorem imperdiet. Nunc ut sem vitae risus tristique posuere.
Lorem ipsum dolor sit amet, Q: I had 10 cookies. I ate 2 of them, and then I gave 5 of them to my friend. My grandma gave me another 2boxes of cookies, with 2 cookies inside each box. How many cookies do I have now?
Title italic
A: At the beginning there was 10 cookies, then 2 of them were eaten, so 8 cookies were left. Then 5 cookieswere given toa friend, so 3 cookies were left. 3 cookies + 2 boxes of 2 cookies (4 cookies) = 7 cookies. Youhave 7 cookies.
English to French Translation:
Q: A bartender had 20 pints. One customer has broken one pint, another has broken 5 pints. A bartender boughtthree boxes, 4 pints in each. How many pints does bartender have now?
Lorem ipsum dolor sit amet, line first
line second
line third
Lorem ipsum dolor sit amet, Q: I had 10 cookies. I ate 2 of them, and then I gave 5 of them to my friend. My grandma gave me another 2boxes of cookies, with 2 cookies inside each box. How many cookies do I have now?
Title italic Title italicTitle italicTitle italicTitle italicTitle italicTitle italic
A: At the beginning there was 10 cookies, then 2 of them were eaten, so 8 cookies were left. Then 5 cookieswere given toa friend, so 3 cookies were left. 3 cookies + 2 boxes of 2 cookies (4 cookies) = 7 cookies. Youhave 7 cookies.
English to French Translation:
Q: A bartender had 20 pints. One customer has broken one pint, another has broken 5 pints. A bartender boughtthree boxes, 4 pints in each. How many pints does bartender have now?
To secure AI, we must first learn how to break it–red teaming is the key to understanding how AI systems based on Large Language Models (LLMs) fail when faced with real-world threats. This blog series sets the foundation for building a superhuman AI red teaming agent.
Red teaming LLMs is difficult, and human crowdsourcing remains the gold standard. Gandalf, which will make an appearance later in the series, has been a fantastic way to see the power of millions of AI red teamers in action.
But humans are limited by cost, speed, and scalability. Our goal isn’t just automation. It’s to surpass human ingenuity, crafting a system that can probe, break, and stress-test AI applications better than any human ever could–to benefit security and trust in AI for all of us.
As part of the series, we will:
To start, we need to define the challenge. This first post will set the stage, explaining what red teaming means in the context of LLMs and this series, and why attacking LLM applications is fundamentally different from attacking traditional software.
AI applications that leverage LLMs are not exempt from classic security issues. A misconfigured vector database can leak sensitive data. Networking vulnerabilities can expose internal APIs to unintended access. The traditional cybersecurity playbook still applies.
However, LLMs introduce an entirely new attack surface because they operate under a fundamentally different paradigm, one where data itself can function as an executable. Let’s illustrate that with two examples exploiting real-world LLM systems.
The exploit from this paper illustrates the LLM vulnerability and is summarized in the diagram above. The team created a website containing information about a fake camera.
Within the site, they embedded hidden prompt injections designed to manipulate an LLM into believing that the fake camera was superior to well-known alternatives. They then showed that, when asking Bing’s search LLM to choose the best camera among some alternatives, the LLM often surfaced the fake camera, and even ranked it above real, more prominent cameras!
How does this vulnerability work?
Here is an example response from Bing LLM (Invis OptiPix is a fake camera on the author’s website):
We red-teamed Gemini’s workspace assistant. In the example above, the attacker embeds an attack on the LLM in plain white letters in an email, invisible to the human.
Let’s have a stab at a definition:
LLMs are vulnerable because they don't strictly separate developer instructions from external inputs, allowing attackers to manipulate their behavior via data.
This creates a profound security challenge: attackers don’t need direct access to a system to exploit it. Instead, they manipulate the data fed into the LLM, turning the model itself into an attack vector.
The internet is already an adversarial environment, fueling attacks like adversarial SEO (as seen in the fake camera example) and broader data poisoning techniques. This will only get worse. For a deep dive on prompt attacks, Simon Willson’s blog has great content.
Later in the series, we’ll thoroughly define the categories of possible attacks, which will be crucial for evaluating a red teaming agent’s effectiveness across the board. We believe it is essential to be exhaustive on the categories of attacks that are possible, and the outcomes the attacker may be trying to achieve by attacking the LLM.
In this series, we will focus exclusively on the LLM vulnerability defined above. We don’t consider classical vulnerabilities unimportant. However, since the pentesting literature has covered them extensively, we focus on novel vulnerabilities where effective red teaming is still a major challenge.
Traditional cybersecurity has long focused on protecting structured systems, where vulnerabilities typically stem from flawed logic, misconfigurations, or code execution bugs. Security research has developed well-established techniques for identifying and mitigating these issues.
LLM vulnerabilities are unique because they arise from the very capabilities that make these models powerful. LLMs excel at interpreting and acting on diverse inputs—multilingual text, code, JSON, and more—enabling transformative applications with a large impact in productivity. However, this flexibility makes it difficult to patch vulnerabilities without degrading model performance, often requiring significant compute to retrain. This security-utility trade-off is explored in depth in our paper, Gandalf the Red.
To make things worse, attackers don’t need direct access to the model or its components; they can exploit weaknesses simply by manipulating third-party data.
Addressing these LLM vulnerabilities is fundamentally an AI challenge because AI is the only method we have to both identify these issues in data and generate the high-quality attacks needed for effective red teaming. No other approach matches AI’s ability to detect these subtle patterns and create sophisticated adversarial content.
To better convey the difficulty of solving these challenges, it helps to look at analogies from other AI-driven fields. Identifying whether an attack is present in text or images is similar to building autonomous cars—but with the added complexity of adversarial behavior. It’s akin to building a self-driving car that must not only operate reliably across extreme conditions and rare edge cases but also withstand deliberate attempts by attackers to make it fail.
Now that we have a shared understanding of the vulnerabilities in AI systems, we can begin tackling the challenge head-on. Our goal is to build an automated red teaming agent capable of outpacing human experts in identifying and exploiting weaknesses in AI applications.
In our next post, we’ll break down the core obstacles: why human ingenuity has been the gold standard, what makes automating red teaming uniquely hard, and what it will take to build an AI agent that can challenge—and ultimately exceed—human capabilities.
Ready to see what it takes to outmatch human red teamers?
Stay with us.
Download this guide to delve into the most common LLM security risks and ways to mitigate them.
Get the first-of-its-kind report on how organizations are preparing for GenAI-specific threats.
Compare the EU AI Act and the White House’s AI Bill of Rights.
Get Lakera's AI Security Guide for an overview of threats and protection strategies.
Explore real-world LLM exploits, case studies, and mitigation strategies with Lakera.
Use our checklist to evaluate and select the best LLM security tools for your enterprise.
Discover risks and solutions with the Lakera LLM Security Playbook.
Discover risks and solutions with the Lakera LLM Security Playbook.
Subscribe to our newsletter to get the recent updates on Lakera product and other news in the AI LLM world. Be sure you’re on track!
Lakera Guard protects your LLM applications from cybersecurity risks with a single line of code. Get started in minutes. Become stronger every day.
Several people are typing about AI/ML security. Come join us and 1000+ others in a chat that’s thoroughly SFW.