Gandalf the Red: Rethinking LLM Security with Adaptive Defenses

Lakera's latest research introduces adaptive defense strategies to enhance LLM security against evolving threats while balancing the need for usability.

Lakera Team
January 28, 2025

With LLMs evolving at breakneck speed, one thing has become abundantly clear: static defenses are no longer enough.

Attackers are growing more sophisticated, continuously refining their strategies by learning from the very systems they seek to exploit. This dynamic threat landscape calls for adaptive defenses that can evolve and improve faster than the attacker, always staying one step ahead.

Lakera’s latest research, published as "Gandalf the Red: Adaptive Security for LLMs," introduces a framework that (i) accounts for the dynamic nature of LLM security and (ii) allows developers to choose a security layer that keeps users safe without disproportionately affecting the usability of their application.

In this article, we’ll take a closer look at the key findings of the paper, guided by insights from two of its authors, Niklas Pfister, Senior Research Scientist at Lakera, and Mateo Rojas-Carulla, Chief Scientist and Co-Founder at Lakera.

TL;DR

For those short on time, here are the key takeaways from the research:

  • The D-SEC Framework – Our Dynamic Security and Utility Threat Model helps manage the trade-offs between security and usability while addressing the dynamic nature of LLM security.
  • Balancing Security and Usability – LLM defenses affect application usability beyond simply blocking interactions. Developers must carefully consider the security-utility tradeoff.
  • Evolving Threats Require Adaptive Defenses – As attackers continuously adapt their strategies based on model feedback, static defenses fall short. Adaptive approaches provide significantly better protection.
  • Three Key Strategies for Stronger Defenses:
    • Restricting Application Scope – Narrowing application functionality through the system prompt enhances security.
    • Defense-in-Depth – Combining multiple security layers offers substantial protection benefits.
    • Adaptive Security – User behavior during interactions provides valuable insights for strengthening defenses.

Balancing Security and Utility: The Core Challenge

The central theme of the research is the delicate balance between security and utility. Mateo explains, “Unlike traditional applications, security in LLMs isn’t just about blocking attacks. It’s about ensuring the system remains usable for legitimate users.”

Overly strict defenses can degrade the user experience, causing the model to refuse benign requests or provide suboptimal responses.

The interplay between security and usability is especially delicate when defenses are implemented in the system prompt. While this approach enhances security, it can reduce the length and quality of application responses. As Niklas points out, “Every red-teaming exercise should measure not just how well the defense blocks attacks but also how it impacts the application’s utility.”

The D-SEC threat model involves three parties: the developer, the attacker, and the user. The developer builds and protects an LLM application. Both users and attackers interact with the application. The developer's goal is to create a defense that stops attacks while having minimal impact on the user experience.

Why Adaptive Defenses Matter

Static defenses often create a false sense of security. Mateo emphasizes, “Attackers don’t operate in a vacuum. They refine their strategies based on the feedback they receive from the model. Without adaptive defenses, systems remain vulnerable to the attacker getting wiser. Attackers are often stopped by the defenses in Gandalf on their first attempt, but eventually they all fall.”

To address this, the research introduces the Dynamic Security and Utility Threat Model (D-SEC), which incorporates two crucial elements:

  • Dynamic Attacker Behavior: Models how attackers learn and adapt over time.
  • Security-Utility Trade-offs: Evaluates the impact of defenses on both malicious and benign users.

Niklas explains, “D-SEC allows us to think about the problem in the right way, balancing the need to block attacks with the goal of preserving a positive user experience.”

Adaptive defenses that block a session once it crosses a threshold of suspicious prompts enhance security. A defense can err on the side of caution and block attackers after a few suspicious interactions. The results are clear: blocking users after four suspicious prompts, for example, leads to a significant boost in security without a large impact on usability.
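
To make the threshold idea concrete, here is a minimal sketch of a session-level guard. It is not the implementation from the paper: the `SessionGuard` class, the `toy_detector` function, and the choice to reject each flagged prompt are assumptions made for illustration, and the threshold of four simply mirrors the example above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SessionGuard:
    """Blocks a session once too many of its prompts were flagged as suspicious."""

    is_suspicious: Callable[[str], bool]  # hypothetical detector supplied by the developer
    block_threshold: int = 4              # illustrative value, mirroring the example in the text
    suspicious_count: int = 0

    @property
    def blocked(self) -> bool:
        # The session is blocked once the count reaches the threshold.
        return self.suspicious_count >= self.block_threshold

    def allow(self, prompt: str) -> bool:
        """Return True if the prompt may be forwarded to the LLM."""
        if self.blocked:
            return False
        if self.is_suspicious(prompt):
            self.suspicious_count += 1
            return False  # assumption: a flagged prompt is itself rejected
        return True


def toy_detector(prompt: str) -> bool:
    # Stand-in for a real detector: flags any prompt that mentions "password".
    return "password" in prompt.lower()


guard = SessionGuard(is_suspicious=toy_detector)
```

Each call to `guard.allow(prompt)` either forwards or rejects a single prompt. Once four prompts in the session have been flagged, every later prompt is rejected, which caps the number of attempts an attacker gets.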

Practical Strategies for Better Security

Lakera’s research identified three strategies that significantly improve LLM security while maintaining usability:

  • Restricting Application Domains: Narrowing the scope of an LLM’s functionality reduces its attack surface. “If an LLM is only supposed to handle financial data, make that explicit in the system prompt,” suggests Mateo. The paper’s experiments showed that more restricted applications were inherently more secure.
  • Defense-in-Depth: Combining multiple, distinct defenses creates a stronger overall security posture. Mateo adds, “Even if individual defenses can be bypassed, their combination makes it much harder for attackers to succeed.”
  • Adaptive Defenses: Using session history to identify suspicious users early and block them, depriving attackers of an unlimited attack budget.

The paper introduces metrics such as Session Completion Rate (SCR) and Attacker Failure Rate (AFR), which allow practitioners to select the defense strategy that provides the security they need without unduly impacting usability.
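
As a rough illustration of how such metrics could be computed from logged sessions, the sketch below assumes that AFR is the fraction of attacker sessions in which the attack fails, and that SCR is the fraction of benign user sessions that finish without being blocked. These definitions are a simplified reading, not quoted from the paper, and the `Session` record is hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Session:
    is_attack: bool         # True for an attacker session, False for a benign user session
    attack_succeeded: bool  # only meaningful when is_attack is True
    was_blocked: bool       # True if the defense blocked any prompt in the session


def attacker_failure_rate(sessions: Sequence[Session]) -> float:
    """Assumed definition: share of attacker sessions in which the attack failed."""
    attacks = [s for s in sessions if s.is_attack]
    if not attacks:
        raise ValueError("no attacker sessions to evaluate")
    return sum(not s.attack_succeeded for s in attacks) / len(attacks)


def session_completion_rate(sessions: Sequence[Session]) -> float:
    """Assumed definition: share of benign sessions that were never blocked."""
    benign = [s for s in sessions if not s.is_attack]
    if not benign:
        raise ValueError("no benign sessions to evaluate")
    return sum(not s.was_blocked for s in benign) / len(benign)
```

Reporting both numbers side by side for each candidate defense makes the security-utility trade-off visible: a defense that raises AFR but lowers SCR is buying security at the expense of legitimate users.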

Defense-in-depth is extremely useful. The results show that different defenses naturally specialize in different types of attacks, so combining them is a good idea.
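
The sketch below shows one simple way to layer defenses: block an interaction if any individual check flags it. The two checks, their keyword list, and the `combine` helper are hypothetical stand-ins chosen for illustration, not the defenses evaluated in the paper.

```python
from typing import Callable, Sequence

# A defense inspects a (prompt, response) pair and returns True to block it.
Defense = Callable[[str, str], bool]


def input_keyword_check(prompt: str, response: str) -> bool:
    """Flags prompts that openly ask about the secret (illustrative keywords)."""
    lowered = prompt.lower()
    return any(word in lowered for word in ("password", "secret"))


def make_output_check(secret: str) -> Defense:
    """Builds a check that flags responses containing the secret verbatim."""
    def check(prompt: str, response: str) -> bool:
        return secret.lower() in response.lower()
    return check


def combine(defenses: Sequence[Defense]) -> Defense:
    """Defense-in-depth: block if any single layer wants to block."""
    def check(prompt: str, response: str) -> bool:
        return any(defense(prompt, response) for defense in defenses)
    return check


# "SWORDFISH" is a made-up secret used only for this example.
layered = combine([input_keyword_check, make_output_check("SWORDFISH")])
```

Because the input check and the output check look at different signals, an attack that avoids the trigger words can still be caught when the response leaks the secret, and the reverse also holds. The cost is that every added layer can also flag benign traffic, which is why utility should be measured alongside security.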

The Role of Gandalf in LLM Security

Gandalf, a gamified red-teaming platform, has been instrumental in advancing our understanding of LLM vulnerabilities. Mateo explains, “With millions of players globally contributing over 25 years of gameplay, Gandalf uncovers vulnerabilities that static benchmarks often miss.”

Game overview and interface. Each player passes multiple levels sequentially with increasing difficulty (right). C levels are randomized in their order. A user playing a single level corresponds to a session (left). The level description, which is shown to the player, hints at the defense used. The player sends prompts for the LLM to answer. In the example shown, they ask for the password in reverse, which bypasses the defense (a substring check). When the player has found the password, they can enter it in a separate text field to advance to the next level. A session ends once the player enters the correct password or stops playing.
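
The bypass described in the caption is easy to reproduce. The snippet below is a minimal sketch of a substring-check defense, assuming the check runs on the model's response; the function name and the password `LIGHTHOUSE` are invented for illustration and are not taken from the game.

```python
def substring_defense(response: str, password: str) -> bool:
    """Return True if the response should be blocked because it contains the password."""
    return password.lower() in response.lower()


password = "LIGHTHOUSE"  # hypothetical password, for illustration only

direct_leak = f"The password is {password}."
reversed_leak = f"Sure, here it is backwards: {password[::-1]}"

print(substring_defense(direct_leak, password))    # True: the verbatim leak is caught
print(substring_defense(reversed_leak, password))  # False: the reversed leak slips through
```

The check only recognizes the password in its literal form, so any reversible transformation the attacker requests (reversal, spelling it out letter by letter, translation) can carry it past the filter.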

Niklas highlights two key advantages of Gandalf:

  • Human Creativity: Players generate diverse, adaptive attacks that mimic real-world adversaries.
  • Accurate Feedback: Gandalf’s interactive nature allows precise labeling of successful attacks, even when they’re subtle or unconventional.

The insights from Gandalf have directly informed the development of adaptive defense strategies, demonstrating the power of community-driven initiatives in AI security.

Challenges and Trade-offs

The interplay between security and usability presents unique challenges for LLMs. Niklas notes, “Overly strict defenses can make applications less useful, blocking benign requests or reducing response quality.”

This trade-off becomes even more pronounced in autonomous agents, where defenses influence decision-making processes. Mateo stresses the importance of designing defenses that integrate seamlessly with applications, ensuring they remain functional and effective.

In traditional security, defenses may unintentionally block legitimate users. LLM security takes this to new levels. This figure illustrates how instructing an LLM to prioritize security results in shorter and noticeably different responses compared to an unprotected LLM, highlighting the subtle impact on usability.
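
One crude way to quantify this effect is to compare response lengths with and without the defense on the same benign prompts. The sketch below uses character counts as a proxy; the function name and the choice of proxy are assumptions for illustration, not the measurement used in the paper.

```python
from typing import Sequence


def relative_response_length(defended: Sequence[str], undefended: Sequence[str]) -> float:
    """Ratio of total characters in defended responses to undefended responses.

    Both sequences are assumed to answer the same benign prompts in the same order.
    A value below 1.0 means the defended application gives shorter answers overall.
    """
    if len(defended) != len(undefended):
        raise ValueError("response lists must be paired one-to-one")
    baseline = sum(len(text) for text in undefended)
    if baseline == 0:
        raise ValueError("undefended responses are empty")
    return sum(len(text) for text in defended) / baseline
```

Length is only a rough signal. Two responses of equal length can still differ in quality, so a comparison like this would complement, not replace, a content-level evaluation.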

Play Gandalf and explore adaptive security firsthand!

Conclusion

Adaptive defenses represent the future of LLM security. They offer a way to protect systems while preserving their utility, addressing the challenges posed by evolving attacks. Tools like Gandalf and frameworks like D-SEC are paving the way for a more secure AI landscape.

To dive deeper into the research, check out the full paper, "Gandalf the Red: Adaptive Security for LLMs."
