The Industry Benchmark for AI Agent Security

The b³ Benchmark, built by Lakera's research team, is the most comprehensive independent evaluation of how backbone LLMs perform under real-world adversarial attack. Powered by hundreds of thousands of crowdsourced attacks across today's leading models, it gives security and AI leaders the data they need to make informed model selection decisions.

AI agents inherit the security properties of their backbone LLM, and the model you choose directly impacts your risk posture. The b³ Benchmark isolates and measures backbone LLM security using threat snapshots: a framework that captures real-world attack scenarios across agentic applications.

The rankings below reflect aggregated vulnerability scores across all threat categories and defense levels.

Model Rankings
Rank
Model
Risk Score
Ranked from least to most risk exposure.
1
Claude Sonnet 4
0.30717
view more details
2
Claude 3.7 Sonnet
0.395582
view more details
3
GPT-4o
0.575943
view more details
4
Gemini 1.5 Pro
0.671179
view more details
5
Gemini 1.5 Flash
0.737874
view more details
6
GPT-4.1
0.748453
view more details
7
Claude 3 Haiku
0.774942
view more details
8
Meta Llama 3.3 70B Instruct
0.801155
view more details
9
Meta Llama 3.1 8B Instruct
0.80151
view more details
10
Meta Llama 4 Scout
0.811721
view more details
11
Gemma 3 12B
0.818806
view more details
12
DeepSeek-V3
0.888484
view more details
13
Gemini 2.0 Flash
0.900635
view more details
14
Meta Llama 4 Maverick
0.913013
view more details
1
Claude 4 Sonnet
23.86
view more details
coming soon!
2
Claude 3.7 Sonnet
31.54
view more details
coming soon!
3
GPT-4o
60.04
view more details
coming soon!
4
GPT-4o-mini
64.23
view more details
coming soon!
5
GPT-4.1
71.62
view more details
coming soon!
6
Gemini 1.5 Pro
72.64
view more details
coming soon!
7
GPT-5
75.25
view more details
coming soon!
8
Claude 3 Haiku
82.82
view more details
coming soon!
9
Meta Llama 3.1 8B Instruct
83.72
view more details
coming soon!
10
Gemma 3 12B
83.96
view more details
coming soon!
11
Gemini 1.5 Flash
84.20
view more details
coming soon!
12
Meta Llama 3.3 70B Instruct
86.02
view more details
coming soon!
13
Qwen-2.5-coder-32B
86.07
view more details
coming soon!
14
Meta Llama 4 Scout
88.14
view more details
coming soon!
15
Qwen3-coder
88.80
view more details
coming soon!
16
GPT-oss-120b
89.03
view more details
coming soon!
17
DeepSeek-V3
89.46
view more details
coming soon!
18
Gemini 2.0 Flash
90.84
view more details
coming soon!
19
Meta Llama 4 Maverick
91.88
view more details
coming soon!
20
Kimi K2
96.25
view more details
coming soon!

Why the b³ Benchmark Matters

AI agents inherit the security properties of their backbone LLM. The b³ Benchmark is designed for security teams and AI leaders who need real-world visibility.

Why This Benchmark Matters:
  • Highlights comparative resilience to inform model selection and risk management decisions
  • Benchmarks against real world attack techniques like prompt injections, jailbreaks, data exfiltration, and indirect attack vectors
  • Quantifies exploitability across key threat categories
  • Provides up-to-date, independent benchmarks for security and AI leaders
Agentic Threat Coverage
Evaluates models across realistic agentic threat scenarios, covering the full spectrum of how LLMs are actually deployed today.
Real Attacks, Not Synthetic Prompts
Attacks were collected through large-scale gamified crowdsourcing, where hundreds of participants competed to break AI agents.
Fine-Grained Risk Insights
Gain insights by attack type (direct vs. indirect), task type (instruction override, tool invocation, context extraction), and defense level to find the right model for your specific use case.

What Sets it Apart

Full Attack Categorization

Comprehensive attack coverage
Covers six major attack task types spanning direct and indirect attacks, tool manipulation, data exfiltration, and denial of service.

Multi-Level Defense Evaluation

Tests models across defense configurations
Every model is evaluated under three defense levels: minimal system prompt constraints, hardened system prompts with extended context, and LLM-as-judge self-defense.

Crowdsourced Attack Quality

Crowdsourced, not automated
The benchmark attacks were selected from hundreds of thousands of human-generated attempts, representing less than 1% of total attack data.

Go Beyond the Benchmark with AI Red Teaming

The b³ Benchmark tells you which models are most resilient. The AI Red Teaming platform tells you whether your AI system is secure.

AI Red Teaming is our AI security testing platform, used by enterprises to red team their AI systems before attackers do. Run automated security scans across your AI agents, test against real-world attack techniques, and get actionable findings you can hand directly to your engineering team.
Subscribe to New AI Model Risk Reports
Receive the latest AI Model Risk reports and real-world security insights when they launch.
Stay Ahead
of AI Threats
Access our full methodology or get notified of new results when they drop.