Cookie Consent

Hi, this website uses essential cookies to ensure its proper operation and tracking cookies to understand how you interact with it. The latter will be set only after consent.

Regression Testing for Machine Learning: How to Do It Right

In this blog series, we’ll investigate how we can better test machine learning applications. In the first post, we’ll look at what we mean by ML testing, what an ML bug is, and where they occur, as well as introduce the first technique for your ML testing repertoire: regression testing.

Lakera Team

October 20, 2023

Last updated:

March 27, 2025

Learn how to protect against the most common LLM vulnerabilities

Download this guide to delve into the most common LLM security risks and ways to mitigate them.

Download now

In-context learning

As users increasingly rely on Large Language Models (LLMs) to accomplish their daily tasks, their concerns about the potential leakage of private data by these models have surged.

[Provide the input text here]

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat. Aenean faucibus nibh et justo cursus id rutrum lorem imperdiet. Nunc ut sem vitae risus tristique posuere.

Lorem ipsum dolor sit amet, Q: I had 10 cookies. I ate 2 of them, and then I gave 5 of them to my friend. My grandma gave me another 2boxes of cookies, with 2 cookies inside each box. How many cookies do I have now?
‍
Title italic

A: At the beginning there was 10 cookies, then 2 of them were eaten, so 8 cookies were left. Then 5 cookieswere given toa friend, so 3 cookies were left. 3 cookies + 2 boxes of 2 cookies (4 cookies) = 7 cookies. Youhave 7 cookies.

English to French Translation:

Q: A bartender had 20 pints. One customer has broken one pint, another has broken 5 pints. A bartender boughtthree boxes, 4 pints in each. How many pints does bartender have now?

Lorem ipsum dolor sit amet, line first
line second
line third

Lorem ipsum dolor sit amet, Q: I had 10 cookies. I ate 2 of them, and then I gave 5 of them to my friend. My grandma gave me another 2boxes of cookies, with 2 cookies inside each box. How many cookies do I have now?
‍
Title italic Title italicTitle italicTitle italicTitle italicTitle italicTitle italic

A: At the beginning there was 10 cookies, then 2 of them were eaten, so 8 cookies were left. Then 5 cookieswere given toa friend, so 3 cookies were left. 3 cookies + 2 boxes of 2 cookies (4 cookies) = 7 cookies. Youhave 7 cookies.

English to French Translation:

Q: A bartender had 20 pints. One customer has broken one pint, another has broken 5 pints. A bartender boughtthree boxes, 4 pints in each. How many pints does bartender have now?

Text Link

Machine learning systems are only as reliable as the tests that validate them. Now that we’ve discussed data bugs, let’s shift focus to testing the behavior that emerges from that data.

In this post, we explore one of the most essential techniques in the ML testing toolkit: regression testing. It’s a practical way to improve reliability and maintain consistent performance as your models evolve.

On this page

Hide table of contents

Show table of contents

Regression testing isn’t enough for GenAI. Add dynamic testing and red teaming with Lakera Red.

‍

‍

The Lakera team has accelerated Dropbox’s GenAI journey.

“Dropbox uses Lakera Guard as a security solution to help safeguard our LLM-powered applications, secure and protect user data, and uphold the reliability and trustworthiness of our intelligent features.”

What Is Regression Testing?

In traditional software development, regression testing refers to:

“…re-running functional and non-functional tests to ensure that previously developed and tested software still performs after a change.” [1]

Let’s say you find a bug, fix it, and want to make sure it doesn’t return in future versions. The solution? Add a test for it. That way, if the bug ever reappears, the test will catch it immediately. That’s the essence of regression testing.

Why Regression Testing Matters in Machine Learning

In machine learning, bugs can reappear after something as routine as retraining your model. This is especially likely when your datasets are constantly evolving.

ML regression testing helps you catch these issues. It ensures that your model keeps meeting baseline performance requirements, even as data and parameters change.

A Simple Way to Get Started

Each time your ML system fails on a tricky input, add that example to a “difficult cases” dataset. Use it as a regression test set that becomes part of your testing pipeline. Over time, you’ll build a valuable resource to track whether performance on known weak spots is improving—or breaking again.

Real-World Example: Olympic Integrity

Imagine you’ve built a computer vision system to detect whether runners stay in their lane during races.

The system performs well in cloudy conditions. But on sunny days, it misinterprets a runner’s shadow as the runner stepping out of bounds, triggering a false disqualification alert. That’s an ML bug.

Here’s how regression testing can help:

Collect similar images where shadows confuse the model.
Add them to a dedicated regression dataset.
Retrain the model with improved data.
Regularly test your model on both the standard test set and the regression set.

By doing this, you can catch regressions early and prevent the same bug from recurring in future updates.

Proactive ML Testing with Regression Sets

Regression testing doesn’t have to be reactive. It’s also a great proactive strategy for monitoring ML performance over time—especially in real-world deployments.

Say you’re deploying your model across multiple customer sites. How do you make sure it works equally well everywhere?

By building targeted regression datasets for key scenarios (e.g., different lighting conditions, locations, or user groups), you can track how performance holds up under each context.

Let’s look at how Tesla approaches this.

Regression Testing in Practice: The Tesla Approach

In 2020, Andrej Karpathy, Tesla’s Director of AI, shared how the company uses large-scale regression testing to validate its autopilot system [2].

Tesla has developed an advanced testing infrastructure that can:

Automatically create test sets for specific scenarios.
Mine edge cases from fleet-collected data.
Continuously evaluate system behavior at scale.

They don’t just react to bugs—they design regression sets to proactively stress-test the system.

You Don’t Need to Be Tesla

You can apply similar principles on a much smaller scale.

For example, in the Olympic runner use case, you could create mini regression datasets for:

Male vs. female athletes.
Red vs. blue running tracks.
Bright vs. cloudy lighting conditions.

Tracking model performance across these subsets will give you ongoing insights—and confidence—into how well your system generalizes.

Final Thoughts

Regression testing in machine learning is a low-effort, high-impact strategy to build more trustworthy models. It helps ensure that progress doesn’t come at the cost of stability.

Start small, stay consistent, and watch your model’s reliability improve over time.