3 Strategies for Making Your ML Testing Mission-Critical Now

Testing machine learning systems is currently more of an art form than a standardized engineering practice. This is particularly problematic for machine learning in mission-critical contexts. This article summarizes three steps from our ML testing series that any development team can take when testing their ML systems.

Lakera Team
November 13, 2024
August 12, 2021

Testing machine learning (ML) systems is currently more of an art form than a standardized engineering practice. This is particularly problematic for machine learning in mission-critical contexts, where strict performance guarantees and regulatory compliance are a must. The best engineering teams at companies like Tesla have built sophisticated testing infrastructure to ensure the reliability of their ML systems. Now it is time to make effective and systematic ML testing a reality for the rest of us as well, including smaller engineering teams.

This article summarizes three steps from our ML testing series that any development team can take when testing their ML systems:

Specify your operational domain 📝

Systematic testing is most effective in the context of an operational domain. An operational domain describes the “specific conditions under which a given [...] automation system is designed to function” [1]. It is a compact representation of the environment in which the system will operate, which can also include the data it will be exposed to and how users will interact with it. A specification of the operational domain can be used to detect data bugs, and it is the starting point for establishing reliability guarantees for the whole system.
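As a concrete illustration, an operational domain can be written down as an explicit, machine-checkable specification that inputs are validated against. The sketch below assumes an image-based system; the field names and bounds are hypothetical, not part of any real specification.

```python
from dataclasses import dataclass

import numpy as np

# Illustrative sketch: an operational domain for an image-based system,
# expressed as an explicit specification. All field names and bounds
# here are assumptions chosen for illustration.

@dataclass(frozen=True)
class OperationalDomain:
    min_brightness: float = 0.2          # mean pixel intensity in [0, 1]
    max_brightness: float = 0.9
    min_resolution: tuple = (224, 224)   # (height, width)

    def contains(self, image: np.ndarray) -> bool:
        """Return True if the input lies inside the operational domain."""
        h, w = image.shape[:2]
        if h < self.min_resolution[0] or w < self.min_resolution[1]:
            return False
        return self.min_brightness <= float(image.mean()) <= self.max_brightness
```

Inputs that fall outside the domain can then be flagged as data bugs before they ever reach the model.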

Stress-test your system 🦾

With the operational domain specified, we can check the robustness of the system within the relevant conditions. As with traditional software systems, bugs often appear when a problematic input is presented to the system. Fuzz testing is a popular strategy that looks for these problematic inputs by randomly generating data points both inside and outside of the operational domain. In combination with metamorphic relations, it becomes a powerful tool that developers can use to ensure that their system performs well enough within the operational domain and degrades gracefully when presented with more challenging inputs.
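A minimal sketch of this idea, with a toy stand-in `model` (a threshold classifier, assumed for illustration): the fuzzer generates random inputs, and a metamorphic relation, here that a tiny brightness shift should not flip the prediction, serves as the test oracle.

```python
import numpy as np

# Toy stand-in classifier (an assumption for this sketch): the label
# depends only on the mean intensity of the image.
def model(image: np.ndarray) -> int:
    return int(image.mean() > 0.5)

def metamorphic_fuzz(n_trials: int = 100, shift: float = 0.01, seed: int = 0):
    """Generate random inputs and check a brightness-invariance relation.

    Returns the inputs on which the relation is violated, i.e. candidate
    brittle inputs near the model's decision boundary.
    """
    rng = np.random.default_rng(seed)
    failures = []
    for _ in range(n_trials):
        image = rng.uniform(0.0, 1.0, size=(8, 8))
        # Metamorphic relation: a small brightness shift should not
        # change the predicted label.
        perturbed = np.clip(image + shift, 0.0, 1.0)
        if model(image) != model(perturbed):
            failures.append(image)
    return failures
```

Every returned failure sits right at the decision boundary, which is exactly the kind of brittle behavior fuzzing with metamorphic relations is designed to surface.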

Ensure your system performs when it really matters ✋

The machine learning development process is inherently complex and iterative. Regression sets are a simple but effective tool for ensuring that your system really improves with every iteration. They can be used not only retroactively (e.g., when a bug has been found and you want to make sure it does not reoccur) but also proactively (e.g., to build test sets that actively probe system behavior where performance matters most).
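One way to sketch a regression set, with the file format and function names as illustrative assumptions: every input that once triggered a bug is stored together with its expected output, and the whole set is re-checked on each model iteration.

```python
import json
from pathlib import Path

# Illustrative sketch of a regression set stored as a JSON file.
# The schema and names are assumptions, not a real tool's format.

def add_regression_case(path: Path, case_id: str, inputs, expected) -> None:
    """Record a previously failing case so it is re-tested on every iteration."""
    cases = json.loads(path.read_text()) if path.exists() else {}
    cases[case_id] = {"inputs": inputs, "expected": expected}
    path.write_text(json.dumps(cases, indent=2))

def run_regression_set(path: Path, predict) -> list:
    """Run the model over all stored cases; return the ids that regressed."""
    cases = json.loads(path.read_text()) if path.exists() else {}
    return [cid for cid, c in cases.items()
            if predict(c["inputs"]) != c["expected"]]
```

An empty result from `run_regression_set` then becomes a precondition for shipping a new model iteration.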

Adopting these three strategies is the first step to making ML testing more systematic and effective, and they often provide a high return on investment. This is especially true for smaller engineering teams that lack Tesla's resources but still require strict performance guarantees and want to move through product development quickly and efficiently.

Lakera’s validation engine, MLTest, finds critical performance vulnerabilities in computer vision systems before they enter operation. Built with industry-leading AI and safety expertise, MLTest makes reliability a no-brainer for entire development teams. Get in touch if you want to learn more!

[1] Definition from ISO 21448: Road vehicles — Safety of the intended functionality.
