
Why testing should be at the core of machine learning development.

AI (artificial intelligence) can help the world scale solutions to our biggest challenges. But if you haven’t experienced or heard about AI’s mishaps, you’ve been living under a rock. Coded bias, unreliable hospital systems, and dangerous robots have littered headlines over the past few years.

Lakera Team
November 13, 2024
January 18, 2022

We now live in a world where AI (artificial intelligence) is used in mission-critical systems yet still developed like consumer technology.
As ML (machine learning) engineers, all we want is to build systems that work without harming others, and to do so without getting stuck on a merry-go-round of prototyping. So, how can we get there?

We can take inspiration from traditional software engineering practices!

Why? Well, do you remember the last time your online stock broker purchased the wrong shares? Or Twitter failed to retrieve your latest tweets? I don’t. Major software malfunctions are so unexpected that when one happens, it makes headline news.

How has software become so reliable?

As C. A. R. Hoare’s classic 1996 article “How did software get so reliable without proof?” points out, the answer likely lies in rigorous development processes, continuous improvement of existing software, and extensive testing.

Traditional software goes through well-defined testing and release processes.

Software engineers are more than familiar with concepts such as:

– Test-driven development

– Unit tests

– Regression tests

– Integration tests

Tests are part of CI/CD (continuous integration/continuous delivery) pipelines. Engineers don’t merge code unless all tests have passed. By the time the code goes to production, they are confident that the software works as expected. They follow a “test-to-ship” strategy.

Software has become so reliable through development processes, continuous improvement of existing software, and extensive testing.

What is our strategy when it comes to ML-driven software? As it turns out, things look a bit different.

Let’s take a look at how we typically develop ML systems today. It’s common practice to split our dataset into training, validation, and testing subsets. The first two become part of the model training loop, whereas the testing subset is used separately, outside of the training loop, to assess performance on unseen data. A typical evaluation strategy would include calculating various metrics over these data subsets and using them as an indication of real-world system performance.
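The split-and-evaluate workflow described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a recommended pipeline: the function names `split_dataset` and `accuracy`, the 70/15/15 proportions, and the toy threshold model are all assumptions made for the example.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train/validation/test subsets."""
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # held out, never used during training
    return train, val, test


def accuracy(predict, examples):
    """Fraction of (input, label) pairs that the model labels correctly."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty subset")
    return sum(1 for x, y in examples if predict(x) == y) / len(examples)


# Toy dataset (hypothetical): the label is 1 when the input is at least 50.
data = [(x, int(x >= 50)) for x in range(100)]
train, val, test = split_dataset(data)


def threshold_model(x):
    """Stand-in for a trained model."""
    return int(x >= 50)


test_accuracy = accuracy(threshold_model, test)
print(len(train), len(val), len(test), test_accuracy)
```

Note that the single number printed at the end is computed on data drawn from the same distribution as the training set, which is exactly the limitation discussed next.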

It turns out that this strategy is often insufficient. Many teams find that their ML systems end up performing ‘well enough’ on their carefully selected datasets but are too brittle to be used in the real world.

At the same time, creating more complete quantitative testing and release processes is often seen as too time-consuming, especially within smaller teams. We have observed many teams that instead spend a lot of time on qualitative testing, which tends to fall short of building a thorough understanding of performance. As a result, computer vision development follows a “ship-to-test” strategy.

This graphic illustrates the development process of computer vision systems. Collect data, train, evaluate, release, deploy and then find vulnerabilities.
Computer vision development mostly follows a “ship-to-test” strategy. This impacts customer experience at best and ends in fatal accidents at worst.

The fact that ML systems are only really tested during operation has obvious and major implications. These systems operate with significant risk because vulnerabilities tend to surface only once they are deployed: pedestrians are not detected at night, COVID diagnostics are fundamentally flawed, or systems exhibit undesired biases. At best, this leads to low customer satisfaction or products that never make it to market, and at worst, it puts people and society at risk.

The good news is that we can bring some of the concepts from traditional software development to ML development. We need to ensure that vulnerabilities are found during development. So, we need to bring back ‘test-to-ship’.
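One way to picture a ‘test-to-ship’ check is a release gate that runs alongside the other tests and fails the build when any data slice underperforms, even if the aggregate metric looks fine. The sketch below is purely illustrative: `SliceResult`, `failing_slices`, `release_gate`, the slice names, the counts, and the 0.9 threshold are all hypothetical, and this is not a description of how any particular product works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    """Evaluation counts for one named subset of the test data."""
    name: str
    correct: int
    total: int

    @property
    def accuracy(self):
        return self.correct / self.total


def failing_slices(results, min_accuracy):
    """Names of slices that are empty or fall below the required accuracy."""
    return [r.name for r in results if r.total == 0 or r.accuracy < min_accuracy]


def release_gate(results, min_accuracy=0.9):
    """Raise if any slice fails, so a CI job would stop the release."""
    failures = failing_slices(results, min_accuracy)
    if failures:
        raise AssertionError(f"release blocked, failing slices: {failures}")


# Hypothetical numbers: daytime images dominate the test set.
results = [
    SliceResult("daytime", correct=980, total=1000),
    SliceResult("night", correct=35, total=50),
]

# The aggregate metric clears the threshold...
overall = sum(r.correct for r in results) / sum(r.total for r in results)

# ...but the per-slice gate still blocks the release.
try:
    release_gate(results)
    gate_passed = True
except AssertionError as err:
    gate_passed = False
    print(err)
```

Checking slices rather than a single aggregate number is a design choice: a rare but critical condition can be hidden by a large, easy majority, and a gate like this surfaces it during development instead of during operation.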

Lakera’s MLTest provides the quality gate which automatically surfaces vulnerabilities as part of existing processes.

Putting systematic testing at the core of your development processes is a great way to build better AI products faster. Our ML testing series provides a few simple strategies that any development team can use to prevent failure during operation.

Lakera’s MLTest equips every computer vision development team with a world-class testing infrastructure. Our product finds critical vulnerabilities and flaws in computer vision systems automatically, as part of existing development processes and before they can impact operation. We want to enable every team, small to large, to ship AI products quickly and reliably. Get in touch to schedule a demo!
