Test machine learning the right way: Fuzz testing.

In this installment of our ML testing series, we discuss fuzz testing: what it is, how it works, and how it can be used to stress test machine learning systems to gain confidence before going to production.

Lakera Team
October 20, 2023

We can now add another testing technique to our Swiss Army knife for ML testing: Fuzz testing. Let’s start by defining what fuzz testing is and by providing a quick overview of the common approaches. Then, we’ll look at how this method can be used to efficiently stress test your ML system and to help uncover robustness issues during development.

What is fuzz testing?

Fuzzing is a well-known technique extensively used in traditional software systems. Wikipedia defines it as follows:

“Fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program.”

Software bugs often appear when problematic inputs are presented to the system. If the logic behind the computer program was not written with these problematic inputs in mind, the software component can crash or behave in undesired ways. Fuzz testing looks for problematic inputs by following an automatic input generation strategy. Thus, problematic input data can be caught early, and the overall system becomes more reliable.

This idea extends naturally to computer vision and other ML systems. In particular for computer vision, the input space is extremely large. Problematic input data are likely to exist, but the relevant ones can be hard to find.


We’ll begin by explaining how fuzz testing works and by providing a few examples from research. Then, we’ll look in more detail at how it can be used to test computer vision systems in practice.

How can we smartly generate new inputs?

If the core component of fuzz testing is finding problematic inputs, the key question becomes: how do we actually find these new inputs, the particular ones that cause machine learning models to misbehave?

The first idea is, of course, to use a fully random search. We could generate inputs by modifying pixel values randomly until something breaks. However, this has several shortcomings. First, it is very inefficient, because finding relevant failure cases can be difficult and expensive (see the sketch after this list of shortcomings).

Here, adding synthetic fog to the image leads to a significant change in how an on-site AI system, such as an inspection robot, identifies maintenance requirements.

Secondly, what does ‘until something breaks’ mean? Machine learning testing is difficult: ML systems often fail silently. Relevant bugs are subtle, challenging to find, and don’t usually cause a program to ‘crash’. The ML system may instead decide that a dog in an image is now a cat, without raising any alarms.


Finally, the input may quickly become semantically meaningless in the context of the application, going beyond where the system is expected to perform well. Even so, it can still be interesting to test such inputs.

Why? For example, a camera might break, or a bad connection might produce random-looking images. In such cases, you still want the system to fail gracefully.
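
To make the fully random baseline concrete, here is a minimal sketch, assuming a hypothetical `model` callable that maps an image with values in [0, 1] to a class id, and a known-good `seed_image`:

```python
import numpy as np

def random_fuzz(model, seed_image, true_label, trials=10_000, eps=0.1):
    """Naive random fuzzing: perturb pixels at random until the label flips.

    `model` is any callable mapping an image (H, W, C) in [0, 1] to a class id;
    `seed_image` is a known-good input and `true_label` its expected class.
    """
    rng = np.random.default_rng(seed=0)
    for _ in range(trials):
        noise = rng.uniform(-eps, eps, size=seed_image.shape)
        candidate = np.clip(seed_image + noise, 0.0, 1.0)
        if model(candidate) != true_label:
            return candidate  # a 'breaking' input was found
    return None  # nothing found: random search is often this inefficient
```

Even after thousands of trials, a loop like this may find nothing relevant, which is exactly the inefficiency described above.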

Interlude: How do you know if your system is failing?

Before we continue, let’s take a look at how to evaluate whether the machine learning system is actually failing. The concept of metamorphic relations, developed in the previous post of this ML testing series, becomes very useful for this purpose. These are variations of the input image that change a known label in a predictable way. For example, the output of a classification problem often should not depend on how the image is rotated: a rotated dog is still a dog.

This notion can be used as a tool for fuzz testing. As long as the operations performed to modify an input lead to well-understood changes in the label, we can establish whether a new input ‘breaks’ the system.
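
As a minimal sketch, again assuming a hypothetical `model` callable, a rotation-based metamorphic oracle could look like this:

```python
import numpy as np

def rotation_metamorphic_test(model, image, k_values=(1, 2, 3)):
    """Metamorphic oracle based on rotation invariance: a rotated dog is still a dog.

    Rotates the image by multiples of 90 degrees (label-preserving for
    rotation-invariant tasks) and flags any change in the prediction.
    """
    reference = model(image)
    failures = []
    for k in k_values:
        rotated = np.rot90(image, k=k)  # label-preserving transformation
        if model(rotated) != reference:
            failures.append(k)  # the system 'broke' on this rotation
    return failures  # an empty list means the metamorphic relation holds
```

The oracle never needs a human-provided label for the mutated input; the known relation between original and mutated image does the work.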

A few examples of fuzz testing techniques

To apply fuzz testing successfully, we need to be more efficient than a fully random search. Most approaches are based on the idea of mutating an initial input based on a specified set of rules and operations.

DLFuzz, for example, builds on the idea that problematic inputs tend to appear in regions of the trained network with low neuron coverage. New images that activate a large subset of neurons that are rarely activated during training may lead to unexpected model predictions. DLFuzz therefore modifies input images to activate these rarely visited neurons, in order to trigger such failures.

Another approach, DeepHunter, applies transformations chosen at random from a set of label-preserving mutations. Because these mutations leave the label unchanged, whether a newly generated fuzzy input decreases performance can be evaluated against the original label. For example, if we rotate an image at random and expect the label to remain the same, the system’s output on the rotated image can be checked against the original label to decide whether there is a failure.
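
As an illustration in the spirit of DeepHunter, rather than its actual implementation, a mutation-based fuzzing loop over label-preserving transformations might look like this (again with a hypothetical `model` callable):

```python
import random
import numpy as np

def label_preserving_mutations():
    """A small pool of mutations assumed to leave the ground-truth label unchanged."""
    return [
        lambda img: np.rot90(img, k=random.choice([1, 2, 3])),          # rotation
        lambda img: np.fliplr(img),                                     # horizontal flip
        lambda img: np.clip(img * random.uniform(0.7, 1.3), 0.0, 1.0),  # brightness
    ]

def mutation_fuzz(model, seeds, labels, rounds=1_000):
    """Mutate seed images and check predictions against the original labels."""
    mutations = label_preserving_mutations()
    failures = []
    for _ in range(rounds):
        i = random.randrange(len(seeds))
        mutated = random.choice(mutations)(seeds[i])
        if model(mutated) != labels[i]:
            failures.append((i, mutated))  # label no longer predicted: a failure
    return failures
```

Guided approaches like DeepHunter additionally prioritize which seeds to mutate, for example based on coverage feedback, rather than picking them uniformly at random as above.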

How can fuzz testing be used to test ML systems?

Fuzz testing becomes an essential component to add to our testing suites. It allows us to stress test the system and get a clearer idea of how it will perform in practice by leveraging a much larger, synthetic dataset. Fuzzy stress testing gives us access to images that are likely to arise in practice but are not in the original dataset.


Example: Surveying an energy site.

Let’s say that you were building a machine learning system for a robot designed to survey a renewable energy site. Data availability often becomes a core challenge when building such systems. We may have enough general images taken on rainy days, and enough images of wind turbines taken on sunny days, but images of turbines on rainy days may be scarce.

For complex real-world systems, it’s often impossible to have sufficient coverage of all scenarios that arise in practice. As such, data augmentation is key and a standard go-to technique for anyone building computer vision systems, or machine learning models in general.

Fuzz testing can stress test the system within its operational environment to find combinations where the system performs weakly. It can also help to find where further augmentation or data collection should be done. For example, mutations could include:

– Adding random synthetic fog to the image at intensity x.

– Adding random glare of intensity y at position p.

Input images of turbines could be fuzzily mutated using these transformations to look for failure cases, as sketched below. This process is very powerful, since it can provide a large number of images that may arise in practice but are not present in the existing dataset. More inspiration on how this can be done through a guided approach can be found in DeepHunter.
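
As a minimal sketch of such a mutation loop, assuming images as float arrays in [0, 1] and a hypothetical `model` callable, synthetic fog and glare could be injected as follows:

```python
import numpy as np

def add_fog(image, x):
    """Blend the image toward white; x in [0, 1] is the fog intensity."""
    return np.clip((1.0 - x) * image + x, 0.0, 1.0)

def add_glare(image, y, p, radius=20):
    """Add a bright circular spot of intensity y centered at pixel p = (row, col)."""
    h, w = image.shape[:2]
    rows, cols = np.ogrid[:h, :w]
    mask = (rows - p[0]) ** 2 + (cols - p[1]) ** 2 <= radius ** 2
    out = image.copy()
    out[mask] = np.clip(out[mask] + y, 0.0, 1.0)
    return out

def fuzz_weather(model, image, label, trials=100, seed=0):
    """Sample random fog/glare parameters and report failing combinations."""
    rng = np.random.default_rng(seed)
    failures = []
    for _ in range(trials):
        x, y = rng.uniform(0.0, 0.8), rng.uniform(0.2, 1.0)
        p = (int(rng.integers(image.shape[0])), int(rng.integers(image.shape[1])))
        mutated = add_glare(add_fog(image, x), y, p)
        if model(mutated) != label:
            failures.append((x, y, p))  # a weak spot in the operational envelope
    return failures
```

Each failing combination of fog intensity, glare intensity, and position points to a region of the operational environment that deserves further augmentation or data collection.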

By running such tests repeatedly throughout the development process, teams can ensure that their systems work as specified. They can also find problematic cases that need further data augmentation and data collection.

Prepare for the unexpected with fuzzing.

Fuzz testing can also help to answer the broader question:

– Does my system perform well when presented with ‘unexpected’ images that lie outside of what could be considered its operational environment?

Development teams should ensure that their ML models fail gracefully in such cases.

The complexity of a typical neural network makes it vulnerable to a variety of failures. This includes failures on images that are close in pixel space to images for which the trained model performs well. This has been extensively researched in the field of adversarial examples.

From ‘Explaining and Harnessing Adversarial Examples’: network performance changes drastically after adding small noise.

Trivial sanity checks, such as testing how machine learning models perform on partially blacked-out images and other such transforms, are essential. Several open-source libraries, such as Albumentations, provide a wide range of such transforms. This kind of testing should, therefore, also be added to the complete testing suite of a critical ML component.
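
As a minimal sketch of such a sanity check, here is a blackout test built on Albumentations’ CoarseDropout transform (argument names follow the classic API and may differ in newer releases; the `model` callable is again a hypothetical stand-in):

```python
import albumentations as A

# Randomly black out rectangular patches of the image.
blackout = A.Compose([
    A.CoarseDropout(max_holes=4, max_height=32, max_width=32,
                    fill_value=0, p=1.0),
])

def blackout_sanity_check(model, image, label, trials=50):
    """Estimate how often partial occlusion flips the model's prediction."""
    flips = 0
    for _ in range(trials):
        occluded = blackout(image=image)["image"]  # apply the random dropout
        if model(occluded) != label:
            flips += 1
    return flips / trials  # fraction of occluded variants that fail
```

A high flip rate on such trivially occluded images is a strong signal that the model needs more robustness work before deployment.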

Fuzz testing is an interesting testing method that provides a principled way to stress test your ML system. This is key to understanding whether 1) your system performs well when it should, and 2) your system fails gracefully when presented with challenging inputs. The introduced testing methods allow you to look for the blind spots of your system during development and prevent them from surfacing during operation.

Get started with Lakera today.

Get in touch with mateo@lakera.ai to find out more about what Lakera can do for your team, or get started right away.
