Test machine learning the right way: Metamorphic relations.
As part of our series on machine learning testing, we are looking at metamorphic relations. We’ll discuss what they are, how they are used in traditional software testing, what role they play in ML more broadly and lastly, how to use them to write great tests for your machine learning application.
In this part of our machine learning testing series, we’ll look at metamorphic relations, a technique used to multiply your available data and labels, and discuss how they can be used for machine learning model evaluation. Metamorphic relations help extend the test coverage of your machine learning (ML) system beyond what can be achieved through normal data collection. Earlier articles in this series covered other aspects of evaluating machine learning models, such as testing for data bugs and regression testing.
The test oracle problem.
The test oracle problem is not specific to ML, and it is well known from traditional software [1]. It refers to determining the correct test output for a given test input.
Let’s look at an example from medical imaging. Imagine that you are building an ML system that is used as a diagnostic tool for cancer histopathology. The input is a set of images of histopathology samples; the output is a diagnosis: cancer or no cancer.
The test oracle problem presents itself because you have some input image data, but you don’t know the label (cancer/no cancer).
This is solved by having the image annotated. You can send these images to histopathologists, who can play the role of the test oracle by adding a label to each sample image. The problem is that these images are scarce to begin with and the ones that you do have will be expensive to annotate.
Thorough machine learning evaluation has to cover a combinatorial number of scenarios, which requires more data and labels than can realistically be collected. For example, the number of relevant scenarios quickly becomes too large once you consider variations in the color of the image, the type of microscope used to take the image, the zoom level, etc. As a result, only some of the relevant conditions can be tested, leading to insufficient test coverage.
In come metamorphic relations. Take the image that you already have and rotate it. You could then send this rotated image to be re-annotated to solve the test oracle problem. But because you know that the label for the rotated image is still cancer, you don’t need to.
That’s how metamorphic relations can contribute to solving the oracle problem.
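As a minimal sketch of this idea, the snippet below turns one annotated image into four labeled test samples by rotating it in 90-degree steps and copying the label. The image is a nested list of pixel values to keep the example self-contained, and the names `rotate_90` and `multiply_by_rotation` are illustrative assumptions, not part of any library.

```python
from typing import List, Tuple

Image = List[List[float]]  # hypothetical stand-in for a real image array


def rotate_90(image: Image) -> Image:
    """Rotate an image clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def multiply_by_rotation(image: Image, label: str) -> List[Tuple[Image, str]]:
    """Return the image at 0, 90, 180 and 270 degrees, each with the original label.

    Assumes the label is invariant to rotation, as in the histopathology example.
    """
    samples = []
    current = image
    for _ in range(4):
        samples.append((current, label))
        current = rotate_90(current)
    return samples


annotated = [[0.1, 0.9], [0.4, 0.6]]  # toy 2x2 "image"
samples = multiply_by_rotation(annotated, "cancer")
print(len(samples))  # one annotation now covers four test inputs
```

The expensive step, the histopathologist's annotation, happens once; the rotated copies inherit the label for free.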
A metamorphic relation “refers to the relationship between the software input change and output change.”
Take a square function as an example. An easily tested metamorphic relation (ignoring numerical issues) is:
f(-input) = f(input)
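The relation above can be checked without knowing a single expected output value. The following is a minimal sketch under that assumption; `square`, `holds_symmetry`, and `buggy_square` are hypothetical names chosen for illustration.

```python
import math
import random


def square(x: float) -> float:
    return x * x


def holds_symmetry(f, inputs, rel_tol: float = 1e-12) -> bool:
    """Check the metamorphic relation f(-x) == f(x) on every input."""
    return all(math.isclose(f(-x), f(x), rel_tol=rel_tol) for x in inputs)


random.seed(0)  # fixed seed so the check is reproducible
inputs = [random.uniform(-1000.0, 1000.0) for _ in range(1000)]

# No expected outputs are needed: the relation itself acts as the oracle.
assert holds_symmetry(square, inputs)


# A hypothetical buggy implementation that breaks the sign symmetry is caught.
def buggy_square(x: float) -> float:
    return x * abs(x)


assert not holds_symmetry(buggy_square, inputs)
```

Note that the check says nothing about whether `square(3.0)` is actually 9; it only verifies how the output must behave when the input changes.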
This is a powerful concept that can be applied to ML as well! Two classes of metamorphic relations that are well known in computer vision are:
a) Image augmentations (e.g., rotation) that affect the label in a known way and act as a data/label multiplier;
b) Using temporal relations in video sequences (e.g., two successive image frames in a 30 Hz video sequence are likely similar) that act as supervisory signals.
Both have been applied in the context of (self-)supervised learning to create more robust ML models [3].
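To make the temporal relation in b) concrete, here is a small sketch that reads it as a consistency check: scores for successive frames should not jump by more than some tolerance. The function name `temporal_violations`, the threshold `max_jump`, and the example scores are all assumptions made for illustration.

```python
from typing import List, Tuple


def temporal_violations(scores: List[float], max_jump: float = 0.3) -> List[Tuple[int, int]]:
    """Return index pairs of successive frames whose scores differ by more than max_jump."""
    return [
        (i, i + 1)
        for i in range(len(scores) - 1)
        if abs(scores[i + 1] - scores[i]) > max_jump
    ]


# Hypothetical per-frame confidence scores from a model on a short clip.
frame_scores = [0.91, 0.90, 0.15, 0.89, 0.88]
print(temporal_violations(frame_scores))  # [(1, 2), (2, 3)]
```

No frame needs a label here: the dip at frame 2 is flagged purely because it contradicts its neighbors.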
How can we leverage this concept for model testing in machine learning?
Example: Using metamorphic relations for medical image testing.
We illustrate the use of metamorphic relations by looking at how to test the machine learning model from our histopathology example. We can make use of metamorphic relations to write model unit tests and increase the test coverage in our ML testing suites.
We’d certainly expect this ML system to work if the input image is rotated by 180 degrees. Shifts in the color intensity of the image should also not change the system output. Neither should slightly out-of-focus samples.
These problem insights or, in this context, metamorphic relations can be used to create clear test specifications and to build these model unit tests. Not only does this multiply your available test data but it also ensures that your ML model behaves according to the specifications via machine learning unit testing.
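A model unit test built from these relations could look like the sketch below. It is written against a toy stand-in model and nested-list images so that it runs on its own; `toy_predict`, `invariance_failures`, and the transform helpers are hypothetical names, and a real test suite would call the actual model and an image library instead.

```python
from typing import Callable, List

Image = List[List[float]]  # pixel intensities in [0, 1]; stand-in for a real image array


def rotate_180(image: Image) -> Image:
    return [row[::-1] for row in image[::-1]]


def shift_intensity(image: Image, delta: float = 0.05) -> Image:
    return [[min(1.0, max(0.0, p + delta)) for p in row] for row in image]


def invariance_failures(
    predict: Callable[[Image], int],
    images: List[Image],
    transform: Callable[[Image], Image],
) -> List[int]:
    """Return indices of images whose prediction changes under the transform."""
    return [i for i, img in enumerate(images) if predict(transform(img)) != predict(img)]


def toy_predict(image: Image) -> int:
    """Hypothetical model: 1 if the mean intensity exceeds 0.5, else 0."""
    pixels = [p for row in image for p in row]
    return int(sum(pixels) / len(pixels) > 0.5)


test_images = [
    [[0.1, 0.2], [0.3, 0.2]],
    [[0.9, 0.8], [0.7, 0.8]],
]

# Each metamorphic relation from the text becomes one unit test.
assert invariance_failures(toy_predict, test_images, rotate_180) == []
assert invariance_failures(toy_predict, test_images, shift_intensity) == []
```

The out-of-focus relation would follow the same pattern with a blur transform. Returning the failing indices rather than a bare pass/fail makes it easier to inspect which samples break the specification.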
So, why bother augmenting your test data if you’re already adding these augmentations to your training set? Truth be told, adding augmentations to your training set does not guarantee the desired behavior of your trained model. We observed this on state-of-the-art object detection models, which were not robust to augmentations used during training. Testing your model for the desired behavior, by contrast, gives you confidence for certain inputs and will likely uncover and prevent many ML model bugs.
Not convinced? Similar metamorphic relations were applied to the testing of neural networks for autonomous driving by Tian et al. in DeepTest [4]. They found thousands of erroneous (and sometimes grave) behaviors in state-of-the-art deep neural networks for self-driving cars.
To summarize, metamorphic relations are a great way to thoroughly test your ML system. In addition to regression tests, they should not be forgotten in your development cycles when testing ML models. Our follow-up article on fuzz-testing provides illustrations on how to leverage the concept of metamorphic relations to stress-test ML models.