Why we need better data management for mission-critical AI
In order to enable mission-critical ML applications, we need to create appropriate guidance for data management, both at the formal regulatory level and in our everyday best practices.
Every day, we rely on data to power complex systems and help solve difficult problems. Data is so ingrained in our lives that we automatically trust its usefulness and accuracy. But we take for granted that data itself isn’t autonomous. It’s managed by people and prone to human bias and errors. Without proper oversight, data inaccuracy and misuse can become rampant and cause serious problems. As we move towards increasingly complex interactions between data and algorithms in machine learning (ML) systems, the need for the correct management of data becomes clearer than ever. In order to unlock the full potential of ML applications, we need to create appropriate guidance for data management, both at the formal regulatory level and in our everyday best practices.
“Now that the models have advanced to a certain point, we got to make the data work as well.”
— Andrew Ng
Even for traditional software programs, in which data is just an input processed in an explicitly programmed way, data needs to be considered carefully and placed on an equal footing with software, hardware, and human factors. Failure to do so has already led to dramatic consequences. Most recently, Public Health England mismanaged data in a Microsoft Excel file [1] and silently deleted COVID-19 test results, causing 16,000 cases to go unreported for days. As a consequence, the contacts of these positive cases were not notified or asked to self-isolate in time.
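The PHE failure was a silent truncation: the legacy .xls worksheet format caps out at 65,536 rows, and records beyond the cap were simply dropped without any error. A minimal defensive pattern is to fail loudly whenever a file approaches a known format limit. The sketch below is illustrative only (the helper and its threshold are assumptions, not PHE's actual pipeline):

```python
import csv

# Row cap of the legacy .xls worksheet format (2**16 rows).
XLS_ROW_LIMIT = 65_536

def safe_ingest(path: str) -> list[dict]:
    """Load CSV records, refusing to continue if silent truncation is a risk.

    Illustrative sketch: if the record count approaches the legacy Excel
    row cap, raise an error instead of letting a downstream tool drop the
    overflow rows without warning.
    """
    with open(path, newline="") as f:
        records = list(csv.DictReader(f))
    if len(records) >= XLS_ROW_LIMIT - 1:  # -1 leaves room for a header row
        raise ValueError(
            f"{len(records)} records: at or beyond the legacy .xls row cap; "
            "a downstream tool could truncate this file silently."
        )
    return records
```

The point is not the specific threshold but the habit: every hand-off between tools is a place where record counts should be reconciled, not assumed.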
Managing massive and complex amounts of data presents a unique set of challenges that we are still addressing today, and the Data Safety Guidance catalogues many of the reasons that make data management so difficult [2].
The resulting mishandling of data can cause serious accidents in mission-critical systems such as healthcare and autonomous driving. The numerous challenges associated with data help us understand why Ken Thompson famously declared:
“I wanted to separate data from programs because data and instructions are very different.”
— Ken Thompson
Given the unique properties and challenges of large-scale data management, it becomes crucial to develop systematic methodologies and infrastructure that address it separately from software and hardware.
We have seen that data mismanagement can create serious issues, even in traditional software systems where it is handled deterministically. Now, think about how much more challenging the management of data is in ML, where computers learn program behavior from data. The line between data and programs becomes blurrier, and it becomes more important to account for all the unique properties of data in the first place.
Biased datasets, for instance, have had serious ramifications in critical applications. In the past, courts in the United States have used software to forecast potential recidivism risk based on data from the defendants and from criminal records. An investigation [3] found that while the tool correctly predicts recidivism 61% of the time, Black individuals were almost twice as likely as white individuals to be labeled as high-risk without actually going on to re-offend. Conversely, white individuals were much more likely than Black individuals to be labeled as low-risk but end up re-offending.
The software’s creators didn’t fully take into account the effects of dataset bias on predictions, leading to unacceptable social consequences. These effects must be explicitly tested to prevent catastrophes such as wrongful sentencing. When we are talking about criminal justice, healthcare, or other critical applications, good data governance is an urgent priority.
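The disparity that the investigation uncovered is exactly the kind of property that can be tested automatically before deployment. The sketch below uses toy records (an illustrative assumption, not the real case data) to compare false-positive rates, i.e. the share of non-reoffenders who were nonetheless labeled high-risk, across groups:

```python
def false_positive_rate(records: list[dict], group: str) -> float:
    """Share of non-reoffenders in `group` who were labeled high-risk."""
    negatives = [
        r for r in records if r["group"] == group and not r["reoffended"]
    ]
    if not negatives:
        return 0.0
    flagged = sum(1 for r in negatives if r["predicted_high_risk"])
    return flagged / len(negatives)

# Toy records for illustration: the prediction the tool made, and whether
# the person actually went on to re-offend.
toy = [
    {"group": "A", "predicted_high_risk": True,  "reoffended": False},
    {"group": "A", "predicted_high_risk": True,  "reoffended": False},
    {"group": "A", "predicted_high_risk": False, "reoffended": False},
    {"group": "B", "predicted_high_risk": True,  "reoffended": False},
    {"group": "B", "predicted_high_risk": False, "reoffended": False},
    {"group": "B", "predicted_high_risk": False, "reoffended": False},
]

fpr_a = false_positive_rate(toy, "A")  # 2 of 3 non-reoffenders flagged
fpr_b = false_positive_rate(toy, "B")  # 1 of 3 non-reoffenders flagged
```

A gap between `fpr_a` and `fpr_b` of the magnitude ProPublica reported would be a release-blocking finding under any serious data governance regime, not something discovered after sentencing decisions have already been made.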
To address the challenges associated with data in AI systems, the EU included “data and data governance” in its latest AI package [4]. In Article 10 (page 48), they require that:
“Training, validation and testing data sets shall be relevant, representative, free of errors and complete. They shall have the appropriate statistical properties, including, where applicable, as regards the persons or groups of persons on which the high-risk AI system is intended to be used.”
This is a good start. However, such data guidance will have to become more concrete – and quickly. We need a more formal definition of the data properties that are needed for mission-critical systems to operate safely. A good starting point for analyzing the quality of data is the set of data properties proposed in the Data Safety Guidance [2].
These properties are not industry- or application-specific and can be used to establish which aspects of the data need to be investigated and guaranteed so that the system can operate safely. A failure to consider any of these properties could pose additional risks to the operation of the system. The authors note that the list is not exhaustive. Instead, it needs to be carefully adapted for specific applications.
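To make this concrete, two of the properties named in the Article 10 quote above, "complete" and "representative", can be turned into automated checks on a training set. The sketch below is a minimal illustration; the field names, reference shares, and tolerance are assumptions, not regulatory requirements:

```python
def check_complete(rows: list[dict], required_fields: list[str]) -> list:
    """Return (row index, field) pairs where a required value is missing.

    An empty result means the completeness check passed.
    """
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field in required_fields
        if row.get(field) in (None, "")
    ]

def check_representative(
    rows: list[dict],
    field: str,
    reference_shares: dict[str, float],
    tolerance: float = 0.05,  # illustrative tolerance, not a standard
) -> dict[str, float]:
    """Return groups whose share in the data deviates from a reference.

    Compares the proportion of each group in `rows` against the expected
    share (e.g. from census data) and flags deviations beyond `tolerance`.
    """
    counts: dict[str, int] = {}
    for row in rows:
        counts[row[field]] = counts.get(row[field], 0) + 1
    total = len(rows)
    deviations = {
        group: abs(counts.get(group, 0) / total - share)
        for group, share in reference_shares.items()
    }
    return {g: d for g, d in deviations.items() if d > tolerance}
```

Checks like these can run in continuous integration, so that a dataset revision that silently drops a field or skews a group's representation fails the build instead of reaching production.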
This level of data oversight needs to become a core part of the ML development process. We need it to drive future requirements and serve as compliance artifacts for high-risk applications. It’s our responsibility as a community to implement this change: regulators, startups, industry players, and researchers all have to work together to build a more principled discipline around data management. Only then can we obtain an in-depth understanding of data during both development and operation, and reliably assess the safety of mission-critical applications. If we don’t take these steps, the number of AI accidents caused by mismanaged data or noncompliant data usage will continue to rise. Unlocking the full potential of ML software in applications with low tolerance for failure requires us to take action now.
[1] Excel: Why using Microsoft's tool caused Covid-19 results to be lost, BBC, 2020
[2] Data Safety Guidance (Version 3.3), SCSC Data Safety Initiative Working Group, 2021
[3] Machine Bias, ProPublica, 2016
[4] Artificial Intelligence Act, European Commission, 2021