
Why we need better data management for mission-critical AI

In order to enable mission-critical ML applications, we need to create appropriate guidance for data management, both at the formal regulatory level and in our everyday best practices.

Mateo Rojas-Carulla
December 4, 2023

Every day, we rely on data to power complex systems and help solve difficult problems. Data is so ingrained in our lives that we automatically trust its usefulness and accuracy. But it is easy to forget that data isn’t autonomous: it is managed by people and prone to human bias and error. Without proper oversight, data inaccuracy and misuse can become rampant and cause serious problems. As we move towards increasingly complex interactions between data and algorithms in machine learning (ML) systems, the need for correct data management becomes clearer than ever. To unlock the full potential of ML applications, we need to create appropriate guidance for data management, both at the formal regulatory level and in our everyday best practices.

“Now that the models have advanced to a certain point, we got to make the data work as well.”

— Andrew Ng

Even for traditional software programs, in which data is just an input processed in an explicitly programmed way, data needs to be considered carefully and placed on an equal footing with software, hardware, and human factors. Failure to do so has already had dramatic consequences. Most recently, Public Health England mismanaged data in a Microsoft Excel file [1] and silently deleted COVID-19 test results, causing 16,000 cases to go unreported for days. As a consequence, the contacts of these positive cases were not notified or asked to self-isolate on time.

Why is data so difficult to manage?

Managing massive and complex amounts of data presents a unique set of challenges that we are still addressing today. Here are a few of the reasons that make data management so challenging [2] (a short code sketch illustrating how some of these properties can be checked follows the list):

  1. Data is fluid: while hardware and software change infrequently in production, data is constantly evolving. This makes it difficult to revisit safety cases.
  2. Data is reused: the same data is often reused in a different system or at a different time, making the downstream impact of data changes challenging to forecast.
  3. Data is transformed: data is filtered, aggregated, and changed, which can cause information to be lost.
  4. Data is incomplete: often not all necessary data is available, so systems are built with incomplete data.
  5. Data is biased: there is systematic inaccuracy in data due to the way it was created, collected, manipulated, presented, or interpreted.
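
To make these properties concrete, here is a minimal sketch of the kind of automated checks a team might run on a tabular dataset before training. It is illustrative only: the column names, thresholds, and the choice of pandas are assumptions, not something prescribed by the Data Safety Guidance.

```python
import pandas as pd

def check_dataset(df: pd.DataFrame, reference: pd.DataFrame) -> list[str]:
    """Run a few illustrative data-quality checks and return human-readable findings."""
    findings = []

    # Incompleteness: flag columns with a high share of missing values.
    missing = df.isna().mean()
    for col, frac in missing[missing > 0.05].items():
        findings.append(f"{col}: {frac:.0%} of values are missing")

    # Fluidity: compare numeric columns against a reference snapshot to spot drift.
    for col in df.select_dtypes("number").columns:
        if col in reference.columns:
            drift = abs(df[col].mean() - reference[col].mean())
            if drift > 0.1 * (abs(reference[col].mean()) + 1e-9):
                findings.append(f"{col}: mean drifted by {drift:.2f} vs. reference")

    # Bias (a crude proxy): flag groups that are barely represented in the data.
    if "group" in df.columns:
        shares = df["group"].value_counts(normalize=True)
        for group, share in shares[shares < 0.10].items():
            findings.append(f"group '{group}': only {share:.0%} of rows")

    return findings

# Hypothetical usage with made-up data.
reference = pd.DataFrame({"age": [30, 40, 50], "group": ["a", "b", "a"]})
current = pd.DataFrame({"age": [70.0, 80.0, None], "group": ["a", "a", "a"]})
for finding in check_dataset(current, reference):
    print(finding)
```

In practice, teams often rely on dedicated data-validation tooling rather than hand-rolled checks, but the point stands: each property in the list above can be turned into an explicit, repeatable test.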

The resulting mishandling of data can cause serious accidents in mission-critical systems such as healthcare and autonomous driving. The numerous challenges associated with data help us understand why Ken Thompson famously declared:

“I wanted to separate data from programs because data and instructions are very different.”

— Ken Thompson

Given the unique properties and challenges of large-scale data management, it becomes crucial to develop systematic methodologies and infrastructure that address data separately from software and hardware.


The consequences of data problems in ML systems.

We have seen that data mismanagement can create serious issues, even in traditional software systems where it is handled deterministically. Now, think about how much more challenging the management of data is in ML, where computers learn program behavior from data. The line between data and programs becomes blurrier, and it becomes more important to account for all the unique properties of data in the first place.

Biased datasets, for instance, have had serious ramifications in critical applications. In the past, courts in the United States have used software to forecast potential recidivism risk based on data from the defendants and from criminal records. An investigation [3] found that while the tool correctly predicts recidivism 61% of the time, Black individuals were almost twice as likely as white individuals to be labeled as high-risk without actually going on to re-offend. Conversely, white individuals were much more likely than Black individuals to be labeled as low-risk but end up re-offending.

The software’s creators didn’t fully take into account the effects of dataset bias on predictions, leading to unacceptable social consequences. These effects must be explicitly tested to prevent catastrophes such as wrongful sentencing. When we are talking about criminal justice, healthcare, or other critical applications, good data governance is an urgent priority.
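
As an illustration of what testing for these effects can look like, here is a minimal sketch that compares false positive rates across groups on held-out predictions. The column names, the data, and the tolerance are hypothetical; they are not taken from the investigation cited above.

```python
import pandas as pd

def false_positive_rate_by_group(df: pd.DataFrame) -> pd.Series:
    """Share of people labeled high-risk among those who did not re-offend, per group."""
    did_not_reoffend = df[df["reoffended"] == 0]
    return did_not_reoffend.groupby("group")["predicted_high_risk"].mean()

# Hypothetical held-out predictions.
predictions = pd.DataFrame({
    "group":               ["a", "a", "a", "b", "b", "b"],
    "predicted_high_risk": [1,   0,   0,   1,   1,   1],
    "reoffended":          [0,   0,   0,   0,   1,   0],
})

rates = false_positive_rate_by_group(predictions)
print(rates)

gap = rates.max() - rates.min()
if gap > 0.2:  # hypothetical tolerance; must be set per application
    print(f"WARNING: false positive rate gap of {gap:.2f} between groups")
```

A check like this belongs in the release process itself, so a model with disparate error rates is caught before it reaches a courtroom, a clinic, or a road.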

How we govern data today.

To address the challenges associated with data in AI systems, the EU included “data and data governance” in its latest AI package [4]. Article 10 (page 48) requires that:

“Training, validation and testing data sets shall be relevant, representative, free of errors and complete. They shall have the appropriate statistical properties, including, where applicable, as regards the persons or groups of persons on which the high-risk AI system is intended to be used.”

The Data Safety Guidance (Table 5, page 28) proposes data properties that are not industry- or application-specific. [2]

This is a good start. However, such data guidance will have to become more concrete – and quickly. We need a more formal definition of the data properties that are needed for mission-critical systems to operate safely. A good starting point for analyzing the quality of data is the set of data properties proposed in the Data Safety Guidance [2].

These properties are not industry- or application-specific and can be used to establish which aspects of the data need to be investigated and guaranteed so that the system can operate safely. A failure to consider any of these properties could pose additional risks to the operation of the system. The authors note that the list is not exhaustive. Instead, it needs to be carefully adapted for specific applications.
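
One way to adapt such a property list to a concrete application is to encode each required property as a named, machine-checkable requirement, so the results can serve as evidence in a safety case. The sketch below is a hypothetical illustration of that idea; the property names and checks are assumptions, not taken from the guidance itself.

```python
from dataclasses import dataclass
from typing import Callable
import pandas as pd

@dataclass
class DataRequirement:
    """A named, testable requirement over a dataset, with a rationale for the audit trail."""
    name: str
    rationale: str
    check: Callable[[pd.DataFrame], bool]

# Hypothetical requirements for one specific application.
REQUIREMENTS = [
    DataRequirement(
        name="completeness",
        rationale="Records with missing fields make model behaviour unpredictable.",
        check=lambda df: float(df.isna().mean().max()) < 0.01,
    ),
    DataRequirement(
        name="representativeness",
        rationale="Every population group the system serves must appear in the data.",
        check=lambda df: df["group"].nunique() >= 2,
    ),
]

def audit(df: pd.DataFrame) -> dict[str, bool]:
    """Evaluate every requirement; the result can be stored as a compliance artifact."""
    return {req.name: bool(req.check(df)) for req in REQUIREMENTS}

# Hypothetical usage.
dataset = pd.DataFrame({"age": [30, 40, 50], "group": ["a", "b", "a"]})
print(audit(dataset))
```

Recording the rationale alongside each check keeps the requirement list reviewable by regulators and domain experts, not just by engineers.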

What we need to do next.

This level of data oversight needs to become a core part of the ML development process. We need it to drive future requirements and to produce compliance artifacts for high-risk applications. It’s our responsibility as a community to implement this change. Regulators, startups, industry players, and researchers all have to work together to build a more principled discipline around data management. This is the only way to obtain an in-depth understanding of data during both development and operation, and in turn to assess the safety of mission-critical applications. If we don’t take these steps, the number of AI accidents caused by mismanaged data or noncompliant data usage will continue to rise. Unlocking the full potential of ML software in applications with low tolerance for failure requires us to take action now.

[1] Excel: Why using Microsoft's tool caused Covid-19 results to be lost, BBC, 2020
[2] Data Safety Guidance (Version 3.3), SCSC Data Safety Initiative Working Group, 2021
[3] Machine Bias, ProPublica, 2016
[4] Artificial Intelligence Act, European Commission, 2021
