
Why we need better data management for mission-critical AI

In order to enable mission-critical ML applications, we need to create appropriate guidance for data management, both at the formal regulatory level and in our everyday best practices.

Mateo Rojas-Carulla
December 4, 2023

Every day, we rely on data to power complex systems and help solve difficult problems. Data is so ingrained in our lives that we automatically trust its usefulness and accuracy. But it is easy to forget that data isn’t autonomous: it is managed by people and prone to human bias and error. Without proper oversight, data inaccuracy and misuse can become rampant and cause serious problems. As we move towards increasingly complex interactions between data and algorithms in machine learning (ML) systems, the need for correct data management becomes clearer than ever. To unlock the full potential of ML applications, we need to create appropriate guidance for data management, both at the formal regulatory level and in our everyday best practices.

“Now that the models have advanced to a certain point, we got to make the data work as well.”

— Andrew Ng

Even for traditional software programs, in which data is just an input processed in an explicitly programmed way, data needs to be considered carefully and placed on an equal footing with software, hardware, and human factors. Failure to do so has already had dramatic consequences. Most recently, Public Health England mismanaged data in a Microsoft Excel file [1] and silently deleted COVID-19 test results, causing 16,000 cases to go unreported for days. As a consequence, the contacts of these positive cases were not notified or asked to self-isolate on time.

Why is data so difficult to manage?

Managing massive and complex amounts of data presents a unique set of challenges that we are still addressing today. Here are a few of the reasons that make data management so challenging [2] (a short code sketch illustrating how some of these properties can be checked follows the list):

  1. Data is fluid: while hardware and software change infrequently in production, data is constantly evolving. This makes it difficult to revisit safety cases.
  2. Data is reused: the same data is often reused in a different system or at a different time, making the downstream impact of data changes challenging to forecast.
  3. Data is transformed: data is filtered, aggregated, and changed, which can cause information to be lost.
  4. Data is incomplete: often not all necessary data is available, so systems are built with incomplete data.
  5. Data is biased: there is systematic inaccuracy in data due to the way it was created, collected, manipulated, presented, or interpreted.
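
To make these properties concrete, here is a minimal sketch of the kind of automated checks a team might run on a tabular dataset before training. It is illustrative only: the column names, thresholds, and the choice of pandas are assumptions, not something prescribed by the Data Safety Guidance.

```python
import pandas as pd

def check_dataset(df: pd.DataFrame, reference: pd.DataFrame) -> list[str]:
    """Run a few illustrative data-quality checks and return human-readable findings."""
    findings = []

    # Incompleteness: flag columns with a high share of missing values.
    missing = df.isna().mean()
    for col, frac in missing[missing > 0.05].items():
        findings.append(f"{col}: {frac:.0%} of values are missing")

    # Fluidity: compare numeric columns against a reference snapshot to spot drift.
    for col in df.select_dtypes("number").columns:
        if col in reference.columns:
            drift = abs(df[col].mean() - reference[col].mean())
            if drift > 0.1 * (abs(reference[col].mean()) + 1e-9):
                findings.append(f"{col}: mean drifted by {drift:.2f} vs. reference")

    # Bias (a crude proxy): flag groups that are barely represented in the data.
    if "group" in df.columns:
        shares = df["group"].value_counts(normalize=True)
        for group, share in shares[shares < 0.10].items():
            findings.append(f"group '{group}': only {share:.0%} of rows")

    return findings

# Hypothetical usage with made-up data.
reference = pd.DataFrame({"age": [30, 40, 50], "group": ["a", "b", "a"]})
current = pd.DataFrame({"age": [70.0, 80.0, None], "group": ["a", "a", "a"]})
for finding in check_dataset(current, reference):
    print(finding)
```

In practice, teams often rely on dedicated data-validation tooling rather than hand-rolled checks, but the point stands: each property in the list above can be turned into an explicit, repeatable test.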

The resulting mishandling of data can cause serious accidents in mission-critical systems such as healthcare and autonomous driving. The numerous challenges associated with data help us understand why Ken Thompson famously declared:

“I wanted to separate data from programs because data and instructions are very different.”

— Ken Thompson

Given the unique properties and challenges of large-scale data management, it becomes crucial to develop systematic methodologies and infrastructure that address data separately from software and hardware.


The consequences of data problems in ML systems.

We have seen that data mismanagement can create serious issues, even in traditional software systems where it is handled deterministically. Now, think about how much more challenging the management of data is in ML, where computers learn program behavior from data. The line between data and programs becomes blurrier, and it becomes more important to account for all the unique properties of data in the first place.

Biased datasets, for instance, have had serious ramifications in critical applications. In the past, courts in the United States have used software to forecast potential recidivism risk based on data from the defendants and from criminal records. An investigation [3] found that while the tool correctly predicts recidivism 61% of the time, Black individuals were almost twice as likely as white individuals to be labeled as high-risk without actually going on to re-offend. Conversely, white individuals were much more likely than Black individuals to be labeled as low-risk but end up re-offending.

The software’s creators didn’t fully take into account the effects of dataset bias on predictions, leading to unacceptable social consequences. These effects must be explicitly tested to prevent catastrophes such as wrongful sentencing. When we are talking about criminal justice, healthcare, or other critical applications, good data governance is an urgent priority.
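
As an illustration of what testing for these effects can look like, here is a minimal sketch that compares false positive rates across groups on held-out predictions. The column names, the data, and the tolerance are hypothetical; they are not taken from the investigation cited above.

```python
import pandas as pd

def false_positive_rate_by_group(df: pd.DataFrame) -> pd.Series:
    """Share of people labeled high-risk among those who did not re-offend, per group."""
    did_not_reoffend = df[df["reoffended"] == 0]
    return did_not_reoffend.groupby("group")["predicted_high_risk"].mean()

# Hypothetical held-out predictions.
predictions = pd.DataFrame({
    "group":               ["a", "a", "a", "b", "b", "b"],
    "predicted_high_risk": [1,   0,   0,   1,   1,   1],
    "reoffended":          [0,   0,   0,   0,   1,   0],
})

rates = false_positive_rate_by_group(predictions)
print(rates)

gap = rates.max() - rates.min()
if gap > 0.2:  # hypothetical tolerance; must be set per application
    print(f"WARNING: false positive rate gap of {gap:.2f} between groups")
```

A check like this belongs in the release process itself, so a model with disparate error rates is caught before it reaches a courtroom, a clinic, or a road.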

How we govern data today.

To address the challenges associated with data in AI systems, the EU included “data and data governance” in its latest AI package [4]. Article 10 (page 48) requires that:

“Training, validation and testing data sets shall be relevant, representative, free of errors and complete. They shall have the appropriate statistical properties, including, where applicable, as regards the persons or groups of persons on which the high-risk AI system is intended to be used.”

The Data Safety Guidance (Table 5, page 28) proposes data properties that are not industry- or application-specific. [2]

This is a good start. However, such data guidance will have to become more concrete – and quickly. We need a more formal definition of the data properties that are needed for mission-critical systems to operate safely. A good starting point for analyzing the quality of data is the set of data properties proposed in the Data Safety Guidance [2].

These properties are not industry- or application-specific and can be used to establish which aspects of the data need to be investigated and guaranteed so that the system can operate safely. A failure to consider any of these properties could pose additional risks to the operation of the system. The authors note that the list is not exhaustive. Instead, it needs to be carefully adapted for specific applications.
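
One way to adapt such a property list to a concrete application is to encode each required property as a named, machine-checkable requirement, so the results can serve as evidence in a safety case. The sketch below is a hypothetical illustration of that idea; the property names and checks are assumptions, not taken from the guidance itself.

```python
from dataclasses import dataclass
from typing import Callable
import pandas as pd

@dataclass
class DataRequirement:
    """A named, testable requirement over a dataset, with a rationale for the audit trail."""
    name: str
    rationale: str
    check: Callable[[pd.DataFrame], bool]

# Hypothetical requirements for one specific application.
REQUIREMENTS = [
    DataRequirement(
        name="completeness",
        rationale="Records with missing fields make model behaviour unpredictable.",
        check=lambda df: float(df.isna().mean().max()) < 0.01,
    ),
    DataRequirement(
        name="representativeness",
        rationale="Every population group the system serves must appear in the data.",
        check=lambda df: df["group"].nunique() >= 2,
    ),
]

def audit(df: pd.DataFrame) -> dict[str, bool]:
    """Evaluate every requirement; the result can be stored as a compliance artifact."""
    return {req.name: bool(req.check(df)) for req in REQUIREMENTS}

# Hypothetical usage.
dataset = pd.DataFrame({"age": [30, 40, 50], "group": ["a", "b", "a"]})
print(audit(dataset))
```

Recording the rationale alongside each check keeps the requirement list reviewable by regulators and domain experts, not just by engineers.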

What we need to do next.

This level of data oversight needs to become a core part of the ML development process. We need it to drive future requirements and to produce compliance artifacts for high-risk applications. It’s our responsibility as a community to implement this change. Regulators, startups, industry players, and researchers all have to work together to build a more principled discipline around data management. This is the only way to obtain an in-depth understanding of data during both development and operation, and in turn to assess the safety of mission-critical applications. If we don’t take these steps, the number of AI accidents caused by mismanaged data or noncompliant data usage will continue to rise. Unlocking the full potential of ML software in applications with low tolerance for failure requires us to take action now.

[1] Excel: Why using Microsoft's tool caused Covid-19 results to be lost, BBC, 2020
[2] Data Safety Guidance (Version 3.3), SCSC Data Safety Initiative Working Group, 2021
[3] Machine Bias, ProPublica, 2016
[4] Artificial Intelligence Act, European Commission, 2021
