We looked at real-world ML from multiple angles, all the way from how to best start ML projects to the challenges of scaling ML products and teams. And of course, we couldn’t miss out on the big developments in the world of foundation models.
You can access the recording of the webinar on Youtube: Live vs. ImageNet - Lakera Webinar and below. Continue reading for a summary and main takeaways.
Richard Shen (Wayve)
Product management at Wayve, a deep tech London startup building end-to-end systems for autonomous driving.
Tom Dyer (Genomics England)
ML engineer at Genomics England, working on the 100,000 genome project. He has previously built radiology AI systems that have been deployed across the NHS for real-world clinical care.
Peter Shulgin (Covariant)
Previously solutions at Covariant AI, working closely with customers across industries and building the deep quality processes required to integrate AI technologies at scale.
Paul Rubenstein (Google)
Worked at Apple on building on-device Computer Vision (CV) products at scale. He is now an applied researcher at Google.
Mateo Rojas Carulla (Lakera, Moderator)
Founder of Lakera, PhD in machine learning, witnessed the rapid growth of ML capabilities over the last decade. Interested in how to bring this safely into production systems. Previously worked at Google on Google Search and was exposed to learning how to deliver reliably at scale.
Changing the mentality away from ML101 is very challenging. The day-to-day challenges of evaluating CV systems for real-world applications are very different from the evaluation standards widely used in academia (e.g. evaluation on validation datasets). The applications promised by developments in AI technologies are very powerful, but there is a huge gap in making them useful and reliable for everyday use. There are major challenges in moving from POC to products on the streets, in hospitals and in people's homes.
đź’ˇ In academia the focus is on models. In industry, the focus is on the data.
đź’ˇ The world is constantly changing and you need to ensure that your datasets are representative of this dynamic world.
đź’ˇ Averaging accuracy across classes for a classification model is meaningless, there is too much variation and uncertainty.
đź’ˇ A wide variety of metrics is informative, there is no need to look at them all the time but have to possibility to drill down if needed.
💡 One metric is not enough. However, you need a metric that drives decisions, too many metrics can become overwhelming. You can create hierarchies of metrics, for example, a “safety critical metric” to help decision-making.
đź’ˇ Think from a product perspective, what can be translated in product playbooks and a north star. Is your product serving the use cases it is meant to serve?
Adopting a product-first mentality is fundamental. While academia often rewards “big ticket items” that lead to increased model performance, product work should focus much more on understanding what the user really needs and what product features need to be built. Being pragmatic and understanding what is “good enough” to prevent getting caught in endless iterations is also a key driver of success.
đź’ˇ In clinical ML, you have to build with many constraints. For example, a model can have adequate performance but is too slow once deployed.
It is absolutely a challenge that many companies are still trying to figure out. Within a company, open-ended research teams can create significant value, but should in general be kept separate from production and engineering teams. This is not new to ML and has been a valuable interplay across disciplines (oil and gas, industrial, etc).
đź’ˇ In academia problems are framed in terms of validation, test, and training sets, but in the "real world", you are not given a fixed training set. You have requirements and have to figure out the rest as you go. The test dataset is constantly changing. You want to adapt and inform your product as it grows and new requirements come in.
Converging on the right evaluation metrics early is key. After having a viable prototype, it’s important to be dynamic and get your system in the hands of real people, as you will be surprised by how they actually end up using it. The focus of development should be on probing your model and understanding patterns in the failures. For example, understand that your model fails in low-contrast images.
đź’ˇ Agreeing early on what success means is critical.
đź’ˇ When you deploy a model into the world, it will be used in ways you will not imagine beforehand.
đź’ˇ It is easy to fall into the trap of a continuous iteration loop, never quite getting there.
đź’ˇ How are you channeling information and learnings to scale further and further?
Having traceability, data tracking, and reproducibility is important. You want to be able to dissect everything no matter what scale you are at, going deep into the performance and the robustness of your systems. A lot of the manual complexity involved should be removed from the developer so they can focus on what matters.
đź’ˇ "Looking back, I am surprised at the amateurish level of research code, it was a culture shock for me to see how academia compares to production engineering".
In healthcare, you are growing teams as you scale the project. Initial systems are often built from small datasets, from single hospitals or regions. But is your model robust and stable across hospitals? For example, different hospitals have different conditions and different patient demographics. Scaling to larger use cases requires a dynamic team where responsibilities are constantly switching and new skills are required.
It is also important to have validation pipelines that go beyond aggregate performance, looking at deep performance and robustness metrics of the models. In particular, teams should define and test against expected behaviors in a granular way. This helps to monitor the models and how they will scale as you grow.
đź’ˇ The more you deploy, the more errors you encounter, the more you can test behaviors at finer granularities.
While not all industries are alike and have different issues (e.g. construction vs. medical), they share common challenges. Speed of deployment is often one. Another is to be able to explain very transparently to the end user how the system has been tested to build confidence.
An additional challenge arises when building "generic" systems for multiple use cases. For example, in a warehouse you may interpret a given object as a single box, but in another use case you may want to classify it as a pallet of 500 boxes. This is a challenge, how do you train a model with such a variety of use cases and expectations?
💡 The challenge with a “general” AI is that the same situations lead to different expected behaviors depending on the scenario.
Established industries have more process-driven tasks and quality assurance (QA), focusing on building processes and how to run successful pilots. They then formulate criteria of what works and bring it to scale.
💡 “My prediction is traditional companies will not work on the ML engineering side, they will focus on validation and testing.”
While there is a lot of excitement around foundation models, the problems they raise are not new to ML. What are some of these?
First of all, you are working with a model that you did not train yourself, you don’t have access to the data or know the properties the model has learned. As models become more powerful, the barriers for building complex systems come down. People will start building and realize that models are behaving strangely once deployed.
đź’ˇ There is an inherent bias in the data or foundation model which can cause issues for downstream models
💡 Evaluation is important, you don’t know the behaviors that the model has learned.
đź’ˇ Establishing robust evaluation suites for people that are not experts is key to success.
We had a great time during the webinar, it was great to hear from people who have faced the struggles of releasing ML for the real world and to learn from their experiences. Here are just a few of the key takeaways:
We're really looking forward to the next webinar! In the meantime, make sure you join Momentum, our community to discuss all things around AI safety".
Download this guide to delve into the most common LLM security risks and ways to mitigate them.
Get the first-of-its-kind report on how organizations are preparing for GenAI-specific threats.
Compare the EU AI Act and the White House’s AI Bill of Rights.
Get Lakera's AI Security Guide for an overview of threats and protection strategies.
Explore real-world LLM exploits, case studies, and mitigation strategies with Lakera.
Use our checklist to evaluate and select the best LLM security tools for your enterprise.
Discover risks and solutions with the Lakera LLM Security Playbook.
Discover risks and solutions with the Lakera LLM Security Playbook.
Subscribe to our newsletter to get the recent updates on Lakera product and other news in the AI LLM world. Be sure you’re on track!
Lakera Guard protects your LLM applications from cybersecurity risks with a single line of code. Get started in minutes. Become stronger every day.
Several people are typing about AI/ML security. 
Come join us and 1000+ others in a chat that’s thoroughly SFW.