Releasing Canica: A Text Dataset Viewer
Discover Canica, Lakera's interactive text dataset viewer that elevates data analysis with visual exploration tools like t-SNE and UMAP. Now available for the machine learning community under the MIT license.
At Lakera, we collect huge datasets of text that we use to train, test, and improve our models, but a model can only be as good as the data we train it on.
We developed canica, a text dataset viewer, to help us understand the quality of our datasets.
Canica takes a collection of texts and their corresponding embeddings and lets you explore them interactively as a 2D plot, produced by a dimensionality reduction algorithm such as t-distributed Stochastic Neighbor Embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAP).
This tool is already a valuable part of our workflows, and as part of our efforts to help the machine learning community, we are releasing canica under the MIT license.
The source code is available on GitHub, and the canica package has been published to the Python Package Index, so you can install it right now via pip.
You may be wondering why we call it canica. In Spanish, canica means marble (the toy). During the development of canica we did some experiments showing the t-SNE optimisation process in real-time. It looked like a group of marbles bouncing around, hence the name. Plus, I think canica has a nice ring to it (doesn’t it?).
Let’s take a well-known dataset of Amazon reviews, filter it down to 1000 reviews in English and German, and generate text embeddings for these reviews using OpenAI's embeddings API.
This plot shows two clusters of reviews, with English reviews in orange and German reviews in blue. They are mostly separated, but some reviews end up in the other language's region.
We could ask many questions about this dataset. For example, why are some points surrounded by points of the other colour?
Hovering over a point will give us more information about it. The point we’ve highlighted is a German review of a Nokia phone cover with two English reviews surrounding it.
“Not convenient to store your phone, pouch is too small.. The pouch is too small for an iphone and not really…”
“This bluetooth is very handy and easy to carry around. This bluetooth is very handy and easy to carry around. What I like about this device is that Selfie don't need to raise your hand far just to get a good picture…”
All these reviews are for phone accessories. Points share semantic similarities with nearby points, even though the larger overall clusters correspond to different languages, which means that our embeddings represent semantic information similarly across languages.
We often faced a challenge when using tools like matplotlib or plotly to plot t-SNE results: there was no easy solution that could help us relate the 2D space back to the original embedding space.
Dimensionality reduction is great for simplifying data, but it can leave out crucial context, which can make it harder to grasp the structure of your data.
One of the unique features of canica is that it lets us explore neighbourhoods in the original embedding space through the 2D plot. Clicking on a point highlights its nearest neighbours, and a slider adjusts how many neighbours are highlighted.
This gives us a better idea of how the dimensionality reduction works and which information we see in the resulting plot. Hovering over the highlighted points allows us to understand how our embeddings work and which information they contain.
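To make the idea of "neighbours in the original embedding space" concrete, here is a minimal sketch in plain Python. It is not canica's implementation: the function names, the toy 4-dimensional embeddings, and the choice of cosine similarity as the distance measure are all assumptions made for illustration. The point is only that the neighbours are computed from the full embeddings, not from the 2D coordinates of the plot.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(embeddings, index, k):
    """Return the indices of the k points most similar to embeddings[index].

    Hypothetical helper: similarity is measured in the original embedding
    space (cosine similarity is an assumption), never in the 2D projection.
    """
    query = embeddings[index]
    scored = [
        (cosine_similarity(query, emb), i)
        for i, emb in enumerate(embeddings)
        if i != index  # a point is not its own neighbour
    ]
    # Most similar first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy embeddings, invented for this example.
embeddings = [
    [1.0, 0.0, 0.0, 0.1],  # 0: the selected point
    [0.9, 0.1, 0.0, 0.1],  # 1: close to point 0
    [0.0, 1.0, 0.0, 0.0],  # 2: unrelated
    [0.0, 0.9, 0.1, 0.0],  # 3: unrelated
    [0.8, 0.0, 0.2, 0.0],  # 4: fairly close to point 0
]

# k plays the role of the neighbour-count slider.
neighbours = nearest_neighbours(embeddings, index=0, k=2)
print(neighbours)
```

Two points that sit next to each other in the 2D plot may not appear in each other's neighbour lists here, and that mismatch is exactly what highlighting neighbours on the plot helps reveal.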
Canica also allows you to focus on a specific subset of your data.
After selecting a data point and adjusting the neighbour count, the re-plot button will rerun the dimensionality reduction algorithm on the selected subset. The plot will rerender, and we can investigate specific subsets that may not have been clear before.
Thanks to feedback from our internal users at Lakera, canica also highlights the last focused point to help you keep track of this process.
We’re excited to share canica with you and always welcome your feedback and contributions.
You can learn more about canica and how it can support your data analysis by exploring the tutorial notebook in our GitHub repository.