The List of 11 Most Popular Open Source LLMs of 2024
Discover the top 11 open-source Large Language Models (LLMs) that are shaping the landscape of AI. Explore their features, benefits, and challenges in this comprehensive guide to stay updated on the latest developments in the world of language technology.
In today's digital era, large language models (LLMs) have undergone a significant transformation. They've progressed from struggling with human speech intricacies to generating text that closely resembles human writing. These LLMs now excel not only in contextual conversations but also in programming tasks.
The beginnings of LLMs are closely tied to the open-source movement. Pioneering minds and scholars recognized the potential within these models, while understanding the substantial computing resources needed to train them.
This led to the emergence of open-source alternatives, providing practical options for researchers and developers. In this article, we'll explore the top 11 open-source LLMs, comparing their capabilities. We'll also delve into LLM leaderboards and offer guidance on choosing the right LLM for your needs.
Here’s what we’ll cover:
Open source LLMs examples
Leaderboards to Compare LLMs
Open source model development challenges
Choosing the right LLM for your use case: Best practices
What's Next for Open Source LLMs: Summary
But before that…
**💡 Pro tip: Looking for a reliable tool to protect your LLM applications? We've got you covered! Try Lakera Guard for free.**
Now, let’s dive in!
Exploring Popular Open-Source LLMs
While several proprietary LLMs have carved their niche, the open-source arena is bustling with innovation, presenting models that are not only powerful but also accessible to a broader audience.
Let’s take a look.
1. Llama 2
Llama 2 is a cutting-edge collection of pre-trained and fine-tuned generative text models. The series offers models ranging from 7 billion to 70 billion parameters, making it a state-of-the-art tool. Llama-2-Chat, the fine-tuned versions, are designed explicitly for dialogue applications and have been optimized to provide superior performance compared to open-source chat models. They have been evaluated by humans and have received high marks in both helpfulness and safety, putting them on par with popular closed-source models like ChatGPT and PaLM.
Here are the details of this model:
Parameters: 7B, 13B, and 70B
License: Custom commercial license available at Meta's website.
Training Database: Llama 2 was pre-trained on 2 trillion tokens from public data, then fine-tuned with over a million human-annotated instances and public instruction datasets. Meta claims that no Meta user data was used in either phase.
Variants: Llama 2 is available in multiple parameter sizes, including 7B, 13B, and 70B. Both pre-trained and fine-tuned variations are available.
Fine-tuning Techniques: The model employs supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to better align with human preferences, ensuring helpfulness and safety.
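As a quick illustration, here is a minimal sketch of loading a Llama-2-Chat checkpoint with the Hugging Face transformers library. It assumes you have accepted Meta's license and been granted access to the gated meta-llama/Llama-2-7b-chat-hf repository on the Hub:

```python
# Minimal sketch: loading Llama-2-Chat with Hugging Face transformers.
# Assumes access to the gated meta-llama/Llama-2-7b-chat-hf repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce GPU memory
    device_map="auto",          # spread layers across available devices
)

prompt = "Explain what an open-source LLM is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```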
2. OpenLLaMA
OpenLLaMA is an open-source reproduction of Meta AI's famous LLaMA model. The creators of OpenLLaMA have made this permissively licensed model available to the general public. OpenLLaMA is available in 3 billion, 7 billion, and 13 billion parameter sizes, trained on up to 1 trillion tokens.
Training Database: OpenLLaMA was trained using the RedPajama dataset, which has over 1.2 trillion tokens. The developers followed the same preprocessing and training hyperparameters as the original LLaMA paper.
Fine-tuning Techniques: OpenLLaMA uses the same model architecture, context length, training steps, learning rate schedule, and optimizer as described in the original LLaMA paper. The main difference between OpenLLaMA and the original LLaMA is the training dataset.
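Because OpenLLaMA mirrors the LLaMA architecture, it can be loaded with the standard LLaMA classes in transformers. A minimal sketch, assuming the openlm-research/open_llama_7b checkpoint:

```python
# Minimal sketch: OpenLLaMA as a drop-in LLaMA replacement in transformers.
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_id = "openlm-research/open_llama_7b"  # assumed Hub checkpoint
tokenizer = LlamaTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```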
3. Falcon
Falcon models were developed by the Technology Innovation Institute in Abu Dhabi. The Falcon family of language models is state-of-the-art among open models, with Falcon-40B being the most notable, competitive with several closed-source LLMs.
Falcon-40B: The heavyweight of the Falcon family, powerful and efficient, outperforming LLaMA-65B while requiring about 90GB of GPU memory.
Falcon-7B: A top-performing, smaller version that needs only about 15GB of GPU memory, making it suitable for consumer hardware.
Training Database: The Falcon-7B and Falcon-40B models were trained on vast amounts of data, 1.5 trillion and 1 trillion tokens, respectively. The primary training source is the RefinedWeb dataset, which accounts for over 80% of their training material; it is a massive web-scale corpus built on CommonCrawl with an emphasis on quality and scale.
Architecture Notes: Falcon models use multi-query attention, sharing keys and values across attention heads to improve inference scalability.
System Requirements: Falcon-40B: Requires ~90GB of GPU memory, and Falcon-7B: Requires ~15GB of GPU memory.
Package Version Requirements: For optimal performance, it's recommended to use the bfloat16 datatype, which requires a recent version of CUDA and is best suited for modern graphics cards.
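To illustrate the bfloat16 recommendation above, here is a minimal sketch of running Falcon-7B through a transformers pipeline; it assumes the tiiuae/falcon-7b checkpoint and a GPU that supports bfloat16:

```python
# Minimal sketch: running Falcon-7B in bfloat16 with a transformers pipeline.
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,  # recommended datatype for Falcon
    device_map="auto",
)

print(generator("Open-source LLMs are", max_new_tokens=32)[0]["generated_text"])
```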
4. Dolly
Dolly, officially known as dolly-v2-12b, is an instruction-following large language model developed by Databricks. It is based on the pythia-12b model and was fine-tuned on roughly 15,000 instruction/response records created by Databricks employees on the Databricks machine learning platform.
It covers a range of capability domains from the InstructGPT paper, such as brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. Although Dolly is not considered a state-of-the-art model, especially after Databricks acquired MosaicML, it displays exceptional instruction-following behaviour that is not typical of the foundation model it is built upon.
Variants: There are two versions of Dolly: Dolly-v2-7b, which has 6.9 billion parameters and is based on Pythia-6.9b, and Dolly-v2-3b, which has 2.8 billion parameters and is based on Pythia-2.8b.
Database Used for Training: The dataset used for training the model is databricks-dolly-15k. This dataset contains fine-tuning records created by Databricks employees.
Techniques Used for Fine-Tuning: The model was fine-tuned using data from various domains per the InstructGPT paper.
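Dolly ships as a standard Hugging Face checkpoint whose model card relies on custom pipeline code, hence trust_remote_code=True in the minimal sketch below (the prompt is illustrative):

```python
# Minimal sketch: instruction-following generation with dolly-v2-12b.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Dolly's instruction pipeline lives in the model repo
    device_map="auto",
)

res = generate_text("Explain the difference between fine-tuning and pre-training.")
print(res[0]["generated_text"])
```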
5. MPT
The MosaicML company has developed MPT-30B, a decoder-based transformer pre-trained on 1T tokens of both English text and code. It's part of the Mosaic Pretrained Transformer (MPT) series, designed for efficient fine-tuning and LLM deployment. MPT-30B boasts features like an 8k token context window, context-length extrapolation through ALiBi, and the FlashAttention mechanism for fast training and inference.
The model is compatible with both HuggingFace and NVIDIA's FasterTransformer, and its size is optimized for deployment on single GPU setups. The MosaicML NLP team developed MPT-30B on their platform using the LLM codebase found in the llm-foundry repository, which is recommended for fine-tuning and inference.
Variants: There are two models available: MPT-7B and MPT-30B. Each model comes with an instruction and a chat version.
Database Used for Training: 1T tokens of English text and code
System Requirements: The model can be deployed on a single GPU: either 1x A100-80GB in 16-bit precision or 1x A100-40GB in 8-bit precision.
Package Version Requirements for Training: MosaicML recommends utilizing the MosaicML llm-foundry repository to train and fine-tune the model for optimal results. It's worth noting that the MPT-30B tokenizer used in the training process is identical to the EleutherAI/gpt-neox-20b tokenizer.
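As a rough sketch of the points above, MPT checkpoints can be loaded with transformers by enabling trust_remote_code (the model definition lives in the Hub repository), reusing the gpt-neox-20b tokenizer, and optionally raising max_seq_len for ALiBi context-length extrapolation; the exact values below are illustrative:

```python
# Minimal sketch: loading MPT-30B with its custom model code from the Hub.
import torch
import transformers

model_id = "mosaicml/mpt-30b"  # assumed Hub checkpoint
config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.max_seq_len = 16384  # illustrative: ALiBi allows extrapolating past the 8k training context

# MPT-30B uses the same tokenizer as EleutherAI/gpt-neox-20b
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
```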
6. Guanaco
The Guanaco models are LLMs fine-tuned with the QLoRA technique developed by Tim Dettmers and the UW NLP team. With QLoRA, it is possible to fine-tune a 65B-parameter model on a single 48GB GPU without sacrificing performance relative to full 16-bit fine-tuning.
The Guanaco series has reported strong results on chatbot benchmarks, outperforming previously released open models. Because these models are derived from the original LLaMA series, they are not licensed for commercial use. Although Guanaco is not the most advanced model on the market, it introduces the QLoRA method, an efficient fine-tuning technique that enables individuals and smaller businesses to fine-tune models with up to 65 billion parameters.
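To make the QLoRA idea concrete, here is a minimal sketch using the transformers, bitsandbytes, and peft libraries: the base model is loaded in 4-bit NF4 precision and small LoRA adapter matrices are attached for training. The base checkpoint and LoRA hyperparameters below are illustrative, not the exact Guanaco recipe:

```python
# Minimal QLoRA sketch: 4-bit quantized base model + LoRA adapters (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model_id = "openlm-research/open_llama_7b"  # illustrative base checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as introduced by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
    bnb_4bit_use_double_quant=True,         # double quantization saves further memory
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # make the quantized model trainable

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```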
7. BLOOM
BLOOM, which stands for BigScience Large Open-science Open-access Multilingual Language Model, is a powerful language model, trained with large computational resources, that generates text from a given prompt. It is the largest model on this list, with about 176 billion parameters.
It can produce coherent text almost indistinguishable from human-generated content in 46 natural languages and 13 programming languages. When given input text, BLOOM can continue the text to generate relevant continuations by examining the preceding words.
While the direct application of BLOOM is primarily for text generation, the model can be adapted for tasks such as Information Extraction, Question Answering, and text summarization by framing them as text generation tasks.
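For instance, question answering can be framed as plain text continuation. A minimal sketch using a small BLOOM variant (bigscience/bloom-560m is assumed here so the example fits on modest hardware; the full 176B model needs far more memory):

```python
# Minimal sketch: framing question answering as text generation with BLOOM.
from transformers import pipeline

# Small BLOOM variant used for illustration; the full model is ~176B parameters.
generator = pipeline("text-generation", model="bigscience/bloom-560m")

prompt = "Question: What does the acronym LLM stand for?\nAnswer:"
print(generator(prompt, max_new_tokens=16)[0]["generated_text"])
```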
Compute Infrastructure: The model was trained on the Jean Zay public supercomputer using 416 A100 80GB GPUs in total, 384 of them across 48 nodes (8 GPUs per node, linked by NVLink within a node and 4 OmniPath links between nodes), with the remainder held in reserve. Each node has 512GB of CPU memory and 640GB of aggregate GPU memory. Megatron-DeepSpeed, DeepSpeed, PyTorch, and apex were used to train the model.
8. Stanford Alpaca
Alpaca is a language model that follows instructions and generates outputs based on provided data. It has been fine-tuned from a 7B LLaMA model using 52K instruction-following data. In preliminary human evaluations, Alpaca has shown behaviour similar to the text-davinci-003 model in the Self-Instruct instruction-following evaluation suite.
Training Database: The model was fine-tuned on 52K instruction data using modified techniques from the Self-Instruct paper. Data generation leveraged text-davinci-003, a simplified pipeline, and produced one instance per instruction. Fine-tuning employed the Hugging Face training code.
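Alpaca models expect prompts in the instruction format used during fine-tuning. Below is a minimal sketch of building such a prompt; the template follows the format published in the Stanford Alpaca repository, and should be adjusted if your checkpoint was tuned differently:

```python
# Minimal sketch: formatting a prompt in the Alpaca instruction style.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the Alpaca prompt template."""
    return ALPACA_TEMPLATE.format(instruction=instruction)

print(build_prompt("List three benefits of open-source LLMs."))
```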
9. OpenChatKit
OpenChatKit is an open-source toolset that empowers users to create general and specialized chatbot applications. One of the models developed for this toolkit is GPT-NeoXT-Chat-Base-20B-v0.16, an LLM with 20B parameters.
This model is fine-tuned from EleutherAI's GPT-NeoX and focuses on dialogue-style interactions. Its primary function is to perform tasks like answering questions, classification, extraction, and summarization. The model has undergone extensive training with over 40 million instructions on 100% carbon-negative computing.
Training Database: The model was fine-tuned on a collection of over 43 million high-quality instructions. The exact datasets used are listed in the togethercomputer/OpenDataHub repository.
Fine-tuning Techniques: This model has been enhanced and fine-tuned using EleutherAI's GPT-NeoX and feedback data, resulting in better adaptation for human conversation.
System Requirements: Running the GPT-NeoXT-Chat-Base-20B model requires a minimum of about 41GB of free VRAM, with each prompt consuming an additional 100-200 MB. The project's guide recommends using at least one GPU, and inference fits in under 48GB of VRAM.
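As a rough sketch, the chat model can be loaded directly with transformers. The model card formats conversations as alternating <human>: and <bot>: turns; the prompt format below is taken from that card as an assumption:

```python
# Minimal sketch: one-turn inference with GPT-NeoXT-Chat-Base-20B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/GPT-NeoXT-Chat-Base-20B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Assumed chat format: alternating <human>: and <bot>: turns.
prompt = "<human>: Summarize what OpenChatKit is in one sentence.\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```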
10. GPT4All
GPT4All is an ecosystem for training and deploying large language models that run locally on consumer-grade CPUs. It provides assistant-style, instruction-tuned language models that anyone, from individuals to enterprises, can use, distribute, and build upon.
This ecosystem enables users to create and use language models specific to their requirements. These models can operate efficiently on standard CPUs without requiring an internet connection or GPU. Direct installer links are available for macOS, Windows, and Ubuntu.
Here are more details about their models:
Parameters: Model files range from roughly 3GB to 8GB on disk, which typically corresponds to models with 7B to 13B parameters.
Fine-tuning Techniques: The GPT4All software ecosystem supports multiple Transformer architectures, including Falcon, LLaMA (including OpenLLaMA), MPT (including Replit), and GPT-J.
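For local, CPU-only use, the gpt4all Python bindings download a quantized model file and run it without a GPU. A minimal sketch follows; the model filename is illustrative, and any model from the GPT4All catalog can be substituted:

```python
# Minimal sketch: running a GPT4All model locally on CPU.
from gpt4all import GPT4All

# Downloads the model file on first use; the filename is illustrative.
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")

with model.chat_session():
    reply = model.generate("What is an open-source LLM?", max_tokens=128)
    print(reply)
```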
11. FLAN-T5
FLAN-T5 is an improved version of T5 that is specifically designed for zero-shot and few-shot NLP tasks. With over 1000 additional tasks and multiple languages covered, it is a powerful language model optimized for research purposes, including reasoning and question answering.
Google has released several variants of the model, from flan-t5-small with 80 million parameters to flan-t5-xxl with 11 billion parameters. The largest model, flan-t5-xxl, lists support for English, German, and French, while smaller models such as flan-t5-xl list support for 50+ languages.
Variants: Google's FLAN-T5 has been released in 5 variants: flan-t5-small with 80M parameters, flan-t5-base with 250M parameters, flan-t5-large with 780M parameters, flan-t5-xl with 3B parameters, and the largest, flan-t5-xxl, with 11B parameters.
Fine-tuning Techniques: Based on the pretrained T5 model and fine-tuned with instructions for improved zero-shot and few-shot performance.
System Requirements: The model was trained on Google Cloud TPU Pods (TPU v3 or TPU v4, with a minimum of four chips) using the t5x codebase together with jax, so those packages are required for training.
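Because FLAN-T5 is a sequence-to-sequence model, it is loaded with the seq2seq classes in transformers. A minimal zero-shot sketch with the smallest variant:

```python
# Minimal sketch: zero-shot inference with FLAN-T5 (sequence-to-sequence).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

prompt = "Translate English to German: How old are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```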
Each of these 11 LLMs comes with distinctive features and specifications that cater to a range of users. Whether your focus is on portability, performance, or budget-friendliness, you will find a model designed to match your requirements.
However, while open-source options offer great advantages, the process of developing and selecting the right model can pose its challenges. Let’s explore them.
Leaderboards to Compare LLMs: Navigating the Ever-Evolving Landscape
LLMs are always changing, as new models keep appearing. While having lots of options is exciting, it can also be a bit overwhelming for developers, researchers, and tech enthusiasts. To help with this changing landscape, LLM leaderboards give us a clear picture of how different language models perform.
The HuggingFace Open LLM Leaderboard is a platform designed to track, rank and assess LLMs and chatbots as they gain popularity. It is unique because it is open to the community, allowing anyone to submit their model for automatic evaluation on the HuggingFace GPU cluster. The only requirement is that the model is a HuggingFace Transformers model with weights available on the Hub. They also allow for model evaluations with delta-weights for non-commercial licensed models, like the original LLaMa release. Users can easily filter models based on their type, whether pre-trained, fine-tuned, instruction-tuned or RL-tuned.
The evaluation process used by the Chatbot Arena Leaderboard involves three benchmarks: Chatbot Arena, MT-Bench, and MMLU (5-shot). Models compete on Chatbot Arena in randomized, anonymous battles, answer multi-turn questions on MT-Bench, and undergo a multitask accuracy test on MMLU (5-shot) across 57 tasks. The leaderboard then aggregates these results into ratings and scores.
The AlpacaEval Leaderboard has been designed to evaluate LLMs' ability to follow instructions. Models are ranked by their win rate against a reference model, with average output length also reported.
Open Source Model Development Challenges
Open-source development for Large Language Models (LLMs) brings numerous advantages, like collaboration, transparency, and innovation. However, building and maintaining these models presents its share of challenges, including:
Cost: Developing and maintaining open-source models can be financially burdensome, particularly for smaller teams and individual developers.
Companies like MosaicML and Databricks are trying to make fine-tuning more accessible through their platforms. Others, like Lambda, are working on reducing GPU costs. Still, cost-related issues persist.
Privacy Issues: Navigating sensitive data collection and storage within open-source models poses privacy hurdles demanding careful management.
Bias/Fairness: Eliminating biases and fostering impartial outcomes in open-source models is a pivotal challenge for achieving equitable AI.
Model Interpretability: Enhancing interpretability of intricate models is vital, particularly as open-source models can resemble "black boxes," obscuring decision-making transparency.
Version Control and Compatibility: The collaborative nature of open-source development introduces hurdles in maintaining version control and ensuring cross-platform compatibility.
Choosing the Right LLM - Best Practices
Selecting the appropriate Large Language Model (LLM) for your business use case requires a systematic approach. Here's a short step-by-step guide to help you make the right choice:
Identify Your Needs: Define your specific use case and objectives. Are you aiming for content generation, sentiment analysis, translation, or something else?
Analyze Data Requirements: Assess the amount and nature of data available for training and fine-tuning the model. Some LLMs require substantial data for optimal performance.
Consider Model Size: Choose a model size that aligns with your computational resources and performance needs. Larger models might offer better accuracy but come with increased resource demands.
Evaluate Pre-trained Models: Investigate existing pre-trained models that match your use case. Models like GPT-3, BERT, and T5 offer different strengths, such as language understanding or generation.
Customization Flexibility: Determine if the model allows fine-tuning for domain-specific language and nuances. Some models offer greater customization options.
Check Interpretability: Ensure the model provides insights into its decision-making process. Transparent models are crucial for understanding and troubleshooting.
Assess Bias and Fairness: Examine how the model addresses bias in language generation. Consider models that prioritize fairness and inclusivity.
API and Integration: Evaluate the availability and ease of using the model's API. Compatibility with your existing systems is essential for seamless integration.
Resource Requirements: Consider the hardware and computational resources needed to run the model effectively. Choose a model that aligns with your infrastructure.
Test and Validate: Before finalizing, conduct testing and validation to ensure the chosen model performs well on your specific tasks.
Monitor Performance: Regularly assess the model's performance in real-world scenarios and fine-tune if necessary to maintain accuracy and relevance.
What's Next for Open Source LLMs: Summary
Looking ahead to 2024 and beyond, we can expect the open-source LLM landscape to keep flourishing, with new models introduced regularly.
The 11 models that we’ve listed have made powerful language processing accessible, overcoming cost and proprietary hurdles.
However, they also face a multitude of challenges like cost, privacy, bias, and scalability. Users must consider these against benefits like customization, cost savings, and security, compared to proprietary LLMs that offer support but with fees and less flexibility.
Yet, the open-source community is committed to ethical, user-centric models. As technology evolves, these LLMs will progress, driving innovative, collaborative, and responsible AI-driven language processing.
We are excited to see what lies ahead for the AI community and hope you are, too!
Learn how to protect against the most common LLM vulnerabilities
Download this guide to delve into the most common LLM security risks and ways to mitigate them.