Navigating the Surge of Open Source Large Language Models (LLMs) isn't always easy!
With the wave of generative AI, the appearance of new LLMs like GPT-4, Llama, or Claude has become a daily headline. Yet, the question stands – how do we gauge their effectiveness for the everyday user?
Traditional benchmarks may not align with the practical needs of people less versed in AI or tech – needs like crafting a simple email or a catchy social media update.
So, I took matters into my own hands. I challenged a range of open-source LLMs with a set of 25 questions reflecting everyday tasks.
The academic benchmarks often fall short for generative models where a variety of answers could all be considered “correct.” Hence, my approach is more pragmatic and tailored for the average person curious about the utility of LLMs for basic business or personal tasks.
Each LLM faced the same practical inquiries – from history facts to composing tweets. The goal? To discern which LLM handles everyday tasks best.
The findings will be shared as-is, typos included, to maintain authenticity. Perfect? Far from it. Insightful? That’s the aim.
This isn’t for the AI gurus seeking technical validation but for the everyday user who wants a glimpse into the practical potential of AI tools.
Each model's answers were scored on four criteria:

- Accuracy
- Speed
- Logic
- Creativity
The LLM Leaderboard
Model | Parameters | Accuracy | Speed | Creativity | Logical Interpretation | Total Score |
---|---|---|---|---|---|---|
Llama 2 Chat AYB | 13B | 3 | 16 | 43 | 1 | 63 |
SynthIA v2.0 | 7B | 3 | 13 | 40 | 3 | 59 |
SynthIA v3.0 | 7B | 2 | 18 | 35.5 | 3 | 58.5 |
Athena v2 | 13B | 3 | 12 | 39.5 | 3 | 57.5 |
Casual LM | 7B | 3.5 | 11 | 37.5 | 4 | 56 |
Wizard Vicuna Uncensored | 7B | 2.5 | 14 | 36 | 3 | 55.5 |
MLewdBoros LRSGPT 2Char | 13B | 1 | 17 | 35 | 2 | 55 |
Airoboros 3.1.2 | 34B | 5 | 4 | 42 | 3 | 54 |
Athena v4 | 13B | 2.5 | 10 | 37.5 | 1 | 51 |
Mistral OmniMix | 11B | 3.5 | 6 | 38.5 | 3 | 51 |
U-Amethyst | 20B | 2 | 8 | 39 | 1 | 50 |
Thespis v0.4 | 13B | 2.5 | 15 | 29 | 3 | 49.5 |
Zephyr Beta | 7B | 3.5 | 7 | 33 | 3 | 46.5 |
OpenBuddy Llama2 v13.2 | 70B | 4 | 2 | 32 | 3 | 41 |
Wizard Vicuna Uncensored | 13B | 2.5 | 9 | 28 | 1 | 40.5 |
Wizard Vicuna Uncensored | 30B | 4 | 5 | 26 | 1 | 36 |
Stellar Bright | 70B | 3.5 | 1 | 20.5 | 3 | 28 |
WizardLM 1.0 Uncensored CodeLlama | 34B | 2.5 | 3 | 15.5 | 2 | 23 |
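For readers who want to recompute or re-sort the leaderboard themselves, here is a minimal Python sketch. It assumes, as every row above suggests, that the Total Score is simply the sum of the four category scores; only a few rows from the table are included for illustration.

```python
# Minimal sketch: recompute and sort leaderboard totals.
# Assumption: Total Score = Accuracy + Speed + Creativity + Logical Interpretation,
# which matches each row in the table above. Only a handful of rows are included.
leaderboard = [
    # (model, parameters, accuracy, speed, creativity, logic)
    ("Llama 2 Chat AYB", "13B", 3, 16, 43, 1),
    ("SynthIA v2.0", "7B", 3, 13, 40, 3),
    ("Airoboros 3.1.2", "34B", 5, 4, 42, 3),
    ("Stellar Bright", "70B", 3.5, 1, 20.5, 3),
]

def total(row):
    """Sum the four category scores for one leaderboard row."""
    _model, _params, accuracy, speed, creativity, logic = row
    return accuracy + speed + creativity + logic

# Rank the models by total score, highest first.
for row in sorted(leaderboard, key=total, reverse=True):
    model, params = row[0], row[1]
    print(f"{model} ({params}): {total(row)}")
```

Swapping in your own category scores is all it takes to re-rank the models against whatever criteria matter most to you.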
As you sift through the data and interpret the scores, it’s essential to keep in mind that the metrics reflect more than mere numbers — they embody the proficiency and potential of these Large Language Models (LLMs) to serve our informational and communicative needs.
The higher the score, the more adept the LLM is in that domain. A speed rating of 4, for instance, marks swifter content delivery than a score of 2, just as an accuracy rating of 4 indicates a more trustworthy source than a 2.
But numbers only tell part of the story. Behind each score is a spectrum of use cases and user interactions that truly define the value of these LLMs. As you consider these scores, reflect on how they align with your personal or professional requirements for speed, accuracy, and the often nuanced needs of human-machine communication.
May this exploration serve as a guidepost in your journey toward integrating AI into your daily digital life. Whether you seek an LLM for quick assistance or for comprehensive, accurate information retrieval, let these scores inform your choice but not dictate it entirely. After all, the most suitable LLM is the one that fits seamlessly into your workflow, enhancing your efficiency without compromising on quality.