Navigating the Surge of Open Source Large Language Models (LLMs) isn't always easy!

With the wave of generative AI, new LLMs like GPT-4, Llama, and Claude make headlines daily. Yet the question stands – how do we gauge their effectiveness for the everyday user?

🔍 Traditional benchmarks may not align with the practical needs of those less versed in AI or tech – like crafting a simple email or a catchy social media update.

So, I took matters into my own hands. I challenged a range of open-source LLMs with a set of 25 questions reflecting everyday tasks.

Academic benchmarks often fall short for generative models, where a variety of answers could all be considered “correct.” Hence, my approach is more pragmatic, tailored for the average person curious about the utility of LLMs for basic business or personal tasks.

🧪 Each LLM faced the same practical inquiries – from history facts to composing tweets. The goal? To discern which LLM handles everyday tasks best.
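
For anyone who wants to reproduce this kind of test, here is a minimal sketch of the setup, assuming the models are served locally through Ollama's REST API – the model tags and questions below are placeholders, not the exact battery I used:

```python
import json
import time
import requests

# Placeholder inputs: swap in your own model tags and full question list.
MODELS = ["llama2:13b-chat", "zephyr:7b-beta"]
QUESTIONS = [
    "Who was the first person to walk on the moon?",
    "Write a short tweet announcing a bakery's grand opening.",
    # ... a full battery would hold all 25 everyday prompts
]

def ask(model: str, prompt: str) -> tuple[str, float]:
    """Send one prompt to a locally running Ollama server and time the reply."""
    start = time.time()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"], time.time() - start

results = []
for model in MODELS:
    for question in QUESTIONS:
        answer, seconds = ask(model, question)
        results.append({"model": model, "question": question,
                        "answer": answer, "seconds": round(seconds, 1)})

# Dump the raw answers for manual review -- the grading itself stays human.
with open("answers.json", "w") as f:
    json.dump(results, f, indent=2)
```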

The findings will be shared as-is, typos included, to maintain authenticity. Perfect? Far from it. Insightful? That’s the aim.

This isn’t for the AI gurus seeking technical validation but for the everyday user who wants a glimpse into the practical potential of AI tools.

Category highlights: Accuracy – Airoboros · Speed – MLewdBoros · Logic – CasualLM · Creativity – Llama 2 Chat AYB

The LLM Leaderboard

| Model | Parameters | Accuracy | Speed | Creativity | Logical Interpretation | Total Score |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 2 Chat AYB | 13B | 3 | 16 | 43 | 1 | 63 |
| SynthIA v2.0 | 7B | 3 | 13 | 40 | 3 | 59 |
| SynthIA v3.0 | 7B | 2 | 18 | 35.5 | 3 | 58.5 |
| Athena v2 | 13B | 3 | 12 | 39.5 | 3 | 57.5 |
| Casual LM | 7B | 3.5 | 11 | 37.5 | 4 | 56 |
| Wizard Vicuna Uncensored | 7B | 2.5 | 14 | 36 | 3 | 55.5 |
| MLewdBoros LRSGPT 2Char | 13B | 1 | 17 | 35 | 2 | 55 |
| Airoboros 3.1.2 | 34B | 5 | 4 | 42 | 3 | 54 |
| Athena v4 | 13B | 2.5 | 10 | 37.5 | 1 | 51 |
| Minstral OmniMix | 11B | 3.5 | 6 | 38.5 | 3 | 51 |
| U-Amethyst | 20B | 2 | 8 | 39 | 1 | 50 |
| Thespis v0.4 | 13B | 2.5 | 15 | 29 | 3 | 49.5 |
| Zephyr Beta | 7B | 3.5 | 7 | 33 | 3 | 46.5 |
| OpenBuddy Llama2 v13.2 | 70B | 4 | 2 | 32 | 3 | 41 |
| Wizard Vicuna Uncensored | 13B | 2.5 | 9 | 28 | 1 | 40.5 |
| Wizard Vicuna Uncensored | 30B | 4 | 5 | 26 | 1 | 36 |
| Stellar Bright | 70B | 3.5 | 1 | 20.5 | 3 | 28 |
| WizardLM 1.0 Uncensored CodeLlama | 34B | 2.5 | 3 | 15.5 | 2 | 23 |
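
If you would rather poke at the numbers programmatically, here is a small sketch that encodes a few rows of the table above and confirms how the Total Score is built – it is simply the unweighted sum of the four category scores:

```python
# A few rows from the leaderboard above: (model, params, accuracy, speed,
# creativity, logical interpretation, published total score).
ROWS = [
    ("Llama 2 Chat AYB", "13B", 3, 16, 43, 1, 63),
    ("SynthIA v2.0", "7B", 3, 13, 40, 3, 59),
    ("Airoboros 3.1.2", "34B", 5, 4, 42, 3, 54),
    ("Stellar Bright", "70B", 3.5, 1, 20.5, 3, 28),
]

for model, params, acc, speed, creat, logic, total in ROWS:
    # The published total is just the sum of the four category scores.
    assert acc + speed + creat + logic == total, model

# Re-rank by any single category, e.g. accuracy:
by_accuracy = sorted(ROWS, key=lambda r: r[2], reverse=True)
print([r[0] for r in by_accuracy])  # Airoboros 3.1.2 leads on accuracy
```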

As you sift through the data and interpret the scores, keep in mind that the metrics reflect more than mere numbers: they embody the proficiency and potential of these LLMs to serve our informational and communicative needs.

The higher the score, the more adept the LLM is in that domain. A speed rating of 4 marks quicker content delivery than a 2, and an accuracy rating of 4 signals a more trustworthy source than a 2.

But numbers only tell part of the story. Behind each score is a spectrum of use cases and user interactions that truly define the value of these LLMs. As you consider these scores, reflect on how they align with your personal or professional requirements for speed, accuracy, and the often nuanced needs of human-machine communication.
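
One concrete way to align the scores with your own requirements is to re-weight the categories before ranking. Here is a sketch assuming, purely for illustration, that you value accuracy twice as much as everything else – the weights are mine, not part of the original scoring:

```python
# Category scores from the leaderboard: accuracy, speed, creativity, logic.
SCORES = {
    "Llama 2 Chat AYB 13B": (3, 16, 43, 1),
    "Casual LM 7B": (3.5, 11, 37.5, 4),
    "Airoboros 3.1.2 34B": (5, 4, 42, 3),
}

# Illustrative weights: double weight on accuracy, everything else unchanged.
WEIGHTS = (2.0, 1.0, 1.0, 1.0)

def weighted_total(scores: tuple) -> float:
    return sum(w * s for w, s in zip(WEIGHTS, scores))

for name, scores in sorted(SCORES.items(),
                           key=lambda kv: weighted_total(kv[1]),
                           reverse=True):
    print(f"{name}: {weighted_total(scores):.1f}")
```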

May this exploration serve as a guidepost in your journey toward integrating AI into your daily digital life. Whether you seek an LLM for quick assistance or for comprehensive, accurate information retrieval, let these scores inform your choice but not dictate it entirely. After all, the most suitable LLM is the one that fits seamlessly into your workflow, enhancing your efficiency without compromising on quality.