Navigating the Surge of Open Source Large Language Models (LLMs) isn't always easy!

With the wave of generative AI, new LLMs like GPT-4, Llama, and Claude make headlines daily. Yet the question stands – how do we gauge their effectiveness for the everyday user?

🔍 Traditional benchmarks may not align with the practical needs of those less versed in AI or tech – like crafting a simple email or a catchy social media update.

So, I took matters into my own hands. I challenged a range of open-source LLMs with a set of 25 questions reflecting everyday tasks.

Academic benchmarks often fall short for generative models, where a variety of answers could all be considered “correct.” Hence, my approach is more pragmatic, tailored for the average person curious about the utility of LLMs for basic business or personal tasks.

🧪 Each LLM faced the same practical inquiries – from history facts to composing tweets. The goal? To discern which LLM handles everyday tasks best.
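
For anyone who wants to reproduce this kind of test, here is a minimal sketch of the setup, assuming the models are served locally through Ollama's REST API – the model tags and questions below are placeholders, not the exact battery I used:

```python
import json
import time
import requests

# Placeholder inputs: swap in your own model tags and full question list.
MODELS = ["llama2:13b-chat", "zephyr:7b-beta"]
QUESTIONS = [
    "Who was the first person to walk on the moon?",
    "Write a short tweet announcing a bakery's grand opening.",
    # ... a full battery would hold all 25 everyday prompts
]

def ask(model: str, prompt: str) -> tuple[str, float]:
    """Send one prompt to a locally running Ollama server and time the reply."""
    start = time.time()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"], time.time() - start

results = []
for model in MODELS:
    for question in QUESTIONS:
        answer, seconds = ask(model, question)
        results.append({"model": model, "question": question,
                        "answer": answer, "seconds": round(seconds, 1)})

# Dump the raw answers for manual review -- the grading itself stays human.
with open("answers.json", "w") as f:
    json.dump(results, f, indent=2)
```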

The findings will be shared as-is, typos included, to maintain authenticity. Perfect? Far from it. Insightful? That’s the aim.

This isn’t for the AI gurus seeking technical validation but for the everyday user who wants a glimpse into the practical potential of AI tools.

Category highlights: Accuracy – Airoboros · Speed – MLewdBoros · Logic – CasualLM · Creativity – Llama 2 Chat AYB

The LLM Leaderboard

| Model | Parameters | Accuracy | Speed | Creativity | Logical Interpretation | Total Score |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 2 Chat AYB | 13B | 3 | 16 | 43 | 1 | 63 |
| SynthIA v2.0 | 7B | 3 | 13 | 40 | 3 | 59 |
| SynthIA v3.0 | 7B | 2 | 18 | 35.5 | 3 | 58.5 |
| Athena v2 | 13B | 3 | 12 | 39.5 | 3 | 57.5 |
| Casual LM | 7B | 3.5 | 11 | 37.5 | 4 | 56 |
| Wizard Vicuna Uncensored | 7B | 2.5 | 14 | 36 | 3 | 55.5 |
| MLewdBoros LRSGPT 2Char | 13B | 1 | 17 | 35 | 2 | 55 |
| Airoboros 3.1.2 | 34B | 5 | 4 | 42 | 3 | 54 |
| Athena v4 | 13B | 2.5 | 10 | 37.5 | 1 | 51 |
| Minstral OmniMix | 11B | 3.5 | 6 | 38.5 | 3 | 51 |
| U-Amethyst | 20B | 2 | 8 | 39 | 1 | 50 |
| Thespis v0.4 | 13B | 2.5 | 15 | 29 | 3 | 49.5 |
| Zephyr Beta | 7B | 3.5 | 7 | 33 | 3 | 46.5 |
| OpenBuddy Llama2 v13.2 | 70B | 4 | 2 | 32 | 3 | 41 |
| Wizard Vicuna Uncensored | 13B | 2.5 | 9 | 28 | 1 | 40.5 |
| Wizard Vicuna Uncensored | 30B | 4 | 5 | 26 | 1 | 36 |
| Stellar Bright | 70B | 3.5 | 1 | 20.5 | 3 | 28 |
| WizardLM 1.0 Uncensored CodeLlama | 34B | 2.5 | 3 | 15.5 | 2 | 23 |
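
If you would rather poke at the numbers programmatically, here is a small sketch that encodes a few rows of the table above and confirms how the Total Score is built – it is simply the unweighted sum of the four category scores:

```python
# A few rows from the leaderboard above: (model, params, accuracy, speed,
# creativity, logical interpretation, published total score).
ROWS = [
    ("Llama 2 Chat AYB", "13B", 3, 16, 43, 1, 63),
    ("SynthIA v2.0", "7B", 3, 13, 40, 3, 59),
    ("Airoboros 3.1.2", "34B", 5, 4, 42, 3, 54),
    ("Stellar Bright", "70B", 3.5, 1, 20.5, 3, 28),
]

for model, params, acc, speed, creat, logic, total in ROWS:
    # The published total is just the sum of the four category scores.
    assert acc + speed + creat + logic == total, model

# Re-rank by any single category, e.g. accuracy:
by_accuracy = sorted(ROWS, key=lambda r: r[2], reverse=True)
print([r[0] for r in by_accuracy])  # Airoboros 3.1.2 leads on accuracy
```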

As you sift through the data and interpret the scores, keep in mind that the metrics reflect more than mere numbers: they embody the proficiency and potential of these LLMs to serve our informational and communicative needs.

The higher the score, the more adept the LLM is in that domain. A speed rating of 4 marks quicker content delivery than a 2, and an accuracy rating of 4 signals a more trustworthy source than a 2.

But numbers only tell part of the story. Behind each score is a spectrum of use cases and user interactions that truly define the value of these LLMs. As you consider these scores, reflect on how they align with your personal or professional requirements for speed, accuracy, and the often nuanced needs of human-machine communication.
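
One concrete way to align the scores with your own requirements is to re-weight the categories before ranking. Here is a sketch assuming, purely for illustration, that you value accuracy twice as much as everything else – the weights are mine, not part of the original scoring:

```python
# Category scores from the leaderboard: accuracy, speed, creativity, logic.
SCORES = {
    "Llama 2 Chat AYB 13B": (3, 16, 43, 1),
    "Casual LM 7B": (3.5, 11, 37.5, 4),
    "Airoboros 3.1.2 34B": (5, 4, 42, 3),
}

# Illustrative weights: double weight on accuracy, everything else unchanged.
WEIGHTS = (2.0, 1.0, 1.0, 1.0)

def weighted_total(scores: tuple) -> float:
    return sum(w * s for w, s in zip(WEIGHTS, scores))

for name, scores in sorted(SCORES.items(),
                           key=lambda kv: weighted_total(kv[1]),
                           reverse=True):
    print(f"{name}: {weighted_total(scores):.1f}")
```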

May this exploration serve as a guidepost in your journey toward integrating AI into your daily digital life. Whether you seek an LLM for quick assistance or for comprehensive, accurate information retrieval, let these scores inform your choice but not dictate it entirely. After all, the most suitable LLM is the one that fits seamlessly into your workflow, enhancing your efficiency without compromising on quality.