Just How Accurate are these LLMs?
When you start diving deep into the fascinating world of open-source Large Language Models (LLMs), you find out it really is quite the adventure! These AI-driven wonders are transforming how we interact with technology, offering a helping hand with everything from writing emails to whipping up an epic blog post.
But, how accurate are they really?
Well, truth be told, it’s a mix. While these models can churn out content that feels like it was written by a human, they’re not infallible. Yes, they can astound us with detailed explanations and creative storytelling, but they can also slip up with inaccuracies or misunderstandings of complex topics.
The beauty of open-source models? We get to tinker and improve upon them! They’re like digital clay, constantly being molded by a community of brilliant minds from all corners of the globe.
So, if you’re using an LLM for work or just for fun, remember to give it a little fact-check, especially if it’s about something super important. And if you’re into AI, why not contribute to these projects? Your expertise could help sharpen the accuracy of these digital oracles!
The LLM Accuracy Leaderboard
Model | Parameters | Question 1: Chargeback | Question 2: Cleopatra | Question 3: Norse | Question 4: 1/1 | Question 5: 7th Prez | Question 6: Planets | Other Deductions | Total
---|---|---|---|---|---|---|---|---|---
Airoboros 3.1.2 | 34B | Accurate | Not Accurate | Accurate | Accurate | Accurate | Accurate | 0 | 5 |
Athena v2 | 13B | Accurate | Not Accurate | Not Accurate | Accurate | Not Accurate | Accurate | 0 | 3 |
Athena v4 | 13B | Accurate | Partly | Not Accurate | Not Accurate | Not Accurate | Accurate | 0 | 2.5 |
Casual LM | 7B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 0 | 3.5 |
Llama 2 Chat AYB | 13B | Accurate | Not Accurate | Accurate | Not Accurate | Not Accurate | Accurate | 0 | 3 |
Minstral OmniMix | 11B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 0 | 3.5 |
MLewdBoros LRSGPT 2Char | 13B | Accurate | Not Accurate | Not Accurate | Not Accurate | Accurate | Accurate | 2 | 1 |
OpenBuddy Llama2 v13.2 | 70B | Accurate | Accurate | Accurate | Accurate | Not Accurate | Accurate | 1 | 4 |
Stellar Bright | 70B | Accurate | Partly | Not Accurate | Accurate | Accurate | Accurate | 1 | 3.5 |
SynthIA v2.0 | 7B | Accurate | Not Accurate | Not Accurate | Accurate | Not Accurate | Accurate | 0 | 3 |
SynthIA v3.0 | 7B | Accurate | Not Accurate | Partly | Accurate | Partly | Accurate | 2 | 2 |
Thespis v0.4 | 13B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 1 | 2.5 |
U-Amethyst | 20B | Accurate | Not Accurate | Not Accurate | Not Accurate | Not Accurate | Accurate | 0 | 2 |
Wizard Vicuna Uncensored | 7B | Accurate | Not Accurate | Partly | Not Accurate | Not Accurate | Accurate | 0 | 2.5 |
Wizard Vicuna Uncensored | 13B | Accurate | Not Accurate | Accurate | Not Accurate | Partly | Accurate | 1 | 2.5 |
Wizard Vicuna Uncensored | 30B | Accurate | Not Accurate | Accurate | Accurate | Not Accurate | Accurate | 0 | 4 |
WizardLM 1.0 Uncensored CodeLlama | 34B | Accurate | Not Accurate | Not Accurate | Accurate | Not Accurate | Accurate | 0.5 | 2.5 |
Zephyr Beta | 7B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 0 | 3.5 |
The prowess of Large Language Models (LLMs) often shines in the most human of tasks: composing an eloquent email or crafting a captivating social media narrative. Yet, beyond these creative exploits lies a core question that's less about artistry and more about integrity: how reliable are these models when it comes to factual precision?
My foray into the assessment of LLMs pivoted on this very inquiry. I set aside the inherently subjective elements—such as the perennial debate over the “best” hashtags—to focus squarely on verifiable truths. From a pool of 25 probing questions, I distilled a suite of 6 stringent benchmarks designed to scrutinize the models’ commitment to accuracy.
The questions ranged from the procedural intricacies of finance (“What is a chargeback?”) to the vestiges of history (“How many children did Cleopatra of ancient Egypt have?”). They spanned the gamut of knowledge, including etymology (“Hundred was derived from what Norse number?”), elementary arithmetic (“If 1+1=2 and 1*1=1, then what is 1/1?”), presidential lineage (“Who was the 7th President of the United States and how many children did he have?”), and even astronomy (“How many planets are in our solar system?”).
But accuracy wasn’t the sole barometer of competence — clarity is also important. Thus, points were deducted for responses marred by typographical blunders or other glaring errors, reflecting the stringent standards expected of any tool meant to augment our daily lives.
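For anyone curious how the totals in the leaderboard above were tallied, here's a minimal sketch of the rubric: one point per accurate answer, half a point for a partly accurate one, minus any deductions for typos or other glaring errors. The `score` helper is purely illustrative, not part of any published tool.

```python
# Points awarded per answer rating.
POINTS = {"Accurate": 1.0, "Partly": 0.5, "Not Accurate": 0.0}

def score(answers, deductions=0.0):
    """Sum per-question points and subtract any deductions.

    answers: list of 'Accurate' / 'Partly' / 'Not Accurate' labels,
    one per benchmark question.
    """
    return sum(POINTS[a] for a in answers) - deductions

# Example: Athena v4's row from the table above.
athena_v4 = ["Accurate", "Partly", "Not Accurate",
             "Not Accurate", "Not Accurate", "Accurate"]
print(score(athena_v4))  # 2.5
```

Running the same helper against any other row of the table reproduces its Total column.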
Join me as I unravel the findings from this deep dive into the fact-finding faculties of LLMs, and let’s discover together just how much we can trust these digital intellects to not only emulate human creativity but also to echo our penchant for precision.
What is a chargeback?
Simply put, a chargeback is when the credit card provider (e.g., Visa, MasterCard, or your bank) reverses a disputed transaction. It's like a forced refund. When a chargeback happens, the disputed funds are held from the business until the card issuer works things out and decides what to do. If the bank rules against you, those funds are returned to the cardholder. If the bank rules in your favor, they'll send the disputed funds back to you.
How many children did Cleopatra of ancient Egypt have?
The answer is four. Her eldest son, Caesarion, was born on June 23, 47 BC; his father was Julius Caesar. After Julius Caesar died, Cleopatra eventually began a relationship with Marc Antony, and together they had three children. Their first two were the twins Alexander Helios and Cleopatra Selene, born in 40 BC. After the twins, Cleopatra gave birth to her fourth and final child (her third with Marc Antony), a son she named Ptolemy Philadelphus. He was born in August or September of 36 BC, meaning he was only about 6 years old when his mother died.
Hundred was derived from what Norse number?
The short answer would be hundrað.
Basically, the word hundred comes from Old English hundred, from Proto-West Germanic *hundarad, from Proto-Germanic *hundaradą (“hundred”); some forms are remodeled on Old Norse hundrað.
If 1+1=2 and 1*1=1, then what is 1/1?
The short answer is 1. 1 divided by 1 = 1.
Who was the 7th President of the United States, and how many children did he have?
Andrew Jackson was the 7th President of the United States.
Andrew and Rachel never had any biological children of their own, yet there were always children living at The Hermitage.
When Rachel's brother and his wife had twin boys, the Jacksons adopted one of them as their own in 1809. Named Andrew Jackson Jr., he grew up at The Hermitage while remaining close to his biological brother and parents, now his cousin, aunt, and uncle. He went on to marry and have five children of his own.
Jackson also became the legal guardian to a number of other children, including a Native American orphan Jackson found in battle.
How many planets are in our solar system?
8. The eight planets are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.