Just How Accurate are these LLMs?

When you start diving deep into the fascinating world of open-source Large Language Models (LLMs), you find out it really is quite the adventure! These AI-driven wonders are transforming how we interact with technology, offering a helping hand in everything from writing emails to whipping up an epic blog post.

But, how accurate are they really? 🎯

Well, truth be told, it’s a mix. While these models can churn out content that feels like it was written by a human, they’re not infallible. Yes, they can astound us with detailed explanations and creative storytelling, but they can also slip up with inaccuracies or misunderstandings of complex topics.

The beauty of open-source models? We get to tinker and improve upon them! 💡 They’re like digital clay, constantly being molded by a community of brilliant minds from all corners of the globe. 🌐

So, if you’re using an LLM for work or just for fun, remember to give it a little fact-check, especially if it’s about something super important. And if you’re into AI, why not contribute to these projects? Your expertise could help sharpen the accuracy of these digital oracles!


The LLM Accuracy Leaderboard

| Model | Parameters | Q1: Chargeback | Q2: Cleopatra | Q3: Norse | Q4: 1/1 | Q5: 7th Prez | Q6: Planets | Other Deductions | Total |
|---|---|---|---|---|---|---|---|---|---|
| Airoboros 3.1.2 | 34B | Accurate | Not Accurate | Accurate | Accurate | Accurate | Accurate | 0 | 5 |
| Athena v2 | 13B | Accurate | Not Accurate | Not Accurate | Accurate | Not Accurate | Accurate | 0 | 3 |
| Athena v4 | 13B | Accurate | Partly | Not Accurate | Not Accurate | Not Accurate | Accurate | 0 | 2.5 |
| Casual LM | 7B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 0 | 3.5 |
| Llama 2 Chat AYB | 13B | Accurate | Not Accurate | Accurate | Not Accurate | Not Accurate | Accurate | 0 | 3 |
| Minstral OmniMix | 11B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 0 | 3.5 |
| MLewdBoros LRSGPT 2Char | 13B | Accurate | Not Accurate | Not Accurate | Not Accurate | Accurate | Accurate | 2 | 1 |
| OpenBuddy Llama2 v13.2 | 70B | Accurate | Accurate | Accurate | Accurate | Not Accurate | Accurate | 1 | 4 |
| Stellar Bright | 70B | Accurate | Partly | Not Accurate | Accurate | Accurate | Accurate | 1 | 3.5 |
| SynthIA v2.0 | 7B | Accurate | Not Accurate | Not Accurate | Accurate | Not Accurate | Accurate | 0 | 3 |
| SynthIA v3.0 | 7B | Accurate | Not Accurate | Partly | Accurate | Partly | Accurate | 2 | 2 |
| Thespis v0.4 | 13B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 1 | 2.5 |
| U-Amethyst | 20B | Accurate | Not Accurate | Not Accurate | Not Accurate | Not Accurate | Accurate | 0 | 2 |
| Wizard Vicuna Uncensored | 7B | Accurate | Not Accurate | Partly | Not Accurate | Not Accurate | Accurate | 0 | 2.5 |
| Wizard Vicuna Uncensored | 13B | Accurate | Not Accurate | Accurate | Not Accurate | Partly | Accurate | 1 | 2.5 |
| Wizard Vicuna Uncensored | 30B | Accurate | Not Accurate | Accurate | Accurate | Not Accurate | Accurate | 0 | 4 |
| WizardLM 1.0 Uncensored CodeLlama | 34B | Accurate | Not Accurate | Not Accurate | Accurate | Not Accurate | Accurate | 0.5 | 2.5 |
| Zephyr Beta | 7B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 0 | 3.5 |

The prowess of Large Language Models (LLMs) often shines in the most human of tasks: composing an eloquent email or crafting a captivating social media narrative. Yet beyond these creative exploits lies a core question that's less about artistry and more about integrity: how reliable are these models when it comes to factual precision?

My foray into the assessment of LLMs pivoted on this very inquiry. I set aside the inherently subjective elements—such as the perennial debate over the “best” hashtags—to focus squarely on verifiable truths. From a pool of 25 probing questions, I distilled a suite of 6 stringent benchmarks designed to scrutinize the models’ commitment to accuracy.

The questions ranged from the procedural intricacies of finance (“What is a chargeback?”) to the vestiges of history (“How many children did Cleopatra of ancient Egypt have?”). They spanned the gamut of knowledge, including etymology (“Hundred was derived from what Norse number?”), elementary arithmetic (“If 1+1=2 and 1*1=1, then what is 1/1?”), presidential lineage (“Who was the 7th President of the United States and how many children did he have?”), and even astronomy (“How many planets are in our solar system?”).

But accuracy wasn't the sole barometer of competence; clarity counts too. So points were deducted for responses marred by typographical blunders or other glaring errors, reflecting the standards expected of any tool meant to augment our daily lives.
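For the curious, here is a minimal sketch of how the totals in the leaderboard above appear to work out. The point values are my reading of the table rather than an official rubric: Accurate seems to be worth 1 point, Partly 0.5, Not Accurate 0, with any "Other Deductions" subtracted at the end.

```python
# Minimal scoring sketch -- point values are inferred from the leaderboard,
# not spelled out anywhere in the post.
RATING_POINTS = {"Accurate": 1.0, "Partly": 0.5, "Not Accurate": 0.0}

def score_model(ratings, other_deductions=0.0):
    """Sum the six per-question scores, then subtract clarity/typo deductions."""
    return sum(RATING_POINTS[r] for r in ratings) - other_deductions

# Example: Stellar Bright (70B) from the table above.
stellar_bright = ["Accurate", "Partly", "Not Accurate",
                  "Accurate", "Accurate", "Accurate"]
print(score_model(stellar_bright, other_deductions=1))  # 3.5, matching its Total
```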

Join me as I unravel the findings from this deep dive into the fact-finding faculties of LLMs, and let’s discover together just how much we can trust these digital intellects to not only emulate human creativity but also to echo our penchant for precision.

What is a chargeback?

Simply put, a chargeback is when the card issuer (i.e., the bank behind your Visa or MasterCard) reverses a disputed transaction. It's like a forced refund. When a chargeback happens, the disputed funds are held from the business until the card issuer works things out and decides what to do. If the issuer rules against the business, those funds go to the cardholder; if it rules in the business's favor, the disputed funds are released back to the business.
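If it helps to see that flow as moving parts, here is a tiny illustrative sketch in plain Python (not any payment processor's real API, just the sequence described above): the disputed amount sits on hold until the issuer rules one way or the other.

```python
# Illustrative chargeback flow -- a toy model, not a real payments API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dispute:
    amount: float                              # disputed funds, held from the business
    ruled_for_merchant: Optional[bool] = None  # None = issuer still deciding

def settle(dispute: Dispute) -> str:
    """Describe where the held funds go once the card issuer rules."""
    if dispute.ruled_for_merchant is None:
        return f"${dispute.amount:.2f} held from the business pending review"
    if dispute.ruled_for_merchant:
        return f"${dispute.amount:.2f} released back to the business"
    return f"${dispute.amount:.2f} refunded to the cardholder"

print(settle(Dispute(49.99)))                            # still on hold
print(settle(Dispute(49.99, ruled_for_merchant=False)))  # the "forced refund" case
```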

How many children did Cleopatra of ancient Egypt have?

The answer is four. Her eldest son Caesarion was born on June 23, 47 BC. His father was Julius Caesar. After Julius Caesar died, Cleopatra eventually began a relationship with Marc Antony. Together they had three children. Their first two children were the twins Alexander Helios and Cleopatra Selene, born in 40 BC. After the twins, Cleopatra gave birth to her fourth and final child (her third with Marc Antony). Her youngest child was a son whom she named Ptolemy Philadelphus. He was born in either August or September of 36 BC, meaning he was only 6 years old when his mother died.

Hundred was derived from what Norse number?

The short answer would be hundrað.

Basically, the word hundred comes from Old English hundred, from Proto-West Germanic *hundarad, from Proto-Germanic *hundaradą (“hundred”); some forms are remodeled on Old Norse hundrað.

If 1+1=2 and 1*1=1, then what is 1/1?

The short answer is 1. 1 divided by 1 = 1.
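Since this one is pure arithmetic, it's trivial to double-check in code (a throwaway Python sanity check, nothing more):

```python
# Check the premises from the question and the expected answer.
assert 1 + 1 == 2
assert 1 * 1 == 1
print(1 / 1)  # 1.0 -- any nonzero number divided by itself is 1
```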

Who was the 7th President of the United States, and how many children did he have?

Andrew Jackson was the 7th President of the United States.

Andrew and Rachel never had any biological children of their own, yet there were always children living at The Hermitage.

When Rachel's brother and his wife had twin boys, the Jacksons adopted one of them as their own in 1809. Named Andrew Jackson Jr., he would grow up at The Hermitage but also remain close to his biological brother and parents, who were now legally his cousin, aunt, and uncle. He went on to marry and have five children of his own.

Jackson also became the legal guardian of a number of other children, including a Native American orphan he found after a battle.

How many planets are in our solar system?

8. The eight planets are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.