Just How Accurate are these LLMs?
When you start diving deep into the fascinating world of open-source Large Language Models (LLMs), you find out it really is quite the adventure! These AI-driven wonders are transforming how we interact with technology, offering a helping hand with everything from writing emails to whipping up an epic blog post.
But, how accurate are they really?
Well, truth be told, it’s a mix. While these models can churn out content that feels like it was written by a human, they’re not infallible. Yes, they can astound us with detailed explanations and creative storytelling, but they can also slip up with inaccuracies or misunderstandings of complex topics.
The beauty of open-source models? We get to tinker and improve upon them! They’re like digital clay, constantly being molded by a community of brilliant minds from all corners of the globe.
So, if you’re using an LLM for work or just for fun, remember to give it a little fact-check, especially if it’s about something super important. And if you’re into AI, why not contribute to these projects? Your expertise could help sharpen the accuracy of these digital oracles!
The LLM Accuracy Leaderboard
Model | Parameters | Question 1: Chargeback | Question 2: Cleopatra | Question 3: Norse | Question 4: 1/1 | Question 5: 7th Prez | Question 6: Planets | Other Deductions | Total
---|---|---|---|---|---|---|---|---|---
Airoboros 3.1.2 | 34B | Accurate | Not Accurate | Accurate | Accurate | Accurate | Accurate | 0 | 5 |
Athena v2 | 13B | Accurate | Not Accurate | Not Accurate | Accurate | Not Accurate | Accurate | 0 | 3 |
Athena v4 | 13B | Accurate | Partly | Not Accurate | Not Accurate | Not Accurate | Accurate | 0 | 2.5 |
Casual LM | 7B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 0 | 3.5 |
Llama 2 Chat AYB | 13B | Accurate | Not Accurate | Accurate | Not Accurate | Not Accurate | Accurate | 0 | 3 |
Minstral OmniMix | 11B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 0 | 3.5 |
MLewdBoros LRSGPT 2Char | 13B | Accurate | Not Accurate | Not Accurate | Not Accurate | Accurate | Accurate | 2 | 1 |
OpenBuddy Llama2 v13.2 | 70B | Accurate | Accurate | Accurate | Accurate | Not Accurate | Accurate | 1 | 4 |
Stellar Bright | 70B | Accurate | Partly | Not Accurate | Accurate | Accurate | Accurate | 1 | 3.5 |
SynthIA v2.0 | 7B | Accurate | Not Accurate | Not Accurate | Accurate | Not Accurate | Accurate | 0 | 3 |
SynthIA v3.0 | 7B | Accurate | Not Accurate | Partly | Accurate | Partly | Accurate | 2 | 2 |
Thespis v0.4 | 13B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 1 | 2.5 |
U-Amethyst | 20B | Accurate | Not Accurate | Not Accurate | Not Accurate | Not Accurate | Accurate | 0 | 2 |
Wizard Vicuna Uncensored | 7B | Accurate | Not Accurate | Partly | Not Accurate | Not Accurate | Accurate | 0 | 2.5 |
Wizard Vicuna Uncensored | 13B | Accurate | Not Accurate | Accurate | Not Accurate | Partly | Accurate | 1 | 2.5 |
Wizard Vicuna Uncensored | 30B | Accurate | Not Accurate | Accurate | Accurate | Not Accurate | Accurate | 0 | 4 |
WizardLM 1.0 Uncensored CodeLlama | 34B | Accurate | Not Accurate | Not Accurate | Accurate | Not Accurate | Accurate | 0.5 | 2.5 |
Zephyr Beta | 7B | Accurate | Not Accurate | Partly | Accurate | Not Accurate | Accurate | 0 | 3.5 |
The prowess of Large Language Models (LLMs) often shines in the most human of tasks: composing an eloquent email or crafting a captivating social media narrative. Yet, beyond these creative exploits lies a core question that's less about artistry and more about integrity: how reliable are these models when it comes to factual precision?
My foray into the assessment of LLMs pivoted on this very inquiry. I set aside the inherently subjective elements—such as the perennial debate over the “best” hashtags—to focus squarely on verifiable truths. From a pool of 25 probing questions, I distilled a suite of 6 stringent benchmarks designed to scrutinize the models’ commitment to accuracy.
The questions ranged from the procedural intricacies of finance (“What is a chargeback?”) to the vestiges of history (“How many children did Cleopatra of ancient Egypt have?”). They spanned the gamut of knowledge, including etymology (“Hundred was derived from what Norse number?”), elementary arithmetic (“If 1+1=2 and 1*1=1, then what is 1/1?”), presidential lineage (“Who was the 7th President of the United States and how many children did he have?”), and even astronomy (“How many planets are in our solar system?”).
But accuracy wasn’t the sole barometer of competence — clarity is also important. Thus, points were deducted for responses marred by typographical blunders or other glaring errors, reflecting the stringent standards expected of any tool meant to augment our daily lives.
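For anyone curious how the totals in the leaderboard above were tallied, here's a minimal sketch of the rubric: one point per accurate answer, half a point for a partly accurate one, minus any deductions for typos or other glaring errors. The `score` helper is purely illustrative, not part of any published tool.

```python
# Points awarded per answer rating.
POINTS = {"Accurate": 1.0, "Partly": 0.5, "Not Accurate": 0.0}

def score(answers, deductions=0.0):
    """Sum per-question points and subtract any deductions.

    answers: list of 'Accurate' / 'Partly' / 'Not Accurate' labels,
    one per benchmark question.
    """
    return sum(POINTS[a] for a in answers) - deductions

# Example: Athena v4's row from the table above.
athena_v4 = ["Accurate", "Partly", "Not Accurate",
             "Not Accurate", "Not Accurate", "Accurate"]
print(score(athena_v4))  # 2.5
```

Running the same helper against any other row of the table reproduces its Total column.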
Join me as I unravel the findings from this deep dive into the fact-finding faculties of LLMs, and let’s discover together just how much we can trust these digital intellects to not only emulate human creativity but also to echo our penchant for precision.
What is a chargeback?
Simply put, a chargeback is when the credit card provider (e.g., Visa, MasterCard, or your bank) reverses a disputed transaction. It's like a forced refund. When a chargeback happens, the disputed funds are held from the business until the card issuer works things out and decides what to do. If the bank rules against you, those funds are returned to the cardholder. If the bank rules in your favor, they'll send the disputed funds back to you.
How many children did Cleopatra of ancient Egypt have?
The answer is four. Her eldest son, Caesarion, was born on June 23, 47 BC; his father was Julius Caesar. After Julius Caesar died, Cleopatra eventually began a relationship with Marc Antony, and together they had three children. Their first two were the twins Alexander Helios and Cleopatra Selene, born in 40 BC. After the twins, Cleopatra gave birth to her fourth and final child (her third with Marc Antony), a son she named Ptolemy Philadelphus. He was born in August or September of 36 BC, meaning he was only about 6 years old when his mother died.
Hundred was derived from what Norse number?
The short answer would be hundrað.
Basically, the word hundred comes from Old English hundred, from Proto-West Germanic *hundarad, from Proto-Germanic *hundaradą (“hundred”); some forms are remodeled on Old Norse hundrað.
If 1+1=2 and 1*1=1, then what is 1/1?
The short answer is 1. 1 divided by 1 = 1.
Who was the 7th President of the United States, and how many children did he have?
Andrew Jackson was the 7th President of the United States.
Andrew and Rachel never had any biological children of their own, yet there were always children living at The Hermitage.
When Rachel's brother and his wife had twin boys, the Jacksons adopted one of them as their own in 1809. Named Andrew Jackson Jr., he grew up at The Hermitage while remaining close to his biological brother and parents, now his cousin, aunt, and uncle. He went on to marry and have five children of his own.
Jackson also became the legal guardian to a number of other children, including a Native American orphan Jackson found in battle.
How many planets are in our solar system?
8. The eight planets are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.