Do you feel the need for speed, speed, speed, speed?

Navigating the intricate world of artificial intelligence requires a nuanced understanding of many variables, and the speed of an AI model is no exception. It’s not simply a measure of how fast the AI can think but also how swiftly the underlying hardware can respond.

When we talk about speed in AI, it’s a symbiotic dance between software and hardware. A response that leaps into existence in a brisk 1.2 seconds on my machine could saunter or sprint on yours, influenced by the unique orchestra of your computer’s hardware.

It’s essential to bear this in mind when we venture into the realm of timing tests — they are inherently personal. The stopwatch starts not just on the AI but on the system that cradles it.

In my quest to test the agility of Large Language Models (LLMs), the machine in question is no average computing workhorse. It’s equipped with a GeForce RTX 3090, boasting a substantial 24GB of VRAM muscle. The heart of this system is an AMD Ryzen 9 5900X processor, a silicon brain that ticks at a base clock speed of 3.7 GHz and can surge up to 4.8 GHz when the task demands. With 64MB of L3 cache, it’s primed to dispatch the hefty demands of modern AI tasks with calculated precision.
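
If you’d like to see how your own rig stacks up before running similar tests, a quick sketch along these lines will report the GPU and CPU your system exposes. It assumes PyTorch is installed and is just one of many ways to check:

```python
# Report the GPU and CPU this machine exposes before benchmarking.
# Assumes PyTorch is installed; falls back gracefully if no CUDA GPU is found.
import os
import platform
import torch

if torch.cuda.is_available():
    gpu = torch.cuda.get_device_properties(0)
    print(f"GPU: {gpu.name} with {gpu.total_memory / 1024**3:.1f} GB of VRAM")
else:
    print("No CUDA-capable GPU detected")

print(f"CPU: {platform.processor()}")
print(f"Logical CPU cores: {os.cpu_count()}")
```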

As we delve into the performance of LLMs, remember the stage is set not just by the coded intellect of the AI but equally by the electronic sinews of the machine that empowers it.

All speed tests were conducted using the following prompt: “Brainstorm ten potential titles for a romantic comedy about two coworkers who fall in love while working at a tech startup.”
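
Each run boiled down to sending that prompt to a locally hosted model and timing the response. As a rough illustration rather than the exact harness I used, here’s a minimal sketch that assumes a local OpenAI-compatible completions endpoint; the URL, model name, and max_tokens value are placeholders you’d swap for your own setup:

```python
# Minimal timing sketch: send the benchmark prompt to a local,
# OpenAI-compatible endpoint and compute tokens per second.
# The endpoint URL and model name below are illustrative placeholders.
import time
import requests

PROMPT = ("Brainstorm ten potential titles for a romantic comedy about two "
          "coworkers who fall in love while working at a tech startup.")

start = time.perf_counter()
response = requests.post(
    "http://localhost:5000/v1/completions",
    json={"model": "local-model", "prompt": PROMPT, "max_tokens": 512},
    timeout=600,
)
elapsed = time.perf_counter() - start

data = response.json()
# Most OpenAI-compatible servers report completion_tokens in the usage block.
tokens = data.get("usage", {}).get("completion_tokens", 0)
print(f"Output generated in {elapsed:.2f} seconds "
      f"({tokens / elapsed:.2f} tokens/s, {tokens} tokens)")
```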


The LLM Speed Leaderboard

| Model | Parameters | Tokens/s | Tokens | Output Generated (s) | Context (tokens) |
|---|---|---|---|---|---|
| SynthIA v3.0 | 7B | 53.68 | 274 | 5.10 | 48 |
| MLewdBoros LRSGPT 2Char | 13B | 52.09 | 225 | 4.32 | 48 |
| Llama 2 Chat AYB | 13B | 49.64 | 103 | 2.08 | 48 |
| Thespis v0.4 | 13B | 49.44 | 106 | 2.14 | 48 |
| Wizard Vicuna Uncensored | 7B | 41.13 | 105 | 2.55 | 48 |
| SynthIA v2.0 | 7B | 37.08 | 92 | 2.48 | 46 |
| Athena v2 | 13B | 36.84 | 106 | 2.88 | 48 |
| Casual LM | 7B | 35.21 | 71 | 2.02 | 41 |
| Athena v4 | 13B | 34.77 | 84 | 2.42 | 48 |
| Wizard Vicuna Uncensored | 13B | 34.66 | 74 | 2.13 | 48 |
| U-Amethyst | 20B | 34.30 | 148 | 4.31 | 48 |
| Zephyr Beta | 7B | 29.25 | 331 | 11.32 | 46 |
| Minstral OmniMix | 11B | 25.49 | 146 | 5.73 | 46 |
| Wizard Vicuna Uncensored | 30B | 23.54 | 88 | 3.74 | 48 |
| Airoboros 3.1.2 | 34B | 19.81 | 97 | 4.90 | 48 |
| WizardLM 1.0 Uncensored CodeLlama | 34B | 19.29 | 236 | 12.24 | 48 |
| OpenBuddy Llama2 v13.2 | 70B | 1.82 | 215 | 118.16 | 48 |
| Stellar Bright | 70B | 1.49 | 93 | 62.52 | 49 |

For each LLM speed test, you’ll see results that say something like, “Output generated in 4.32 seconds (52.09 tokens/s, 225 tokens, context 48),” but what does that data actually mean? It’s a snapshot of the performance metrics for that specific request. Here’s what each part means (a short parsing sketch follows the list):

  • 4.32 seconds: The total time it took for the model to generate the output after receiving the input prompt.
  • 52.09 tokens/s: The average rate at which the model processed and generated tokens (a token can be a part of a word, a whole word, or even punctuation, depending on the model’s tokenization scheme).
  • 225 tokens: The total number of tokens that make up the output generated by the model.
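
If you want to tabulate a pile of these log lines yourself, a small parsing sketch along these lines (assuming the exact format quoted above) pulls each figure out and sanity-checks the reported rate:

```python
# Parse a speed-test log line into its individual metrics.
# Assumes the exact format quoted above; adjust the pattern if yours differs.
import re

line = "Output generated in 4.32 seconds (52.09 tokens/s, 225 tokens, context 48)"

pattern = (r"Output generated in ([\d.]+) seconds "
           r"\(([\d.]+) tokens/s, (\d+) tokens, context (\d+)\)")
match = re.search(pattern, line)
if match:
    seconds, tokens_per_s, tokens, context = match.groups()
    print(f"time={seconds}s  rate={tokens_per_s} tok/s  "
          f"output={tokens} tokens  context={context} tokens")
    # Sanity check: tokens / seconds should roughly equal the reported rate.
    print(f"recomputed rate: {int(tokens) / float(seconds):.2f} tok/s")
```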

Now, the term context refers to the amount of information (in tokens) the model has considered from the input to generate the output. In this case, “context 48” means the model took into account the last 48 tokens from the provided prompt to understand and complete the task. This is essentially the window of the most recent information the model uses to make predictions and generate a response.

The “context” is crucial because it determines how much of the prior text the model can “see” and use to produce coherent, relevant output. If the context is too small, the model may not have enough information to produce a quality response. If it’s too large, it may include information that isn’t immediately relevant to the current output, although most modern LLMs can handle quite a large context window effectively.

These tokens in the “context” are the most recent tokens the model is using to generate its response. There’s a limit to how much context these models can consider, known as the model’s “context window” or “attention window,” and for many models, this typically ranges from 512 to 2048 tokens.
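
To make the idea concrete, here’s a small sketch that trims a prompt down to its most recent tokens, which is effectively what a model does when the input exceeds its context window. It uses the GPT-2 tokenizer from Hugging Face’s transformers purely as an example, and the 48-token window simply mirrors the tests above:

```python
# Illustration of a context window: keep only the most recent N tokens.
# Uses the GPT-2 tokenizer purely as an example; any tokenizer works the same way.
from transformers import AutoTokenizer

CONTEXT_WINDOW = 48  # tokens the model is allowed to "see"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt = ("Brainstorm ten potential titles for a romantic comedy about two "
          "coworkers who fall in love while working at a tech startup.")

token_ids = tokenizer.encode(prompt)
print(f"Prompt is {len(token_ids)} tokens long")

# If the prompt were longer than the window, only the last 48 tokens would count.
visible_ids = token_ids[-CONTEXT_WINDOW:]
print(tokenizer.decode(visible_ids))
```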

The optimal number of tokens in the context is task-dependent. For detailed and complex discussions, a larger context is typically better. For quick, simple queries, a smaller context may suffice. The design of LLMs usually finds a balance, ensuring that the context window is large enough for a wide range of tasks without being so large as to require prohibitive computational resources.