Do you feel the need... the need for speed?
Navigating the intricate world of artificial intelligence requires a nuanced understanding of many variables, and the speed of an AI model is no exception. It’s not simply a measure of how fast the AI can think but also how swiftly the underlying hardware can respond.
When we talk about speed in AI, it’s a symbiotic dance between software and hardware. A prompt that leaps into existence in a brisk 1.2 seconds on my machine could saunter or sprint on yours, influenced by the unique orchestra of your computer’s hardware.
It’s essential to bear this in mind when we venture into the realm of timing tests — they are inherently personal. The stopwatch starts not just on the AI but on the system that cradles it.
In my quest to test the agility of Large Language Models (LLMs), the machine in question is no average computing workhorse. It’s equipped with a GeForce RTX 3090, boasting a substantial 24GB of VRAM muscle. The heart of this system is an AMD Ryzen 9 5900X processor, a silicon brain that ticks at a base clock speed of 3.7 GHz and can surge up to 4.8 GHz when the task demands. With 64MB of L3 cache, it’s primed to dispatch the hefty demands of modern AI tasks with calculated precision.
As we delve into the performance of LLMs, remember the stage is set not just by the coded intellect of the AI but equally by the electronic sinews of the machine that empowers it.
All speed tests were conducted using the following prompt: “Brainstorm ten potential titles for a romantic comedy about two coworkers who fall in love while working at a tech startup.”
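The timing figures below come from the inference front end's own report line, but the underlying measurement is simple to reproduce. Here is a minimal sketch, assuming a Hugging Face transformers setup; the model ID is a placeholder rather than one of the exact models tested, and the actual harness used for the leaderboard may differ.

```python
# Minimal timing sketch (assumes transformers + a local GPU; model ID is a placeholder).
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-v0.1"  # placeholder; swap in any local model
PROMPT = ("Brainstorm ten potential titles for a romantic comedy about two "
          "coworkers who fall in love while working at a tech startup.")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[-1]

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=512)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - prompt_len
print(f"Output generated in {elapsed:.2f} seconds "
      f"({new_tokens / elapsed:.2f} tokens/s, {new_tokens} tokens, context {prompt_len})")
```

Tokens per second is simply the number of newly generated tokens divided by the wall-clock time of the generation call, which is exactly what the report lines in the leaderboard express.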
The LLM Speed Leaderboard
| Model | Parameters | Tokens Per Second | Tokens | Output Generated (s) | Context |
|---|---|---|---|---|---|
| SynthIA v3.0 | 7B | 53.68 | 274 | 5.10 | 48 |
| MLewdBoros LRSGPT 2Char | 13B | 52.09 | 225 | 4.32 | 48 |
| Llama 2 Chat AYB | 13B | 49.64 | 103 | 2.08 | 48 |
| Thespis v0.4 | 13B | 49.44 | 106 | 2.14 | 48 |
| Wizard Vicuna Uncensored | 7B | 41.13 | 105 | 2.55 | 48 |
| SynthIA v2.0 | 7B | 37.08 | 92 | 2.48 | 46 |
| Athena v2 | 13B | 36.84 | 106 | 2.88 | 48 |
| Casual LM | 7B | 35.21 | 71 | 2.02 | 41 |
| Athena v4 | 13B | 34.77 | 84 | 2.42 | 48 |
| Wizard Vicuna Uncensored | 13B | 34.66 | 74 | 2.13 | 48 |
| U-Amethyst | 20B | 34.30 | 148 | 4.31 | 48 |
| Zephyr Beta | 7B | 29.25 | 331 | 11.32 | 46 |
| Minstral OmniMix | 11B | 25.49 | 146 | 5.73 | 46 |
| Wizard Vicuna Uncensored | 30B | 23.54 | 88 | 3.74 | 48 |
| Airoboros 3.1.2 | 34B | 19.81 | 97 | 4.90 | 48 |
| WizardLM 1.0 Uncensored CodeLlama | 34B | 19.29 | 236 | 12.24 | 48 |
| OpenBuddy Llama2 v13.2 | 70B | 1.82 | 215 | 118.16 | 48 |
| Stellar Bright | 70B | 1.49 | 93 | 62.52 | 49 |
For each LLM speed test, you’ll see results that say something like, “Output generated in 4.32 seconds (52.09 tokens/s, 225 tokens, context 48),” but what does this data actually mean? It’s a snapshot of the performance metrics for that specific request. Here’s what each part means (with a quick sanity check of the arithmetic after the list):
- 4.32 seconds: The total time it took for the model to generate the output after receiving the input prompt.
- 52.09 tokens/s: The average rate at which the model processed and generated tokens (a token can be a part of a word, a whole word, or even punctuation, depending on the model’s tokenization scheme).
- 225 tokens: The total number of tokens that make up the output generated by the model.
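To see how these three numbers hang together, divide the token count by the elapsed time and you recover the reported tokens-per-second figure (small differences are just rounding). The snippet below is only an illustration of that relationship; the regex simply mirrors the quoted report line and is not part of any particular tool.

```python
import re

# Parse a report line of the form quoted above and sanity-check the math.
line = "Output generated in 4.32 seconds (52.09 tokens/s, 225 tokens, context 48)"
m = re.search(r"in ([\d.]+) seconds \(([\d.]+) tokens/s, (\d+) tokens, context (\d+)\)", line)
seconds, reported_tps, tokens, context = (float(g) for g in m.groups())

print(tokens / seconds)  # ~52.08, matching the reported 52.09 tokens/s once rounded
print(int(tokens), "tokens generated with a context of", int(context), "tokens")
```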
Now, the term context refers to the amount of information (in tokens) the model has considered from the input to generate the output. In this case, “context 48” means the model took into account the last 48 tokens from the provided prompt to understand and complete the task. This is essentially the window of the most recent information the model uses to make predictions and generate a response.
The “context” is crucial because it determines how much of the prior text the model can “see” and use to make coherent and relevant output. If the context is too small, the model may not have enough information to produce a quality response. If it’s too large, there might be information that is not immediately relevant to the current output, although most modern LLMs can handle quite a large context window effectively.
These tokens in the “context” are the most recent tokens the model is using to generate its response. There’s a limit to how much context a model can consider, known as its “context window” or “attention window.” Older models were typically limited to 512 to 2,048 tokens, while many of the models benchmarked here support 4,096 or more.
The optimal number of tokens in the context is task-dependent. For detailed and complex discussions, a larger context is typically better. For quick, simple queries, a smaller context may suffice. The design of LLMs usually finds a balance, ensuring that the context window is large enough for a wide range of tasks without being so large as to require prohibitive computational resources.
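To make the “window of most recent information” idea concrete, here is a small sketch that trims an over-long prompt down to the last N tokens. The tokenizer choice and the 48-token window are stand-ins for illustration, not the exact behaviour of the tools benchmarked above.

```python
from transformers import AutoTokenizer

# Illustration only: a fixed context window effectively keeps the most recent
# tokens of the input. Tokenizer choice and window size are placeholders.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
CONTEXT_WINDOW = 48

long_prompt = "The startup pivots to a new product. " * 40  # longer than the window
token_ids = tokenizer.encode(long_prompt)

visible_ids = token_ids[-CONTEXT_WINDOW:]  # the slice the model actually "sees"
print(f"{len(token_ids)} tokens in the prompt, {len(visible_ids)} inside the window")
print(tokenizer.decode(visible_ids))
```

Anything earlier than that final slice is simply invisible to the model, which is why a too-small window can starve it of the information it needs to stay coherent.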