LLMonitor Benchmarks
Traditional LLM benchmarks have drawbacks: they quickly become part of training datasets, and they are hard to relate to real-world use cases.
I made this as an experiment to address these issues. Here, the dataset is dynamic (it changes every week) and composed of crowdsourced real-world prompts.
I then use GPT-4 to grade each model's response against a set of rubrics (more details on the about page). The prompt dataset is easy to explore.
Everything is stored in a Postgres database, and this page shows the raw results.
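To make the grading step concrete, here is a minimal sketch of what a GPT-4 grader could look like. The `grade` helper, the rubric wording, and the 0 to 10 scale are illustrative assumptions on my part, not the benchmark's actual code; the real rubrics are described on the about page.

```python
# Sketch of a GPT-4 grading call, assuming the openai>=1.0 Python client.
# The rubric text and 0-10 scale below are assumptions, not the site's rubrics.
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def grade(prompt: str, answer: str, rubric: str) -> int:
    """Ask GPT-4 to score one model response against one rubric."""
    grading_prompt = (
        "You are grading a language model's answer.\n"
        f"Prompt: {prompt}\n"
        f"Answer: {answer}\n"
        f"Rubric: {rubric}\n"
        "Reply with a single integer score from 0 to 10."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": grading_prompt}],
        temperature=0,  # keep grading as deterministic as possible
    )
    # Pull the first integer out of the reply; default to 0 if none found.
    match = re.search(r"\d+", completion.choices[0].message.content)
    return int(match.group()) if match else 0


print(grade("What is 2 + 2?", "4", "The answer must be exactly 4."))
```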
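And a minimal sketch of the storage and aggregation side, assuming a hypothetical `results` table with `(model, prompt, answer, score)` columns; the actual schema and score aggregation are not published, so treat this as one plausible shape.

```python
# Sketch of the Postgres side: persist graded results, then aggregate
# per-model scores the way a leaderboard page might. The `results` table,
# DSN, and averaging are assumptions, not the site's actual schema.
import psycopg2

conn = psycopg2.connect("dbname=benchmarks")  # assumed connection string

with conn.cursor() as cur:
    # Store one graded result.
    cur.execute(
        "INSERT INTO results (model, prompt, answer, score) "
        "VALUES (%s, %s, %s, %s)",
        ("gpt-4", "What is 2 + 2?", "4", 10),
    )
    conn.commit()

    # Aggregate per-model scores, highest first.
    cur.execute(
        "SELECT model, ROUND(AVG(score)) AS avg_score "
        "FROM results GROUP BY model ORDER BY avg_score DESC"
    )
    for model, avg_score in cur.fetchall():
        print(model, avg_score)
```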
| Rank | Model | Score |
|---|---|---|
| 1 | GPT 4 03/14 (Legacy) | 93 |
| 2 | GPT 4 | 89 |
| 3 | GPT 3.5 Turbo Instruct | 84 |
| 4 | GPT 3.5 Turbo | 81 |
| 5 | GPT 3.5 Turbo 03/01 (Legacy) | 79 |
| 6 | Claude v2 | 68 |
| 7 | Falcon Chat (180B) | 67 |
| 8 | Hermes Llama2 13B | 66 |
| 8 | Claude v1 | 66 |
| 9 | Jurassic 2 Ultra | 61 |
| 9 | ReMM SLERP L2 13B | 61 |
| 9 | Synthia 70B | 61 |
| 10 | PaLM 2 Bison (Code Chat) | 60 |
| 10 | Jurassic 2 Mid | 60 |
| 10 | Claude Instant v1 | 60 |
| 10 | LLaMA-2-Chat (70B) | 60 |
| 11 | Mythalion 13B | 59 |
| 12 | Phind CodeLlama 34B v2 | 57 |
| 12 | PaLM 2 Bison | 57 |
| 12 | Mistral 7B Instruct v0.1 | 57 |
| 13 | MythoMax-L2 (13B) | 56 |
| 14 | command | 55 |
| 15 | Guanaco (65B) | 51 |
| 15 | Airoboros L2 70B | 51 |
| 16 | Vicuna v1.3 (13B) | 50 |
| 16 | LLaMA-2-Chat (13B) | 50 |
| 16 | LLaMA-2-Chat (7B) | 50 |
| 17 | command-nightly | 47 |
| 18 | Chronos Hermes (13B) | 45 |
| 19 | MPT-Chat (7B) | 43 |
| 19 | Guanaco (33B) | 43 |
| 20 | Vicuna v1.3 (7B) | 41 |
| 21 | MPT-Chat (30B) | 40 |
| 21 | Falcon Instruct (40B) | 40 |
| 22 | Alpaca (7B) | 39 |
| 23 | Pythia-Chat-Base (7B) | 38 |
| 23 | Code Llama Instruct (13B) | 38 |
| 23 | RedPajama-INCITE Chat (7B) | 38 |
| 24 | GPT-NeoXT-Chat-Base (20B) | 34 |
| 24 | Code Llama Instruct (34B) | 34 |
| 25 | StarCoderChat Alpha (16B) | 33 |
| 25 | command-light | 33 |
| 26 | Weaver 12k | 32 |
| 26 | Falcon Instruct (7B) | 32 |
| 27 | Koala (13B) | 31 |
| 28 | Jurassic 2 Light | 30 |
| 28 | Guanaco (13B) | 30 |
| 29 | Code Llama Instruct (7B) | 24 |
| 29 | RedPajama-INCITE Chat (3B) | 24 |
| 30 | Dolly v2 (12B) | 23 |
| 31 | Dolly v2 (7B) | 21 |
| 32 | Dolly v2 (3B) | 17 |
| 33 | Open-Assistant StableLM SFT-7 (7B) | 10 |
| 34 | Open-Assistant Pythia SFT-4 (12B) | 7 |