LLMonitor Benchmarks



Traditional LLM benchmarks have drawbacks: they quickly become part of training datasets, and they are hard to relate to real-world use cases.

I made this as an experiment to address these issues: the dataset is dynamic (it changes every week) and is composed of crowdsourced, real-world prompts.

We then use GPT-4 to grade each model's response against a set of rubrics (more details on the about page). The prompt dataset is easily explorable.
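For illustration, here is a minimal sketch of what such a GPT-4 grading call could look like with the OpenAI Python SDK. The rubric wording, scoring scale, and function names are assumptions made for this example, not the code actually used by this site.

```python
# Sketch of rubric-based grading with the OpenAI Python SDK (v1.x).
# The rubric text, scoring scale, and helper names are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score the assistant's answer from 0 to 10 for each criterion: "
    "correctness, completeness, and adherence to the user's instructions. "
    "Reply with only the three integers separated by commas."
)

def grade_response(prompt: str, model_answer: str) -> list[int]:
    """Ask GPT-4 to grade one model answer against the rubric."""
    result = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {
                "role": "user",
                "content": f"User prompt:\n{prompt}\n\nModel answer:\n{model_answer}",
            },
        ],
    )
    text = result.choices[0].message.content
    return [int(part) for part in text.split(",")]
```

A per-criterion numeric reply keeps the grader's output easy to parse and aggregate into a single leaderboard score.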

Everything is stored in a Postgres database, and this page shows the raw results.
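As a rough sketch of what that storage layer might look like (the schema, column names, and sample row below are assumptions for illustration, not the actual LLMonitor database):

```python
# Minimal sketch of persisting graded results to Postgres with psycopg2.
# Table layout, column names, and the sample row are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=benchmarks user=postgres")

with conn, conn.cursor() as cur:
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS results (
            id        SERIAL PRIMARY KEY,
            model     TEXT NOT NULL,
            prompt    TEXT NOT NULL,
            answer    TEXT NOT NULL,
            score     INTEGER NOT NULL,
            graded_at TIMESTAMPTZ DEFAULT now()
        )
        """
    )
    cur.execute(
        "INSERT INTO results (model, prompt, answer, score) VALUES (%s, %s, %s, %s)",
        ("some-model", "an example crowdsourced prompt", "the model's answer", 7),
    )
    # A leaderboard like the one below is essentially this aggregation:
    cur.execute(
        "SELECT model, AVG(score)::int AS avg_score "
        "FROM results GROUP BY model ORDER BY avg_score DESC"
    )
    for model, avg_score in cur.fetchall():
        print(model, avg_score)
```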


| Rank | Model | Score |
|------|-------|-------|
| 1 | GPT 4 03/14 (Legacy) | 93 |
| 2 | GPT 4 | 89 |
| 3 | GPT 3.5 Turbo Instruct | 84 |
| 4 | GPT 3.5 Turbo | 81 |
| 5 | GPT 3.5 Turbo 03/01 (Legacy) | 79 |
| 6 | Claude v2 | 68 |
| 7 | Falcon Chat (180B) | 67 |
| 8 | Hermes Llama2 13B | 66 |
| 8 | Claude v1 | 66 |
| 9 | Jurassic 2 Ultra | 61 |
| 9 | ReMM SLERP L2 13B | 61 |
| 9 | Synthia 70B | 61 |
| 10 | PaLM 2 Bison (Code Chat) | 60 |
| 10 | Jurassic 2 Mid | 60 |
| 10 | Claude Instant v1 | 60 |
| 10 | LLaMA-2-Chat (70B) | 60 |
| 11 | Mythalion 13B | 59 |
| 12 | Phind CodeLlama 34B v2 | 57 |
| 12 | PaLM 2 Bison | 57 |
| 12 | Mistral 7B Instruct v0.1 | 57 |
| 13 | MythoMax-L2 (13B) | 56 |
| 14 | command | 55 |
| 15 | Guanaco (65B) | 51 |
| 15 | Airoboros L2 70B | 51 |
| 16 | Vicuna v1.3 (13B) | 50 |
| 16 | LLaMA-2-Chat (13B) | 50 |
| 16 | LLaMA-2-Chat (7B) | 50 |
| 17 | command-nightly | 47 |
| 18 | Chronos Hermes (13B) | 45 |
| 19 | MPT-Chat (7B) | 43 |
| 19 | Guanaco (33B) | 43 |
| 20 | Vicuna v1.3 (7B) | 41 |
| 21 | MPT-Chat (30B) | 40 |
| 21 | Falcon Instruct (40B) | 40 |
| 22 | Alpaca (7B) | 39 |
| 23 | Pythia-Chat-Base (7B) | 38 |
| 23 | Code Llama Instruct (13B) | 38 |
| 23 | RedPajama-INCITE Chat (7B) | 38 |
| 24 | GPT-NeoXT-Chat-Base (20B) | 34 |
| 24 | Code Llama Instruct (34B) | 34 |
| 25 | StarCoderChat Alpha (16B) | 33 |
| 25 | command-light | 33 |
| 26 | Weaver 12k | 32 |
| 26 | Falcon Instruct (7B) | 32 |
| 27 | Koala (13B) | 31 |
| 28 | Jurassic 2 Light | 30 |
| 28 | Guanaco (13B) | 30 |
| 29 | Code Llama Instruct (7B) | 24 |
| 29 | RedPajama-INCITE Chat (3B) | 24 |
| 30 | Dolly v2 (12B) | 23 |
| 31 | Dolly v2 (7B) | 21 |
| 32 | Dolly v2 (3B) | 17 |
| 33 | Open-Assistant StableLM SFT-7 (7B) | 10 |
| 34 | Open-Assistant Pythia SFT-4 (12B) | 7 |