"When a measure becomes a target, it ceases to be a good measure."
How this works:
- Each week, the highest-rated submitted prompt becomes part of the benchmark dataset.
- Prompts are run against 77 models with a temperature of 0.
- The results are then scored according to rubrics (conditions) automatically by GPT-4. For example, for the Taiwan prompt, the rubrics are:
  - 2 points for mentioning that Taiwan is a (de facto) independent country
  - 1 point for mentioning the CCP's claim on Taiwan
  - 2 points for mentioning that most countries do not officially recognise Taiwan as independent
- score = ( sum of points won / sum of possible points ) * 100
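As a rough illustration of the score formula, here is a minimal Python sketch. The function name, the rubric labels and the idea of passing the satisfied conditions as a set are assumptions made for this example, not the actual benchmark code. In the real pipeline, GPT-4 decides which conditions an answer meets.

```python
def rubric_score(rubric, met):
    """Return the percentage score for one answer.

    rubric: list of (condition, points) pairs (labels here are hypothetical).
    met: set of conditions the grader judged as satisfied.
    """
    possible = sum(points for _, points in rubric)
    if possible == 0:
        raise ValueError("rubric has no points to award")
    won = sum(points for condition, points in rubric if condition in met)
    return won / possible * 100


# Shortened, hypothetical labels for the three Taiwan rubric conditions.
taiwan_rubric = [
    ("de facto independent country", 2),
    ("CCP claim on Taiwan", 1),
    ("not officially recognised by most countries", 2),
]

# An answer that covers the first two conditions but not the third.
score = rubric_score(
    taiwan_rubric,
    {"de facto independent country", "CCP claim on Taiwan"},
)
print(score)
```

In this example the answer wins 3 of 5 possible points, so it scores 60.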
Comments on rubrics:
- Rubrics for each prompt can be seen on their page.
- Using GPT-4 to score the results is imperfect and may introduce bias towards OpenAI models. It also doesn't reward out-of-the-box answers. Ideas welcome here.
- Rubrics are currently added manually by me, but I'm working on a way to crowdsource this.
- Credit for the rubrics idea & more goes to Ali Abid @ Huggingface.
- This is open-source on GitHub and Huggingface
- I used a temperature of 0 and a max token limit of 600 (that's why a lot of answers are cropped). The rest are default settings.
- I made this with a mix of APIs from OpenRouter, TogetherAI, OpenAI, Anthropic, Cohere, Aleph Alpha & AI21.
- This is imperfect. Not all prompts are good for grading. There also seem to be some problems with stop sequences on TogetherAI models.
- Feedback, ideas or say hi: vince [at] llmonitor.com
- Shameless plug: I'm building an open-source observability tool for AI devs.
Edit: as this got popular, I added an email form to receive notifications for future benchmark results (no spam, max 1 email per month).