
A joint community effort to create one central leaderboard for LLMs.

10 llm-leaderboard issues

![image](https://github.com/LudwigStumpp/llm-leaderboard/assets/8592144/fe4b9880-264a-467b-8ce1-78164d6fd773) I cannot find any relevant evaluation in the [linked paper](https://arxiv.org/abs/2203.15556v1)... ![image](https://github.com/LudwigStumpp/llm-leaderboard/assets/8592144/6a05ad1b-eeee-4758-abd6-d9564cf92aa7)

![image](https://github.com/LudwigStumpp/llm-leaderboard/assets/8592144/e656679e-36c4-4b63-9ca7-2a43d236755c) As can be seen from https://github.com/mosaicml/llm-foundry/tree/main/scripts/eval ![image](https://github.com/LudwigStumpp/llm-leaderboard/assets/8592144/1d65e4e8-9ff0-4c26-9c93-4479dce8ceb3)

Except for pass@1 and Elo rating, do the other benchmarks use only `accuracy` for evaluation? Yes, I think this is a real issue, since most evaluation results use their own respective metrics...

Please make the table reader-friendly by freezing the header row and the first column (with the benchmark name) so that they stay in place when the table is...

It would be nice if the list were auto-sorted by a weighted average across all rankings on first page load. I would suggest using TrueSkill, which can rank players...
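The TrueSkill idea above could be prototyped without any dependencies. The sketch below is a simplified two-player, no-draw TrueSkill-style update in pure Python (the real algorithm lives in the `trueskill` package); the model names and match results are hypothetical, and the constants follow TrueSkill's usual defaults.

```python
import math

# Default TrueSkill-style prior: skill ~ N(MU0, SIGMA0^2)
MU0, SIGMA0 = 25.0, 25.0 / 3.0
BETA = SIGMA0 / 2.0  # per-game performance noise

def _pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def update(winner, loser):
    """winner/loser are (mu, sigma) tuples; returns the updated pair."""
    (mu_w, s_w), (mu_l, s_l) = winner, loser
    c = math.sqrt(2.0 * BETA ** 2 + s_w ** 2 + s_l ** 2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)   # how far to shift the means
    w = v * (v + t)         # how much to shrink the variances
    new_w = (mu_w + s_w ** 2 / c * v,
             s_w * math.sqrt(1.0 - s_w ** 2 / c ** 2 * w))
    new_l = (mu_l - s_l ** 2 / c * v,
             s_l * math.sqrt(1.0 - s_l ** 2 / c ** 2 * w))
    return new_w, new_l

# Hypothetical head-to-head result between two models:
model_a, model_b = (MU0, SIGMA0), (MU0, SIGMA0)
model_a, model_b = update(model_a, model_b)  # A beats B
print(model_a[0] > model_b[0])  # True: A's skill estimate is now higher
```

For the leaderboard itself, a common choice is to sort by the conservative estimate `mu - 3 * sigma`, so models with few recorded comparisons (high uncertainty) do not jump straight to the top.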

- A column for censored: yes/no
- A column for model size in GB (0.4 for 400 MB)
- A column for whether the AI mentions it's an AI. (Some models, though I can only...

A source that might be of interest to this project: https://github.com/FreedomIntelligence/LLMZoo

HF [repo](https://huggingface.co/zhiqings/dromedary-65b-lora-delta-v0)

Hi, great work again. Is there any possibility of including these models' scores on the benchmark, if available? Anthropic's Claude models: https://www.anthropic.com/product Cohere's LLMs: https://docs.cohere.com/docs/introduction-to-large-language-models AI21's Jurassic...

Check https://llm-leaderboard.streamlit.app