alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
I am running with 8 A40 GPUs and I think it should be fast. I set up the environment and ran ``` alpaca_eval evaluate_from_model --model_configs 'robin-v2-7b' --annotators_config 'claude' ``` and...
according to `git status`, no leaderboard file was updated.
```
              win_rate  standard_error  n_total
minotaur-13b     67.64            1.64      805
```
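For context on the `standard_error` column above, here is a minimal sketch of how a win rate and its standard error can be computed from per-example preferences, assuming `standard_error` is the standard error of the mean preference (1.0 = win, 0.5 = tie, 0.0 = loss); this is an illustration, not AlpacaEval's exact implementation:

```python
import math

def win_rate_stats(preferences):
    """Return (win_rate_pct, standard_error_pct, n) from per-example
    preferences, where each entry is 1.0 (win), 0.5 (tie), or 0.0 (loss)."""
    n = len(preferences)
    mean = sum(preferences) / n
    # Standard error of the mean, using the sample (n-1) variance.
    var = sum((p - mean) ** 2 for p in preferences) / (n - 1)
    se = math.sqrt(var / n)
    return mean * 100, se * 100, n

# Toy example: 3 wins, 1 tie, 1 loss out of 5 comparisons.
rate, se, n = win_rate_stats([1.0, 1.0, 1.0, 0.5, 0.0])
```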
Model outputs and GPT-4 eval results are here: https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.2-alpaca-eval/tree/main
I want to propose adding a version signature to AlpacaEval a la [sacreBLEU signatures](https://github.com/mjpost/sacrebleu?tab=readme-ov-file#version-signatures) and explicit instructions for reporting scores to improve reproducibility. For those unfamiliar with the sacreBLEU metric,...
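To make the proposal concrete, a sacreBLEU-style signature could be a single pipe-delimited string capturing everything needed to reproduce a score. The field names below (annotator, baseline, dataset, version) are purely illustrative, not an existing AlpacaEval API:

```python
def alpaca_eval_signature(annotator, baseline, dataset, version):
    """Build a hypothetical sacreBLEU-style version signature string
    encoding the evaluation configuration for reporting alongside scores."""
    fields = [
        f"annotator:{annotator}",
        f"baseline:{baseline}",
        f"dataset:{dataset}",
        f"version:{version}",
    ]
    return "AlpacaEval|" + "|".join(fields)

# Example (config names are illustrative):
sig = alpaca_eval_signature(
    "weighted_alpaca_eval_gpt4_turbo",
    "gpt4_turbo",
    "alpaca_eval_gpt4_baseline",
    "0.6",
)
```

A reported score would then carry its signature, so readers can tell at a glance which annotator, baseline, and dataset version produced it.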
GPT-4 is very expensive when we have to run hundreds of experiments for scientific studies. I wonder whether you have tried using Llama3-70b, which performs comparably to the older version...
Hey Team, We're running some experiments with Mistral 7B ORPO and variants, but found that using GPT-4-1106-preview as the baseline + OpenAI GPT-4 judgment produces overly high results: ``` INFO:root:Not saving...
We would like to add [Aligner-2B+GPT-4 Turbo (04/09)](https://github.com/AlignInc/aligner-replication) to AlpacaEval 2.0. It is a reproduction of the paper [Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction](https://arxiv.org/pdf/2402.02416.pdf). Thank you for such...
It would be really nice if Microsoft's new [Phi 3 models](https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3) could be added to the AlpacaEval Leaderboard.
Hello, I am creating this PR to share an example of evaluating a local model via API calls (vLLM server). I find this approach can be quite useful when: -...
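A rough sketch of the setup described in this PR, shown as a command fragment; the model name and port are placeholders, and the exact alpaca_eval config keys needed to point the annotator/model at the local endpoint may differ from what the PR actually uses:

```shell
# Launch vLLM's OpenAI-compatible server (model name is illustrative).
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --port 8000

# Point OpenAI-style clients at the local server instead of api.openai.com.
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=EMPTY  # vLLM does not check the key by default
```

With the server up, any tooling that speaks the OpenAI chat/completions API can query the local model, which avoids per-token API costs during large evaluation sweeps.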