instruct-eval
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
A few errors occur with [instructionBERT](https://huggingface.co/Bachstelze/instructionBERT): `python main.py drop --model_name seq_to_seq --model_path Bachstelze/instructionBERT` > Traceback (most recent call last): File "main.py", line 98, in Fire(main) File "/home/hilsenbek/.conda/envs/instruct-eval/lib/python3.8/site-packages/fire/core.py", line 141,...
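The traceback above is cut off, so the root cause is hard to pin down from this excerpt. One way to narrow it down is to check whether the checkpoint loads with the seq-to-seq auto classes outside the harness; a minimal sketch, assuming only `transformers` is installed and that the `seq_to_seq` model_name maps onto these classes:

```python
# Sketch: check whether the checkpoint loads with the seq2seq auto classes
# outside the evaluation harness, to separate model issues from harness issues.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "Bachstelze/instructionBERT"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
print(type(model).__name__)  # which concrete architecture was resolved

# A generation attempt; if this also fails, the problem is in the checkpoint
# or its config rather than in the evaluation code.
inputs = tokenizer("What is the capital of France?", return_tensors="pt")
print(tokenizer.decode(
    model.generate(**inputs, max_new_tokens=16)[0], skip_special_tokens=True
))
```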
How can we use the scripts in a Colab notebook? There are installation problems both with and without conda, and also after a restart. > WARNING: The following packages were previously...
Why isn't the crass script in the examples? Or is there detailed documentation somewhere?
Hi, on a single 4090 GPU with 24GB of memory, the following command causes an out-of-memory error. ```bash python main.py mmlu --model_name llama --model_path huggyllama/llama-7b ``` After that, I try executing the...
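For reference, a LLaMA-7B checkpoint is roughly 13-14 GB in fp16 and about 27 GB in fp32, so on a 24 GB card there is little headroom once activations and the evaluation batch are added, and fp32 loading will not fit at all. The harness appears to expose a `--load_8bit` flag for the llama model_name (check the README and `modeling.py` to confirm); the same effect can be reproduced directly with `transformers` and `bitsandbytes`. A minimal sketch, with the model name taken from the command above:

```python
# Sketch: load LLaMA-7B in 8-bit so the weights take ~7 GB instead of ~13 GB (fp16).
# Requires the `bitsandbytes` and `accelerate` packages alongside `transformers`.
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    load_in_8bit=True,   # quantize linear layers to int8 on load
    device_map="auto",   # place weights on the available GPU
)
```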
Hi there, will it be possible to submit our own model to the leaderboard?
Hi, I found that the prompt generated from the dataset (e.g., MMLU) is not wrapped in the model's prompt template. The performance you'll get out of the model will...
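For instruction-tuned checkpoints such as Alpaca, scores can shift noticeably depending on whether the benchmark question is fed raw or wrapped in the template the model was fine-tuned on. A minimal sketch of wrapping an MMLU-style question in the Alpaca prompt format, for illustration only (the exact template the harness should apply depends on the model):

```python
# Sketch: wrap a raw MMLU-style question in the Alpaca instruction template
# before scoring. The template text is the standard Alpaca (no-input) format;
# other instruction-tuned models expect different wrappers.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def wrap_prompt(question, choices):
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    instruction = f"{question}\n{options}\nAnswer with the letter of the correct option."
    return ALPACA_TEMPLATE.format(instruction=instruction)

print(wrap_prompt("What is the capital of France?",
                  ["Berlin", "Paris", "Madrid", "Rome"]))
```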
Which metric is reported: accuracy, exact match, or F1-score? I cannot find a description in the paper:
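For context, multiple-choice benchmarks such as MMLU are conventionally reported as plain accuracy: the predicted option letter is compared to the gold letter by exact match. A minimal sketch of that scoring, as an illustration only; the repository's own metric code should be checked, especially for generation-style tasks like DROP, which are usually scored with exact match and token-level F1:

```python
# Sketch: exact-match accuracy over predicted option letters, the usual
# metric for multiple-choice benchmarks such as MMLU.
def exact_match_accuracy(predictions, references):
    assert len(predictions) == len(references)
    correct = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

print(exact_match_accuracy(["A", "c", "B"], ["A", "C", "D"]))  # 2/3 correct
```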
### Description Merge arguments and kwargs at https://github.com/declare-lab/instruct-eval/blob/1b4f253076ce6c36309da44d82f2d8b67afc886a/modeling.py#L156 to avoid passing multiple values for the same keyword argument. ### Related Issue https://github.com/declare-lab/instruct-eval/issues/22.
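The underlying failure mode is generic Python: if a parameter is supplied both by the wrapper and again by the caller, the call raises `TypeError: ... got multiple values for argument ...`. Merging the positional arguments into one keyword dict, with the caller's values taking precedence, means every parameter is passed exactly once. A hypothetical sketch of the idea, not the repository's actual `modeling.py` code:

```python
# Sketch of the "multiple values for keyword argument" clash and the merge fix.
import inspect

def generate(prompt, max_new_tokens=32, temperature=1.0):
    return f"{prompt!r} (max_new_tokens={max_new_tokens}, temperature={temperature})"

def call_unsafe(*args, **kwargs):
    # Fails whenever the caller also supplies max_new_tokens, positionally or
    # as a keyword: TypeError: got multiple values for argument 'max_new_tokens'.
    return generate(*args, max_new_tokens=64, **kwargs)

def call_merged(*args, **kwargs):
    # Bind positional args to parameter names, then let the caller's kwargs win,
    # so every parameter reaches generate() exactly once.
    params = list(inspect.signature(generate).parameters)
    merged = {**dict(zip(params, args)), "max_new_tokens": 64, **kwargs}
    return generate(**merged)

print(call_merged("hello", max_new_tokens=16))  # caller's value overrides the 64
```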
Many thanks for your work! I tried exactly the same settings but got different results on MMLU and BBH. The Alpaca-tuned LLaMA always performs worse than the original LLaMA (7B or...
Hi, I tried to evaluate the accuracy of chavinlo/alpaca-native on MMLU. I find the final accuracy is about 36, and I cannot reproduce the reported result of about 41.6. May I ask...