lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
Hi, it seems there is a problem with lm_eval when 'max_length' is not set for some tasks (at least GEM/wiki_lingua_en). When I leave 'max_length' at its default value, I...
It says `Multi-lingual ROUGE is unsupported as general token splitting is absent from [rouge-score](https://github.com/google-research/google-research/tree/master/rouge). For multi-lingual tasks, please ignore rouge metrics until this is resolved. NOTE: English works as intended.`,...
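The limitation is easy to reproduce. A minimal sketch, assuming the `rouge-score` package: its default tokenizer lowercases and keeps only `[a-z0-9]` runs, so non-Latin scripts tokenize to nothing and score zero even for identical strings.

```python
# Minimal sketch of why multilingual ROUGE is unsupported: rouge-score's
# default tokenizer drops every non-[a-z0-9] character, so non-Latin text
# produces empty token lists and an f-measure of 0.0.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"])

# English works as intended: identical strings score ~1.0.
print(scorer.score("the cat sat", "the cat sat"))

# Identical Bengali strings score 0.0 because every character is stripped
# before n-grams are counted.
print(scorer.score("বিড়াল বসে আছে", "বিড়াল বসে আছে"))
```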
lm_eval.list_model_apis() not found
Hey, I'm trying to evaluate bloom-1b7 on a translation task with this command ` python main.py --model_api_name hf-causal --model_args pretrained=bigscience/bloom-1b7 --task_name flores_101_mt_fewshot_en2bn --device cuda:1 ` But I got this error ```...
Location: https://github.com/bigscience-workshop/lm-evaluation-harness/blob/master/lm_eval/models/huggingface.py#L460 @jon-tow I'm not sure if special tokens should be included as part of the target sequence when doing the LL computation.
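For context on the question, a minimal sketch (the target string is just an example) of where the two choices diverge: encoding the target with and without special tokens can yield different token lists, and summing log-probs over the longer one also scores the special tokens.

```python
# Sketch: does the target span for the loglikelihood computation include
# tokenizer special tokens? Compare the two encodings directly.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloom-1b7")

target = " Paris"
with_special = tok(target, add_special_tokens=True)["input_ids"]
without_special = tok(target, add_special_tokens=False)["input_ids"]

# Whether these differ depends on the tokenizer (e.g. whether it prepends a
# BOS or appends an EOS); when they do, the loglikelihood assigned to the
# target changes accordingly.
print(with_special, without_special)
```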
The "--use_cache" argument only seems to be caching the model and not the predictions (contrarily to what is indicated in the readme). I am missing something here, or is this...
Add xnli
Adding xnli to lm-evaluation-harness
It's confusing that BLEU scores are on a 0-100 scale while ROUGE scores are 0-1 in this repo; all scores should use either 0-100 or 0-1, probably the former.
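A hypothetical normalization helper illustrating the proposal; the metric names and which metrics live on which scale are assumptions here, not the repo's actual registry.

```python
# Hypothetical sketch: report every metric on a 0-100 scale by rescaling the
# ones known (by assumption) to live on 0-1.
UNIT_SCALE_METRICS = {"rouge1", "rouge2", "rougeL"}  # 0-1 in this repo (assumed)

def to_percent(metric_name: str, value: float) -> float:
    """Return `value` on a 0-100 scale regardless of its native range."""
    if metric_name.lower() in UNIT_SCALE_METRICS:
        return value * 100.0
    return value  # e.g. BLEU is already 0-100

print(to_percent("rougeL", 0.427))  # 42.7
print(to_percent("bleu", 31.5))     # 31.5
```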
- Adds support for multilingual ROUGE scoring by providing language-specific tokenization via `nltk`.
- Adds a `code_to_pycountry_lang` utility that maps ISO codes to `pycountry.db.Language` objects for robust language name parsing....
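A rough sketch of the pieces described above, assuming `pycountry` and `nltk` (with the `punkt` tokenizer data installed); the exact names and behavior in the PR may differ.

```python
# Sketch: resolve an ISO 639 code to a pycountry Language, then use the
# resolved English name to pick an nltk word tokenizer for language-aware
# ROUGE tokenization.
import pycountry
from nltk.tokenize import word_tokenize  # requires nltk's "punkt" data

def code_to_pycountry_lang(code: str):
    """Map an ISO 639-1/639-3 code (e.g. "fr" or "fra") to a pycountry Language."""
    attr = "alpha_2" if len(code) == 2 else "alpha_3"
    lang = pycountry.languages.get(**{attr: code})
    if lang is None:
        raise ValueError(f"Unknown language code: {code!r}")
    return lang

lang = code_to_pycountry_lang("fr")
tokens = word_tokenize("Le chat s'est assis sur le tapis.", language=lang.name.lower())
print(lang.name, tokens)
```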