
A framework for few-shot evaluation of autoregressive language models.

Results: 15 lm-evaluation-harness issues

Hi, it seems there is a problem with lm_eval when I do not set 'max_length' for some tasks (at least GEM/wiki_lingua_en). When I leave 'max_length' at its default value, I...
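For context, a minimal sketch of one plausible failure mode (the truncated report does not confirm this; the assumption here is a context-window overflow, and the model name is illustrative, not from the issue):

```python
# Speculative sketch: with no explicit max_length, a few-shot prompt for a
# long-document task such as GEM/wiki_lingua_en can exceed the model's
# position-embedding budget. "gpt2" is used only for illustration.
from transformers import AutoConfig, AutoTokenizer

cfg = AutoConfig.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

prompt = "word " * 5000  # stands in for a long few-shot summarization prompt
n_tokens = len(tok(prompt)["input_ids"])

# Feeding more tokens than the model supports raises an indexing error,
# so the harness has to truncate inputs to some max_length first.
print(n_tokens, "tokens vs. model limit of", cfg.max_position_embeddings)
```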

It says `Multi-lingual ROUGE is unsupported as general token splitting is absent from [rouge-score](https://github.com/google-research/google-research/tree/master/rouge). For multi-lingual tasks, please ignore rouge metrics until this is resolved. NOTE: English works as intended.`,...

lm_eval.list_model_apis() not found

Hey, I'm trying to evaluate bloom-1b7 on a translation task with this command ` python main.py --model_api_name hf-causal --model_args pretrained=bigscience/bloom-1b7 --task_name flores_101_mt_fewshot_en2bn --device cuda:1 ` But I got this error ```...

Location: https://github.com/bigscience-workshop/lm-evaluation-harness/blob/master/lm_eval/models/huggingface.py#L460 @jon-tow I'm not sure if special tokens should be included as part of the target sequence when doing the LL computation.
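A minimal sketch of the question being raised, outside the harness's actual code path (the tokenizer choice is illustrative; facebook/opt-125m is used because its tokenizer does insert a special token):

```python
# Sketch of the concern: should the continuation half of a (context,
# continuation) pair include special tokens when its log-likelihood is
# computed? OPT's tokenizer prepends a BOS token when
# add_special_tokens=True, so the two encodings differ.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")

target = " Paris"

with_specials = tok(target, add_special_tokens=True)["input_ids"]
without_specials = tok(target, add_special_tokens=False)["input_ids"]

# The extra leading token would be scored as part of the target's
# log-likelihood even though it is not real target text.
print(with_specials)     # BOS prepended, then the target tokens
print(without_specials)  # target tokens only
```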

The "--use_cache" argument only seems to be caching the model and not the predictions (contrarily to what is indicated in the readme). I am missing something here, or is this...

Adding xnli to lm-evaluation-harness

It's confusing that BLEU scores are 0-100 and ROUGE scores are 0-1 in this repo; I think all scores should be either 0-100 or 0-1, probably the former.

enhancement
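To illustrate the scale mismatch raised above, a small sketch using sacrebleu and rouge-score directly (the harness's own metric wrappers may differ):

```python
# sacrebleu reports BLEU on a 0-100 scale, while rouge-score reports
# F-measures on a 0-1 scale. Rescaling one of them (here, ROUGE up to
# 0-100) would make the two comparable.
import sacrebleu
from rouge_score import rouge_scorer

preds = ["the cat sat on the mat"]
refs = [["the cat sat on the mat"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(preds, refs).score            # ~100.0
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(
    refs[0][0], preds[0]
)["rougeL"].fmeasure                                       # ~1.0

print(f"BLEU: {bleu:.1f}  ROUGE-L: {rouge:.2f}  ROUGE-L x100: {rouge * 100:.1f}")
```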

- Adds support for multilingual ROUGE scoring by providing language-specific tokenization via `nltk`.
- Adds a `code_to_pycountry_lang` utility that maps ISO codes to `pycountry.db.Language` objects for robust language name parsing....
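A hypothetical sketch of what a `code_to_pycountry_lang` helper could look like (the PR's actual implementation is not shown in this listing, so the error handling and code-length dispatch here are assumptions):

```python
# Resolve a 2- or 3-letter ISO 639 code to a pycountry Language object,
# whose .name can then feed nltk's language-aware tokenizers.
import pycountry

def code_to_pycountry_lang(code: str):
    """Map an ISO 639-1/639-3 code (e.g. 'en', 'eng') to a Language."""
    code = code.lower()
    lang = (
        pycountry.languages.get(alpha_2=code)
        if len(code) == 2
        else pycountry.languages.get(alpha_3=code)
    )
    if lang is None:
        raise ValueError(f"Unknown ISO language code: {code!r}")
    return lang

print(code_to_pycountry_lang("en").name)   # English
print(code_to_pycountry_lang("ben").name)  # Bengali
```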