lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
Hi, it seems there is a problem with lm_eval when 'max_length' is not set for some tasks (at least GEM/wiki_lingua_en). When I leave 'max_length' at its default value, I...
It says `Multi-lingual ROUGE is unsupported as general token splitting is absent from [rouge-score](https://github.com/google-research/google-research/tree/master/rouge). For multi-lingual tasks, please ignore rouge metrics until this is resolved. NOTE: English works as intended.`,...
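The limitation is easy to reproduce. A minimal sketch, assuming the `rouge-score` package: its default tokenizer lowercases and keeps only `[a-z0-9]` runs, so non-Latin scripts tokenize to nothing and score zero even for identical strings.

```python
# Minimal sketch of why multilingual ROUGE is unsupported: rouge-score's
# default tokenizer drops every non-[a-z0-9] character, so non-Latin text
# produces empty token lists and an f-measure of 0.0.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"])

# English works as intended: identical strings score ~1.0.
print(scorer.score("the cat sat", "the cat sat"))

# Identical Bengali strings score 0.0 because every character is stripped
# before n-grams are counted.
print(scorer.score("বিড়াল বসে আছে", "বিড়াল বসে আছে"))
```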
lm_eval.list_model_apis() not found
Hey, I'm trying to evaluate bloom-1b7 on a translation task with this command ` python main.py --model_api_name hf-causal --model_args pretrained=bigscience/bloom-1b7 --task_name flores_101_mt_fewshot_en2bn --device cuda:1 ` But I got this error ```...
Location: https://github.com/bigscience-workshop/lm-evaluation-harness/blob/master/lm_eval/models/huggingface.py#L460 @jon-tow I'm not sure if special tokens should be included as part of the target sequence when doing the LL computation.
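For context on the question, a minimal sketch (the target string is just an example) of where the two choices diverge: encoding the target with and without special tokens can yield different token lists, and summing log-probs over the longer one also scores the special tokens.

```python
# Sketch: does the target span for the loglikelihood computation include
# tokenizer special tokens? Compare the two encodings directly.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloom-1b7")

target = " Paris"
with_special = tok(target, add_special_tokens=True)["input_ids"]
without_special = tok(target, add_special_tokens=False)["input_ids"]

# Whether these differ depends on the tokenizer (e.g. whether it prepends a
# BOS or appends an EOS); when they do, the loglikelihood assigned to the
# target changes accordingly.
print(with_special, without_special)
```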
The "--use_cache" argument only seems to be caching the model and not the predictions (contrarily to what is indicated in the readme). I am missing something here, or is this...
Add xnli
Adding xnli to lm-evaluation-harness
It's confusing that BLEU scores are on a 0-100 scale while ROUGE scores are 0-1 in this repo; all scores should use either 0-100 or 0-1, probably the former.
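A hypothetical normalization helper illustrating the proposal; the metric names and which metrics live on which scale are assumptions here, not the repo's actual registry.

```python
# Hypothetical sketch: report every metric on a 0-100 scale by rescaling the
# ones known (by assumption) to live on 0-1.
UNIT_SCALE_METRICS = {"rouge1", "rouge2", "rougeL"}  # 0-1 in this repo (assumed)

def to_percent(metric_name: str, value: float) -> float:
    """Return `value` on a 0-100 scale regardless of its native range."""
    if metric_name.lower() in UNIT_SCALE_METRICS:
        return value * 100.0
    return value  # e.g. BLEU is already 0-100

print(to_percent("rougeL", 0.427))  # 42.7
print(to_percent("bleu", 31.5))     # 31.5
```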
- Adds support for multilingual ROUGE scoring by providing language-specific tokenization via `nltk`.
- Adds a `code_to_pycountry_lang` utility that maps ISO codes to `pycountry.db.Language` objects for robust language name parsing....
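A rough sketch of the pieces described above, assuming `pycountry` and `nltk` (with the `punkt` tokenizer data installed); the exact names and behavior in the PR may differ.

```python
# Sketch: resolve an ISO 639 code to a pycountry Language, then use the
# resolved English name to pick an nltk word tokenizer for language-aware
# ROUGE tokenization.
import pycountry
from nltk.tokenize import word_tokenize  # requires nltk's "punkt" data

def code_to_pycountry_lang(code: str):
    """Map an ISO 639-1/639-3 code (e.g. "fr" or "fra") to a pycountry Language."""
    attr = "alpha_2" if len(code) == 2 else "alpha_3"
    lang = pycountry.languages.get(**{attr: code})
    if lang is None:
        raise ValueError(f"Unknown language code: {code!r}")
    return lang

lang = code_to_pycountry_lang("fr")
tokens = word_tokenize("Le chat s'est assis sur le tapis.", language=lang.name.lower())
print(lang.name, tokens)
```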