lm-evaluation-harness
truthfulqa_mc2 is NaN, while truthfulqa_mc1 is 1.00
I fine-tuned a model based on llama-2-hf and ran the evaluation with the command below. truthfulqa_mc2 comes out as NaN, while truthfulqa_mc1 is 1.00.
What does that mean?
```
python main.py --model hf-causal-experimental --model_args pretrained=../mamba-gpt-7b-v2 --tasks anli_r1,anli_r2,anli_r3,arc_challenge,arc_easy,boolq,hellaswag,openbookqa,piqa,record,rte,truthfulqa_mc,wic,winogrande --device cuda:0
```

hf-causal-experimental (pretrained=../mamba-gpt-7b-v2), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task | Version | Metric | Value | | Stderr |
|---------------|--------:|----------|-------:|---|-------:|
| anli_r1 | 0 | acc | 0.3340 | ± | 0.0149 |
| anli_r2 | 0 | acc | 0.3340 | ± | 0.0149 |
| anli_r3 | 0 | acc | 0.3350 | ± | 0.0136 |
| arc_challenge | 0 | acc | 0.2270 | ± | 0.0122 |
| | | acc_norm | 0.2270 | ± | 0.0122 |
| arc_easy | 0 | acc | 0.2508 | ± | 0.0089 |
| | | acc_norm | 0.2508 | ± | 0.0089 |
| boolq | 1 | acc | 0.3783 | ± | 0.0085 |
| hellaswag | 0 | acc | 0.2504 | ± | 0.0043 |
| | | acc_norm | 0.2504 | ± | 0.0043 |
| openbookqa | 0 | acc | 0.2760 | ± | 0.0200 |
| | | acc_norm | 0.2760 | ± | 0.0200 |
| piqa | 0 | acc | 0.4951 | ± | 0.0117 |
| | | acc_norm | 0.4951 | ± | 0.0117 |
| record | 0 | f1 | 0.1186 | ± | 0.0032 |
| | | em | 0.1151 | ± | 0.0032 |
| rte | 0 | acc | 0.5271 | ± | 0.0301 |
| truthfulqa_mc | 1 | mc1 | 1.0000 | ± | 0.0000 |
| | | mc2 | NaN | ± | NaN |
| wic | 0 | acc | 0.5000 | ± | 0.0198 |
| winogrande | 0 | acc | 0.4957 | ± | 0.0141 |
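For context on what this score combination can mean: mc1 checks whether the single highest-likelihood answer is the first (true) one, while mc2 is the probability mass assigned to the true answers, normalized over all answers. If the model returns degenerate log-likelihoods (all NaN, or all equal at -inf), `argmax` falls back to index 0, so mc1 comes out as a perfect 1.0 while mc2 divides NaN (or 0/0) by itself and becomes NaN. The near-chance scores on every other task in the table are consistent with such degenerate outputs. Below is a minimal sketch of that failure mode; the all-NaN scores are hypothetical, and the metric code is a paraphrase of the TruthfulQA logic, not the harness's verbatim implementation:

```python
import numpy as np

# Hypothetical degenerate log-likelihoods for one TruthfulQA question
# (e.g., from broken weights after a bad LoRA merge): every answer
# choice scores NaN. split_idx marks where the true answers end.
lls = np.array([np.nan, np.nan, np.nan, np.nan])
split_idx = 2

# mc1 (paraphrased): is the top-scoring answer the first, true one?
# np.argmax over an all-NaN (or all-equal) array returns index 0,
# so a broken model still "earns" a perfect mc1.
mc1 = float(np.argmax(lls) == 0)

# mc2 (paraphrased): probability mass on true answers, normalized
# over all answers. NaN (or 0/0 for all -inf scores) propagates.
p_true, p_false = np.exp(lls[:split_idx]), np.exp(lls[split_idx:])
mc2 = p_true.sum() / (p_true.sum() + p_false.sum())

print(mc1, mc2)  # 1.0 nan
```

So mc1 = 1.00 together with mc2 = NaN usually points at the model producing non-finite or constant scores, rather than at a bug in the metric itself.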
I have the same issue! In my case I had done some operations to change or move the LoRA weights in my code. Have you solved it?
This issue should be solved in the main branch.
@lintangsutawika I used the main branch and the issue is still there. I opened an issue: https://github.com/EleutherAI/lm-evaluation-harness/issues/1340
@lintangsutawika How can this be fixed? Can you share the PR? Thanks
@choco9966 can you share a public model + sample command that reproduces this issue?