
truthfulqa_mc2 is NaN, while truthfulqa_mc1 is 1.00

chi2liu opened this issue 2 years ago • 5 comments

I fine-tuned a model based on llama-2-hf and ran the evaluation with the command below, and truthfulqa_mc2 comes out as NaN while truthfulqa_mc1 is 1.00.

What does that mean?

```
python main.py --model hf-causal-experimental --model_args pretrained=../mamba-gpt-7b-v2 --tasks anli_r1,anli_r2,anli_r3,arc_challenge,arc_easy,boolq,hellaswag,openbookqa,piqa,record,rte,truthfulqa_mc,wic,winogrande --device cuda:0
```

hf-causal-experimental (pretrained=../mamba-gpt-7b-v2), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task          | Version | Metric   | Value  | Stderr   |
|---------------|--------:|----------|-------:|----------|
| anli_r1       |       0 | acc      | 0.3340 | ± 0.0149 |
| anli_r2       |       0 | acc      | 0.3340 | ± 0.0149 |
| anli_r3       |       0 | acc      | 0.3350 | ± 0.0136 |
| arc_challenge |       0 | acc      | 0.2270 | ± 0.0122 |
|               |         | acc_norm | 0.2270 | ± 0.0122 |
| arc_easy      |       0 | acc      | 0.2508 | ± 0.0089 |
|               |         | acc_norm | 0.2508 | ± 0.0089 |
| boolq         |       1 | acc      | 0.3783 | ± 0.0085 |
| hellaswag     |       0 | acc      | 0.2504 | ± 0.0043 |
|               |         | acc_norm | 0.2504 | ± 0.0043 |
| openbookqa    |       0 | acc      | 0.2760 | ± 0.0200 |
|               |         | acc_norm | 0.2760 | ± 0.0200 |
| piqa          |       0 | acc      | 0.4951 | ± 0.0117 |
|               |         | acc_norm | 0.4951 | ± 0.0117 |
| record        |       0 | f1       | 0.1186 | ± 0.0032 |
|               |         | em       | 0.1151 | ± 0.0032 |
| rte           |       0 | acc      | 0.5271 | ± 0.0301 |
| truthfulqa_mc |       1 | mc1      | 1.0000 | ± 0.0000 |
|               |         | mc2      | NaN    | ± NaN    |
| wic           |       0 | acc      | 0.5000 | ± 0.0198 |
| winogrande    |       0 | acc      | 0.4957 | ± 0.0141 |

chi2liu · Jul 31 '23
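
For context on the numbers above: in truthfulqa_mc, mc1 checks whether the single highest-likelihood choice is the gold answer (which is always listed first), and mc2 is the normalized probability mass assigned to the true answers. If the model returns NaN for every log-likelihood, np.argmax lands on index 0 (the first NaN), so mc1 reads as a spurious perfect 1.0, while the mc2 normalization becomes NaN / NaN. The sketch below paraphrases that metric logic rather than quoting the harness source exactly. Note also that every other task in the table sits at chance level, which points at broken model weights rather than a harness bug.

```python
import numpy as np

def mc1(lls):
    # Gold answer is at index 0. np.argmax over all-NaN scores also
    # returns 0, so a NaN-emitting model scores a spurious 1.0 here.
    return float(np.argmax(lls) == 0)

def mc2(lls, labels):
    # `labels` flags each choice as true (1) or false (0), true answers first.
    split = list(labels).index(0)
    p_true = np.exp(np.array(lls[:split]))
    p_false = np.exp(np.array(lls[split:]))
    # With NaN log-likelihoods this becomes NaN / NaN = NaN.
    return float(p_true.sum() / (p_true.sum() + p_false.sum()))
```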

I have the same issue! In my case I had done some operations to change or move the LoRA weights in my code. Have you solved it?

505707566 · Nov 22 '23
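
A note on the LoRA manipulation mentioned above: hand-edited or hand-merged adapter weights are a common way NaNs get into a checkpoint. A minimal sketch of the standard merge path using PEFT's merge_and_unload, with placeholder paths rather than the poster's actual setup:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Placeholder paths -- substitute your own base model and adapter.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights

# Sanity-check the merged weights before saving and evaluating.
assert all(torch.isfinite(p).all() for p in merged.parameters())
merged.save_pretrained("path/to/merged-model")
```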

This issue should be solved in the main branch.

lintangsutawika · Dec 14 '23

@lintangsutawika I used the main branch and the issue is still there. I opened a new issue: https://github.com/EleutherAI/lm-evaluation-harness/issues/1340

hahmad2008 · Jan 23 '24

@lintangsutawika How can this be fixed? Can you share the PR? Thanks.

choco9966 · Apr 23 '24

@choco9966 can you share a public model + sample command that reproduces this issue?

haileyschoelkopf · Apr 26 '24
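
For anyone else landing here: a quick sanity check outside the harness can tell you whether the checkpoint emits NaN logits at all, which is the usual root cause of the mc1 = 1.0 / mc2 = NaN pattern. A minimal sketch (the model path is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "../mamba-gpt-7b-v2"  # illustrative; point this at the checkpoint under test
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path).eval()

ids = tok("Paris is the capital of", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Any NaN here poisons every log-likelihood the harness computes downstream.
print("NaNs in logits:", torch.isnan(logits).any().item())
```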