
mixtral-8x7b: Reference Implementation Accuracy Failure on H200

mrmhodak opened this issue on Jan 07 '25 · 8 comments

When running the reference implementation on H200, I see an accuracy failure:

| Metric | Target Score | H200 Reference Implementation | Percentage Diff |
|---|---|---|---|
| rouge1 | 45.5989 | 45.127 | 1.034893386 |
| rouge2 | 23.3526 | 22.9785 | 1.601962951 |
| rougeL | 30.4608 | 30.4806 | 0.065001576 |
| gsm8k | 73.66 | 74.06 | 0.543035569 |
| mbxp | 60.16 | 60.22 | 0.099734043 |
| tokens per sample | 144.84 | 283.5 | 95.73322287 |
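
For reference, the Percentage Diff column is consistent with the relative difference against the target score. A minimal sketch reproducing it from the table values above:

```python
# Minimal sketch: reproduce the "Percentage Diff" column as a relative
# difference against the target score. Values are taken from the table above.
targets = {"rouge1": 45.5989, "rouge2": 23.3526, "rougeL": 30.4608,
           "gsm8k": 73.66, "mbxp": 60.16, "tokens_per_sample": 144.84}
measured = {"rouge1": 45.127, "rouge2": 22.9785, "rougeL": 30.4806,
            "gsm8k": 74.06, "mbxp": 60.22, "tokens_per_sample": 283.5}

for metric, target in targets.items():
    diff = abs(measured[metric] - target) / target * 100
    print(f"{metric}: {diff:.6f}%")

# tokens_per_sample comes out at ~95.73%, i.e. nearly double the target,
# which suggests a generation/stopping issue rather than scoring noise.
```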

mrmhodak · Jan 07 '25

@pgmpablo157321 @nvzhihanj @arjunsuresh: Any comments?

mrmhodak · Jan 07 '25

Hi @mrmhodak, we are running the full accuracy run for this, but it won't finish until Thursday.

arjunsuresh · Jan 07 '25

We did the dataset update for Mixtral this round (for the EOS issue). Were you running with the latest dataset and the latest settings (i.e., min_output_len=2)? We will launch a local run to verify as well.
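
For context, the EOS fix pairs the updated dataset with a minimum output length of two tokens, so a generation cannot terminate on an immediate EOS. A minimal sketch of what that setting corresponds to, shown here with Hugging Face transformers' `GenerationConfig` as an assumed stand-in; the reference implementation may pass the equivalent option through its own harness:

```python
# Hedged sketch: the equivalent of min_output_len=2 in a Hugging Face
# GenerationConfig. The token-count cap and EOS id below are assumptions
# for illustration, not the benchmark's actual settings.
from transformers import GenerationConfig

gen_cfg = GenerationConfig(
    max_new_tokens=1024,  # assumption: the benchmark's real cap may differ
    min_new_tokens=2,     # EOS cannot be emitted as the first generated token
    eos_token_id=2,       # assumption: Mixtral's </s> token id
)
print(gen_cfg)
```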

nvzhihanj · Jan 07 '25

@nvzhihanj: Yes, everything is the latest, freshly downloaded with rclone according to the latest instructions.

mrmhodak · Jan 07 '25

@arjunsuresh @nvzhihanj @pgmpablo157321: Any update on this?

mrmhodak · Jan 10 '25

I was able to re-run the standalone script and double-check the accuracy of the model:

```
Evaluating GSM8K score...
EM: 0.7366, correct: 3683 / 5000, gen_token_per_sample: 129.9604
Evaluating OpenOrca score...
OpenOrca score: {'rouge1': np.float64(45.5989), 'rouge2': np.float64(23.3526), 'rougeL': np.float64(30.4608), 'rougeLsum': np.float64(42.5396)}, gen_token_per_sample: 205.8656
Evaluating MBXP score...
100%|██████████| 5000/5000 [02:33<00:00, 32.50it/s]
Processed 5000 in 153.89411109898356s
60.16% pass@1
{'cpp': 381, 'typescript': 438, 'ruby': 419, 'python': 492, 'php': 809, 'javascript': 469}  out of  {'cpp': 743, 'typescript': 868, 'ruby': 846, 'python': 863, 'php': 846, 'javascript': 834}
gen_tokens_per_sample: 98.7026
```
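
As a sanity check, the per-language MBXP counts in the log above do reproduce the aggregate pass@1:

```python
# Sanity check: the aggregate MBXP pass@1 follows from the per-language
# counts printed in the log above.
passed = {"cpp": 381, "typescript": 438, "ruby": 419,
          "python": 492, "php": 809, "javascript": 469}
total = {"cpp": 743, "typescript": 868, "ruby": 846,
         "python": 863, "php": 846, "javascript": 834}

pass_at_1 = sum(passed.values()) / sum(total.values()) * 100
print(f"{pass_at_1:.2f}% pass@1")  # 60.16%, matching the log
```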

The bug must be in the reference implementation, FYI @pgmpablo157321; I will check the standalone script into the repo later. One thing: please make sure you use the checkpoint downloaded from the MLCommons cloud, not the public one.
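
One way to confirm which checkpoint is actually in use is to hash the local weight files and compare the digests against those of the MLCommons download. A minimal sketch; the directory path is hypothetical:

```python
# Hedged sketch: fingerprint local model weight files so two checkpoints
# can be compared. The path below is hypothetical; substitute the location
# of your downloaded checkpoint.
import hashlib
from pathlib import Path

def fingerprint(model_dir: str) -> dict[str, str]:
    digests = {}
    for f in sorted(Path(model_dir).glob("*.safetensors")):
        h = hashlib.sha256()
        with f.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                h.update(chunk)
        digests[f.name] = h.hexdigest()
    return digests

print(fingerprint("/models/mixtral-8x7b-instruct-v0.1"))  # hypothetical path
```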

nvzhihanj · Jan 13 '25

I added the reference standalone scripts in https://github.com/mlcommons/inference/pull/2029 and formalized the Docker workflow. For the reference implementation, @pgmpablo157321, can you help investigate the discrepancy between the standalone script and the existing code?

nvzhihanj · Jan 13 '25

@nvzhihanj Working on this

pgmpablo157321 · Jan 13 '25