
mixtral-8x7b: Reference Implementation Accuracy Failure on H200

mrmhodak opened this issue on Jan 07 '25 · 8 comments

When running the reference implementation on H200, I see an accuracy failure:

| Metric | Target Score | H200 Reference Implementation | Percentage Diff |
|---|---|---|---|
| rouge1 | 45.5989 | 45.127 | 1.034893386 |
| rouge2 | 23.3526 | 22.9785 | 1.601962951 |
| rougeL | 30.4608 | 30.4806 | 0.065001576 |
| gsm8k | 73.66 | 74.06 | 0.543035569 |
| mbxp | 60.16 | 60.22 | 0.099734043 |
| tokens per sample | 144.84 | 283.5 | 95.73322287 |
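
For reference, the Percentage Diff column is consistent with the relative difference against the target score. A minimal sketch reproducing it from the table values above:

```python
# Minimal sketch: reproduce the "Percentage Diff" column as a relative
# difference against the target score. Values are taken from the table above.
targets = {"rouge1": 45.5989, "rouge2": 23.3526, "rougeL": 30.4608,
           "gsm8k": 73.66, "mbxp": 60.16, "tokens_per_sample": 144.84}
measured = {"rouge1": 45.127, "rouge2": 22.9785, "rougeL": 30.4806,
            "gsm8k": 74.06, "mbxp": 60.22, "tokens_per_sample": 283.5}

for metric, target in targets.items():
    diff = abs(measured[metric] - target) / target * 100
    print(f"{metric}: {diff:.6f}%")

# tokens_per_sample comes out at ~95.73%, i.e. nearly double the target,
# which suggests a generation/stopping issue rather than scoring noise.
```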

mrmhodak · Jan 07 '25

@pgmpablo157321 @nvzhihanj @arjunsuresh: Any comments?

mrmhodak · Jan 07 '25

Hi @mrmhodak, we are running the full accuracy run for this, but it won't finish until Thursday.

arjunsuresh · Jan 07 '25

We did the dataset update for Mixtral this round (for the EOS issue). Were you running with the latest dataset and the latest settings (i.e., min_output_len=2)? We will launch a local run to verify as well.
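
For context, the EOS fix pairs the updated dataset with a minimum output length of two tokens, so a generation cannot terminate on an immediate EOS. A minimal sketch of what that setting corresponds to, shown here with Hugging Face transformers' `GenerationConfig` as an assumed stand-in; the reference implementation may pass the equivalent option through its own harness:

```python
# Hedged sketch: the equivalent of min_output_len=2 in a Hugging Face
# GenerationConfig. The token-count cap and EOS id below are assumptions
# for illustration, not the benchmark's actual settings.
from transformers import GenerationConfig

gen_cfg = GenerationConfig(
    max_new_tokens=1024,  # assumption: the benchmark's real cap may differ
    min_new_tokens=2,     # EOS cannot be emitted as the first generated token
    eos_token_id=2,       # assumption: Mixtral's </s> token id
)
print(gen_cfg)
```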

nvzhihanj · Jan 07 '25

@nvzhihanj: Yes, everything is the latest, freshly downloaded with rclone according to the latest instructions.

mrmhodak · Jan 07 '25

@arjunsuresh @nvzhihanj @pgmpablo157321: Any update on this?

mrmhodak · Jan 10 '25

I was able to re-run the standalone script and double-check the accuracy of the model:

```
Evaluating GSM8K score...
EM: 0.7366, correct: 3683 / 5000, gen_token_per_sample: 129.9604
Evaluating OpenOrca score...
OpenOrca score: {'rouge1': np.float64(45.5989), 'rouge2': np.float64(23.3526), 'rougeL': np.float64(30.4608), 'rougeLsum': np.float64(42.5396)}, gen_token_per_sample: 205.8656
Evaluating MBXP score...
100%|██████████| 5000/5000 [02:33<00:00, 32.50it/s]
Processed 5000 in 153.89411109898356s
60.16% pass@1
{'cpp': 381, 'typescript': 438, 'ruby': 419, 'python': 492, 'php': 809, 'javascript': 469}  out of  {'cpp': 743, 'typescript': 868, 'ruby': 846, 'python': 863, 'php': 846, 'javascript': 834}
gen_tokens_per_sample: 98.7026
```
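
As a sanity check, the per-language MBXP counts in the log above do reproduce the aggregate pass@1:

```python
# Sanity check: the aggregate MBXP pass@1 follows from the per-language
# counts printed in the log above.
passed = {"cpp": 381, "typescript": 438, "ruby": 419,
          "python": 492, "php": 809, "javascript": 469}
total = {"cpp": 743, "typescript": 868, "ruby": 846,
         "python": 863, "php": 846, "javascript": 834}

pass_at_1 = sum(passed.values()) / sum(total.values()) * 100
print(f"{pass_at_1:.2f}% pass@1")  # 60.16%, matching the log
```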

The bug must be in the reference implementation, FYI @pgmpablo157321; I will check the standalone script into the repo later. One thing: please make sure you use the checkpoint downloaded from the MLCommons cloud, not the public one.
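
One way to confirm which checkpoint is actually in use is to hash the local weight files and compare the digests against those of the MLCommons download. A minimal sketch; the directory path is hypothetical:

```python
# Hedged sketch: fingerprint local model weight files so two checkpoints
# can be compared. The path below is hypothetical; substitute the location
# of your downloaded checkpoint.
import hashlib
from pathlib import Path

def fingerprint(model_dir: str) -> dict[str, str]:
    digests = {}
    for f in sorted(Path(model_dir).glob("*.safetensors")):
        h = hashlib.sha256()
        with f.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                h.update(chunk)
        digests[f.name] = h.hexdigest()
    return digests

print(fingerprint("/models/mixtral-8x7b-instruct-v0.1"))  # hypothetical path
```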

nvzhihanj · Jan 13 '25

I added the reference standalone scripts in https://github.com/mlcommons/inference/pull/2029 and formalized the Docker workflow. For the reference implementation, @pgmpablo157321, can you help investigate the discrepancy between the standalone script and the existing code?

nvzhihanj · Jan 13 '25

@nvzhihanj Working on this

pgmpablo157321 · Jan 13 '25