mixtral-8x7b: Reference Implementation Accuracy Failure on H200
When running the reference implementation on H200, I see an accuracy failure:
| Metric | Target Score | H200 Reference Implementation | Percentage Diff (%) |
|---|---|---|---|
| rouge1 | 45.5989 | 45.127 | 1.03 |
| rouge2 | 23.3526 | 22.9785 | 1.60 |
| rougeL | 30.4608 | 30.4806 | 0.07 |
| gsm8k | 73.66 | 74.06 | 0.54 |
| mbxp | 60.16 | 60.22 | 0.10 |
| tokens per sample | 144.84 | 283.5 | 95.73 |
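(For reference, the percentage-diff column is just the relative deviation of each measured value from its target; a minimal Python sketch of the arithmetic, using the values from the table above:)

```python
# Relative deviation of each H200 result from its target score, in percent.
targets = {"rouge1": 45.5989, "rouge2": 23.3526, "rougeL": 30.4608,
           "gsm8k": 73.66, "mbxp": 60.16, "tokens_per_sample": 144.84}
measured = {"rouge1": 45.127, "rouge2": 22.9785, "rougeL": 30.4806,
            "gsm8k": 74.06, "mbxp": 60.22, "tokens_per_sample": 283.5}

for name, target in targets.items():
    diff_pct = abs(measured[name] - target) / target * 100
    print(f"{name}: {diff_pct:.2f}%")
```

The rouge/gsm8k/mbxp metrics all deviate by under 2%, which makes the ~96% blowup in tokens per sample the obvious outlier.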
@pgmpablo157321 @nvzhihanj @arjunsuresh: Any comments?
Hi @mrmhodak, we are running the full accuracy run for this, but it won't finish until Thursday.
We updated the dataset for Mixtral this round (for the EOS issue). Were you running on the latest dataset with the latest settings (i.e., min_output_len=2)? We will launch a local run to verify as well.
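(For anyone reproducing this, a minimal sketch of what a min_output_len=2 setting amounts to, written against the Hugging Face transformers API for illustration; the actual knob name and plumbing in the reference harness may differ:)

```python
# Hypothetical illustration: force at least 2 generated tokens so a leading
# EOS cannot terminate generation immediately (the EOS issue the dataset
# update addressed). Parameter names below are HF transformers', not
# necessarily those of the reference harness.
from transformers import GenerationConfig

gen_config = GenerationConfig(
    min_new_tokens=2,    # analogous to min_output_len=2
    max_new_tokens=1024, # assumed cap for illustration
)
```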
@nvzhihanj: Yes, all latest, freshly downloaded according to the latest instructions using rclone.
@arjunsuresh @nvzhihanj @pgmpablo157321: Any update on this?
I was able to re-run the standalone script and double-check the accuracy of the model:
```
Evaluating GSM8K score...
EM: 0.7366, correct: 3683 / 5000, gen_token_per_sample: 129.9604
Evaluating OpenOrca score...
OpenOrca score: {'rouge1': np.float64(45.5989), 'rouge2': np.float64(23.3526), 'rougeL': np.float64(30.4608), 'rougeLsum': np.float64(42.5396)}, gen_token_per_sample: 205.8656
Evaluating MBXP score...
100%|██████████| 5000/5000 [02:33<00:00, 32.50it/s]
Processed 5000 in 153.89411109898356s
60.16% pass@1
{'cpp': 381, 'typescript': 438, 'ruby': 419, 'python': 492, 'php': 809, 'javascript': 469} out of {'cpp': 743, 'typescript': 868, 'ruby': 846, 'python': 863, 'php': 846, 'javascript': 834}
gen_tokens_per_sample: 98.7026
```
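(As a sanity check, the MBXP pass@1 above is consistent with the per-language counts in the log; a quick sketch aggregating them, with the numbers copied verbatim:)

```python
# Aggregate per-language pass counts from the MBXP log into overall pass@1.
passed = {"cpp": 381, "typescript": 438, "ruby": 419,
          "python": 492, "php": 809, "javascript": 469}
total = {"cpp": 743, "typescript": 868, "ruby": 846,
         "python": 863, "php": 846, "javascript": 834}

pass_at_1 = sum(passed.values()) / sum(total.values())  # 3008 / 5000
print(f"{pass_at_1:.2%} pass@1")  # -> 60.16% pass@1
```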
FYI @pgmpablo157321, the bug must be in the reference implementation; I will check the standalone script into the repo later. One thing: please make sure you use the checkpoint downloaded from the MLCommons cloud, not the public one.
I added the reference standalone scripts in https://github.com/mlcommons/inference/pull/2029 and formalized the Docker workflow. For the reference implementation, @pgmpablo157321, can you help track down the discrepancy between the standalone script and the existing code?
@nvzhihanj Working on this