HuggingFace Gemma-2-9b/27b MMLU scores incorrect
Hi, I was able to verify the MMLU score for HuggingFace gemma-2-9b-it to within 0.2 points. However, for gemma-2-27b-it, the score I get (52.3% on all) is way off. Is there some mistake in the repo there? Or is it particularly sensitive to bfloat16?
Hi @cinjon,
The MMLU scores for the Gemma-2-9B-IT and Gemma-2-27B-IT models are 71.3% and 75.2%, respectively. For further details, please refer to this paper. The performance degradation observed with the Gemma-2-27B-IT model is likely due to its sensitivity to bfloat16 precision settings, which can impact inference quality if not handled properly.
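As a minimal sketch (the checkpoint name and loading path here are assumptions, not your exact setup), the precision can be pinned explicitly at load time so that bfloat16 and float32 runs of the same evaluation can be compared directly:

# Minimal sketch: load a Gemma-2 checkpoint with an explicit dtype so that
# bfloat16 and float32 runs can be compared. The model name is an assumption;
# swap in whichever checkpoint you are evaluating.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-27b-it"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,  # change to torch.bfloat16 to test precision sensitivity
    device_map="auto",
)
model.eval()

Running the same evaluation twice with only the dtype changed should isolate whether precision alone explains the gap.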
For more detailed insights and related discussions, please check the following references: ref1 ref2
Thank you.
Hi again. I am struggling with this and made a reproduction for you to look at: https://gist.github.com/cinjon/de9a22f57cfa0dc9ccb2afc255a8093e.
The main problem is the set of results below, which show a rough reproduction on gemma-2-27b, slight degradation on gemma-2-27b-it, slight degradation on gemma-2-9b, and a terrible result on gemma-2-9b-it; a sketch of the kind of scoring involved follows the results. What am I doing wrong? Thanks.
1. python -m huggingface_test_gemma_base_mmlu --model_name="google/gemma-2-9b"
--> all 0.7057399230878793
2. python -m huggingface_test_gemma_base_mmlu --model_name="google/gemma-2-9b-it"
--> all 0.6387266771115225
3. python -m huggingface_test_gemma_base_mmlu --model_name="google/gemma-2-27b-it"
--> all 0.7518159806295399
4. python -m huggingface_test_gemma_base_mmlu --model_name="google/gemma-2-27b"
--> all 0.7517447657028913
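For context, here is a rough sketch of the kind of next-token-logit scoring typically used for MMLU multiple choice (this is illustrative only; the prompt format, answer tokens, and helper name are assumptions, not necessarily what the gist does):

# Illustrative sketch of MMLU-style multiple-choice scoring via next-token
# logits. Prompt format, answer tokens, and the helper name are assumptions
# for illustration, not the exact code in the gist.
import torch

def score_question(model, tokenizer, question, choices, answer_idx):
    """Return True if the model's highest-logit answer letter matches answer_idx."""
    letters = ["A", "B", "C", "D"]
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, choices))
        + "\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Compare only the logits of the candidate answer tokens " A" .. " D".
    letter_ids = [
        tokenizer(f" {letter}", add_special_tokens=False).input_ids[-1]
        for letter in letters
    ]
    prediction = int(torch.argmax(next_token_logits[letter_ids]))
    return prediction == answer_idx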
To be clear, it's not the "bfloat16" in the gist either; the results are roughly the same with "float32" too.
Hi,
Apologies for the late reply, and thank you so much for bringing this to our attention. Please let me know if you are still getting the same results, or if the issue has been resolved.
Thanks.