HuggingFace Gemma-2-9b/27b MMLU scores incorrect
Hi, I was able to verify the MMLU score for HuggingFace gemma-2-9b-it to within 0.2 points. However, for gemma-2-27b-it, the score I get (52.3% on all) is way off. Is there some mistake in the repo there? Or is it particularly sensitive to bfloat16?
Hi @cinjon,
The MMLU scores for the Gemma-2-9B-IT and Gemma-2-27B-IT models are 71.3% and 75.2%, respectively. For further details, please refer to this paper. The performance degradation observed with the Gemma-2-27B-IT model is likely due to its sensitivity to bfloat16 precision settings, which can impact inference quality if not handled properly.
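As a minimal sketch (the checkpoint name and loading path here are assumptions, not your exact setup), the precision can be pinned explicitly at load time so that bfloat16 and float32 runs of the same evaluation can be compared directly:

# Minimal sketch: load a Gemma-2 checkpoint with an explicit dtype so that
# bfloat16 and float32 runs can be compared. The model name is an assumption;
# swap in whichever checkpoint you are evaluating.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-27b-it"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,  # change to torch.bfloat16 to test precision sensitivity
    device_map="auto",
)
model.eval()

Running the same evaluation twice with only the dtype changed should isolate whether precision alone explains the gap.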
For more detailed insights and related discussions, please check the following references: ref1 ref2
Thank you.
Hi again. I am struggling with this and made a reproduction for you to look at: https://gist.github.com/cinjon/de9a22f57cfa0dc9ccb2afc255a8093e.
The main problem is the set of results below, which show a rough reproduction on gemma-2-27b, slight degradation on gemma-2-27b-it, slight degradation on gemma-2-9b, and a terrible result on gemma-2-9b-it; a sketch of the kind of scoring involved follows the results. What am I doing wrong? Thanks.
1. python -m huggingface_test_gemma_base_mmlu --model_name="google/gemma-2-9b"
--> all 0.7057399230878793
2. python -m huggingface_test_gemma_base_mmlu --model_name="google/gemma-2-9b-it"
--> all 0.6387266771115225
3. python -m huggingface_test_gemma_base_mmlu --model_name="google/gemma-2-27b-it"
--> all 0.7518159806295399
4. python -m huggingface_test_gemma_base_mmlu --model_name="google/gemma-2-27b"
--> all 0.7517447657028913
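For context, here is a rough sketch of the kind of next-token-logit scoring typically used for MMLU multiple choice (this is illustrative only; the prompt format, answer tokens, and helper name are assumptions, not necessarily what the gist does):

# Illustrative sketch of MMLU-style multiple-choice scoring via next-token
# logits. Prompt format, answer tokens, and the helper name are assumptions
# for illustration, not the exact code in the gist.
import torch

def score_question(model, tokenizer, question, choices, answer_idx):
    """Return True if the model's highest-logit answer letter matches answer_idx."""
    letters = ["A", "B", "C", "D"]
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, choices))
        + "\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Compare only the logits of the candidate answer tokens " A" .. " D".
    letter_ids = [
        tokenizer(f" {letter}", add_special_tokens=False).input_ids[-1]
        for letter in letters
    ]
    prediction = int(torch.argmax(next_token_logits[letter_ids]))
    return prediction == answer_idx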
To be clear, it's not the "bfloat16" in the gist either; the results are roughly the same with "float32" too.
Hi,
Apologies for the late reply, and thank you so much for bringing this to our attention. Please let me know if you are still getting the same results, or if the issue has been resolved.
Thanks.