
- Still confused, but our training runs are reasonable, so I gave up trying to guess theirs.
- Yeah, I was confused about whether it was 128 or 256.
- I'm...

Hi again. I am struggling with this and made a reproduction for you to look at: https://gist.github.com/cinjon/de9a22f57cfa0dc9ccb2afc255a8093e. The main problem is the set of results below, which show only rough reproductions on gemma-27b, ...

To be clear, it's not the `bfloat16` in the gist either - I get roughly the same result with `float32` too.
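
For reference, the comparison boils down to something like the sketch below (a simplified stand-in for the gist, not the gist itself; the checkpoint name and prompt are placeholders):

```python
# Minimal sketch of the bfloat16-vs-float32 comparison described above.
# The checkpoint and prompt are placeholders; see the gist for the real setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b"  # placeholder; substitute the gist's checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The capital of France is", return_tensors="pt")

logits = {}
for dtype in (torch.bfloat16, torch.float32):
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    model.eval()
    with torch.no_grad():
        # Upcast both sets of logits to float32 so the comparison itself
        # is apples-to-apples.
        logits[dtype] = model(**inputs).logits.float()
    del model  # a 27B model is large; free it before loading the next dtype

# If the mismatch were purely a bfloat16 artifact, this gap would be large;
# as noted above, the result is roughly the same under both dtypes.
print((logits[torch.bfloat16] - logits[torch.float32]).abs().max())
```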

Thanks! How should I think about the explicit casts in the Hugging Face repo, then? For example, these in modeling_gemma:
https://github.com/huggingface/transformers/blob/1bd604d11c405dfb8b78bda4062d88fc75c17de0/src/transformers/models/gemma/modeling_gemma.py#L62
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma/modeling_gemma.py#L1087
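
Roughly, the pattern I'm asking about looks like this (my paraphrase of the linked lines, not the actual modeling_gemma code; the class name here is made up):

```python
# Sketch of the upcast-then-downcast pattern behind those explicit casts:
# do the numerically sensitive work in float32, then return in the input dtype.
import torch
import torch.nn as nn

class RMSNormSketch(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Explicit cast up: the variance and rsqrt are computed in float32
        # even when the surrounding model runs in bfloat16.
        x_f32 = x.float()
        normed = x_f32 * torch.rsqrt(x_f32.pow(2).mean(-1, keepdim=True) + self.eps)
        # Explicit cast back down to the caller's dtype on the way out.
        return (normed * (1.0 + self.weight.float())).type_as(x)
```

In other words: how are these localized float32 islands supposed to interact with running the rest of the model in bfloat16?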