maxtext
Megatron-style TFLOPs calculation
@rwitten this is a draft.
This type of change would be specific to a few transformer models (e.g., Gemma, Llama, GPT). It wouldn't work with MoE or some newer architectures.
The alternative, walking through the train step and calculating the FLOPs layer by layer, seemed like a very intrusive change to me.
What do you think?
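For reference, here is a minimal sketch of the closed-form estimate I mean by "Megatron style". It is not the code in this PR. It assumes a GPT-style decoder with a standard 4x MLP and full multi-head attention, so it is only approximate for models with gated MLPs or grouped-query attention. The function names and parameters (`megatron_train_flops`, `tflops_per_device`, `recompute`) are made up for illustration.

```python
def megatron_train_flops(batch, seq_len, num_layers, hidden, vocab, recompute=False):
    """Estimate FLOPs for one training step of a GPT-style decoder.

    Assumptions (hypothetical helper, not a MaxText API):
      - each layer's forward pass costs 24*B*s*h^2 (matmuls) + 4*B*s^2*h (attention)
      - the logit layer's forward pass costs 2*B*s*h*V
      - backward costs twice the forward pass
      - with activation recomputation, each layer runs one extra forward pass
    """
    layer_fwd = 24 * batch * seq_len * hidden**2 + 4 * batch * seq_len**2 * hidden
    logits_fwd = 2 * batch * seq_len * hidden * vocab
    layer_passes = 4 if recompute else 3  # fwd + 2x bwd (+ 1 recomputed fwd)
    return layer_passes * num_layers * layer_fwd + 3 * logits_fwd


def tflops_per_device(flops_per_step, step_seconds, num_devices):
    """Convert FLOPs per step into achieved TFLOP/s per device."""
    return flops_per_step / (step_seconds * num_devices * 1e12)
```

With `recompute=True` this expands to the familiar `96*B*s*l*h^2 * (1 + s/(6h) + V/(16*l*h))` form. Because it only needs the config values, it avoids touching the train step, which is also why it can't cover MoE or other non-standard layers.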
Made the changes as requested in the meeting @rwitten
cc @rwitten following up on this