Remove redundant layer norm operation
In the pre-layernorm version of BERT, applying layernorm to the embeddings is redundant, since the first transformer layer applies layernorm to its input as well.
For reference: https://github.com/NVIDIA/Megatron-LM/blob/19301985dd31c8b612095cbad15bd903e8ddd497/megatron/model/language_model.py#L165
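To illustrate the point, here is a minimal sketch of a pre-layernorm transformer layer (module and parameter names are hypothetical, not the actual DeepSpeed example code): because the layer normalizes its input before attention, the embeddings fed to the first layer are normalized there anyway, so an extra layernorm on the embeddings mostly adds computation.

```python
import torch
import torch.nn as nn

class PreLNTransformerLayer(nn.Module):
    """Minimal pre-layernorm block: LayerNorm is applied to the layer input
    *before* attention, so embeddings entering the first layer are normalized
    here regardless of whether a separate embedding layernorm was applied."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.post_attention_layernorm = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Pre-LN: normalize the incoming hidden states (the embeddings, in the
        # case of the first layer) before the attention sub-block.
        normed = self.input_layernorm(hidden_states)
        attn_out, _ = self.attention(normed, normed, normed, need_weights=False)
        hidden_states = hidden_states + attn_out

        # Second residual branch, also pre-normalized.
        normed = self.post_attention_layernorm(hidden_states)
        hidden_states = hidden_states + self.mlp(normed)
        return hidden_states
```

Since `input_layernorm` re-normalizes whatever it receives, a prior layernorm on the embeddings contributes little beyond extra work, which is why the Megatron-LM reference above omits it.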
I don't have permissions to merge the pull request, so could someone from the DeepSpeed team do that?
Hi Owais,
Thanks again for pointing out this possible bug in the DeepSpeed example. We are discussing it within the team and will merge the fix soon if there is no accuracy impact!
Thanks, Reza