Remove redundant layer norm operation
In the pre-layernorm version of BERT, applying layernorm to the embeddings is redundant, since the first transformer layer applies layernorm to its input as well.
For reference: https://github.com/NVIDIA/Megatron-LM/blob/19301985dd31c8b612095cbad15bd903e8ddd497/megatron/model/language_model.py#L165
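To illustrate the point, here is a minimal sketch of a pre-layernorm transformer layer (module and parameter names are hypothetical, not the actual DeepSpeed example code): because the layer normalizes its input before attention, the embeddings fed to the first layer are normalized there anyway, so an extra layernorm on the embeddings mostly adds computation.

```python
import torch
import torch.nn as nn

class PreLNTransformerLayer(nn.Module):
    """Minimal pre-layernorm block: LayerNorm is applied to the layer input
    *before* attention, so embeddings entering the first layer are normalized
    here regardless of whether a separate embedding layernorm was applied."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.post_attention_layernorm = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Pre-LN: normalize the incoming hidden states (the embeddings, in the
        # case of the first layer) before the attention sub-block.
        normed = self.input_layernorm(hidden_states)
        attn_out, _ = self.attention(normed, normed, normed, need_weights=False)
        hidden_states = hidden_states + attn_out

        # Second residual branch, also pre-normalized.
        normed = self.post_attention_layernorm(hidden_states)
        hidden_states = hidden_states + self.mlp(normed)
        return hidden_states
```

Since `input_layernorm` re-normalizes whatever it receives, a prior layernorm on the embeddings contributes little beyond extra work, which is why the Megatron-LM reference above omits it.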
I don't have permissions to merge the pull request, so could someone from the DeepSpeed team do that?
Hi Owais,
Thanks again for pointing out this possible bug in the DeepSpeed example. We are discussing it within the team and will merge the fix soon if there is no accuracy impact!
Thanks, Reza