renhouxing

Results: 2 comments by renhouxing

One possible reason is that the local mcore model does not support flash-attn; see https://github.com/NVIDIA/Megatron-LM/blob/core_v0.6.0/megatron/core/models/gpt/gpt_layer_specs.py#L53

@loadams I also encountered the same problem. More experiments:

- deepspeed==0.12.4, zero-2, multi-node: N (grad_norm is always 1.0 and loss is 0)
- deepspeed==0.12.4, zero-2, one-node: Y
- deepspeed==0.12.4, zero-3, multi-node: Y
- deepspeed==0.12.4, zero-3,...
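For anyone trying to reproduce the comparison, the runs above differ only in ZeRO stage and node count, so the config side can be reduced to one switch. Below is a minimal sketch, not the config actually used in these experiments: `make_ds_config` is a hypothetical helper and the micro-batch size is a placeholder value. The only DeepSpeed keys used are `train_micro_batch_size_per_gpu` and `zero_optimization.stage`.

```python
import json


def make_ds_config(zero_stage: int) -> dict:
    """Build a minimal DeepSpeed config dict for the given ZeRO stage.

    Hypothetical helper for illustration; all values are placeholders,
    not the settings from the runs reported in this thread.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError(f"unsupported ZeRO stage: {zero_stage}")
    return {
        # Placeholder batch size; adjust to the real training setup.
        "train_micro_batch_size_per_gpu": 1,
        # The only field that changes between the zero-2 and zero-3 runs.
        "zero_optimization": {"stage": zero_stage},
    }


# One config per stage, so everything else stays identical across runs.
configs = {f"zero-{stage}": make_ds_config(stage) for stage in (2, 3)}

if __name__ == "__main__":
    for name, cfg in configs.items():
        print(name, json.dumps(cfg))
```

Keeping everything except the stage fixed makes it easier to tell whether the zero-2 multi-node failure comes from the ZeRO stage itself or from some other setting.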