Xingbo Wu
I have the same issue with 0.14.1 when running a similar training script. Version 0.14.0 works. I cross-checked with torch 2.2.1, 2.2.2 and transformers 4.39.0, 4.39.3. The issue is with...
@tjruwase @mrwyattii `unscale_and_clip_grads` itself was last updated two years ago, so the regression probably comes from elsewhere. It might be the recent change at L2030, which now uses `norm()`: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L2030 The commit: https://github.com/microsoft/DeepSpeed/commit/54c06872647ca60699f752e60ac1643bd05aa63c
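For readers following along, here is a generic sketch of what an unscale-and-clip step does, to show why a change in how the gradient norm is computed would surface there. This is not DeepSpeed's code: the function name `unscale_and_clip`, its parameters, and the `eps` value are assumptions made for illustration, and it works on plain Python floats rather than tensors.

```python
import math


def unscale_and_clip(grads, loss_scale, max_norm, eps=1e-6):
    """Divide grads by the loss scale, then clip them to max_norm (global L2 norm).

    Hypothetical helper for illustration only; not DeepSpeed's implementation.
    """
    # Global L2 norm of the gradients while they are still loss-scaled.
    total_norm = math.sqrt(sum(g * g for g in grads))

    combined_scale = loss_scale
    if max_norm > 0:
        # Ratio of the unscaled norm to the allowed norm.
        # Clamped to >= 1 so gradients are only ever shrunk, never enlarged.
        clip = max((total_norm / loss_scale + eps) / max_norm, 1.0)
        combined_scale = clip * loss_scale

    # One division applies both the unscaling and the clipping.
    return [g / combined_scale for g in grads]
```

Every gradient is divided by a factor derived from `total_norm`, so under this sketch a wrong or non-finite norm would affect all parameters at once, even if the clipping code itself is unchanged.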
Version 0.14.2 still has the same issue.
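To make version reports like the ones in this thread easy to compare, a small snippet along these lines can print the installed versions. It is only a sketch: `report_versions` is a hypothetical helper name, and it assumes the packages are installed under the distribution names `deepspeed`, `torch`, and `transformers`.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages=("deepspeed", "torch", "transformers")):
    """Return a dict mapping each package name to its installed version string."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Record missing packages instead of failing the whole report.
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    for name, ver in report_versions().items():
        print(f"{name}=={ver}")
```

Pasting the output next to the failing command makes it clear which combination was tested.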