DeepSpeed z3 scaled_global_grad_norm: repalce get_global

z3 scaled_global_grad_norm: repalce get_global_norm with torch.norm

Open nelyahu opened this issue 9 months ago • 1 comments

May 07 '24 07:05 nelyahu

Changing this line has been associated with several bugs https://github.com/microsoft/DeepSpeed/issues/5422, https://github.com/microsoft/DeepSpeed/issues/5538

May 23 '24 17:05 jomayeri

Changing this line has been associated with several bugs #5422, #5538

@nelyahu - thoughts on this comment, seems last time this line was modified users ran into issues?

Jun 26 '24 17:06 loadams

Changing this line has been associated with several bugs #5422, #5538

@nelyahu - thoughts on this comment, seems last time this line was modified users ran into issues?

@loadams, Yes - this optimization was already pushed and reverted due to ds-chat (failures in cpu-offload configurations). i did offline debugging of those failure and improved the code change so it will pass. Since then ds-chat tests where added to DeepSpeed repo CI and it is now passing. Are there any other tests (full model training for example), that does not exists in the CI, which can be manually ran?

Jun 26 '24 20:06 nelyahu

i did offline debugging of those failure and improved the code change so it will pass

@nelyahu, it great that you narrowed this down. Do you think a unit test can be added for this case?

Jun 26 '24 23:06 tjruwase

i did offline debugging of those failure and improved the code change so it will pass

@nelyahu, it great that you narrowed this down. Do you think a unit test can be added for this case?

@nelyahu - we've stabilized the CI, thoughts on adding this test?

Jul 09 '24 21:07 loadams

@loadams Oh, Sorry- I missed the last comment. Sure, yes we can add such UT that will cover it. But i cannot address it immediately. I will update this PR once we have a unit test.

Jul 10 '24 02:07 nelyahu

@loadams / @tjruwase as request i make sure the regression was discussed here will be covered by a unit tests. I used TestZeroPartialOffloadConfigSweep, and added it gradient_clipping so it will go though the problematic flow. i reproduced the issue using the UT, and made sure it is fixed.

Jul 18 '24 15:07 nelyahu

@loadams can you re-run workflows? i suspect that the failure are not related to this PR

Jul 23 '24 08:07 nelyahu

DeepSpeed DeepSpeed copied to clipboard

z3 scaled_global_grad_norm: repalce get_global_norm with torch.norm

DeepSpeed
DeepSpeed copied to clipboard