DeepSpeed icon indicating copy to clipboard operation
DeepSpeed copied to clipboard

z3 scaled_global_grad_norm: repalce get_global_norm with torch.norm

Open nelyahu opened this issue 9 months ago • 1 comments

nelyahu avatar May 07 '24 07:05 nelyahu

Changing this line has been associated with several bugs https://github.com/microsoft/DeepSpeed/issues/5422, https://github.com/microsoft/DeepSpeed/issues/5538

jomayeri avatar May 23 '24 17:05 jomayeri

Changing this line has been associated with several bugs #5422, #5538

@nelyahu - thoughts on this comment, seems last time this line was modified users ran into issues?

loadams avatar Jun 26 '24 17:06 loadams

Changing this line has been associated with several bugs #5422, #5538

@nelyahu - thoughts on this comment, seems last time this line was modified users ran into issues?

@loadams, Yes - this optimization was already pushed and reverted due to ds-chat (failures in cpu-offload configurations). i did offline debugging of those failure and improved the code change so it will pass. Since then ds-chat tests where added to DeepSpeed repo CI and it is now passing. Are there any other tests (full model training for example), that does not exists in the CI, which can be manually ran?

nelyahu avatar Jun 26 '24 20:06 nelyahu

i did offline debugging of those failure and improved the code change so it will pass

@nelyahu, it great that you narrowed this down. Do you think a unit test can be added for this case?

tjruwase avatar Jun 26 '24 23:06 tjruwase

i did offline debugging of those failure and improved the code change so it will pass

@nelyahu, it great that you narrowed this down. Do you think a unit test can be added for this case?

@nelyahu - we've stabilized the CI, thoughts on adding this test?

loadams avatar Jul 09 '24 21:07 loadams

@loadams Oh, Sorry- I missed the last comment. Sure, yes we can add such UT that will cover it. But i cannot address it immediately. I will update this PR once we have a unit test.

nelyahu avatar Jul 10 '24 02:07 nelyahu

@loadams / @tjruwase as request i make sure the regression was discussed here will be covered by a unit tests. I used TestZeroPartialOffloadConfigSweep, and added it gradient_clipping so it will go though the problematic flow. i reproduced the issue using the UT, and made sure it is fixed.

nelyahu avatar Jul 18 '24 15:07 nelyahu

@loadams can you re-run workflows? i suspect that the failure are not related to this PR

nelyahu avatar Jul 23 '24 08:07 nelyahu