DeepSpeed
z3 scaled_global_grad_norm: replace get_global_norm with torch.norm
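In effect, the change swaps a Python-level sqrt-of-sum-of-squares reduction for a torch.norm call over the collected norms. The snippet below is only an illustrative sketch of that equivalence, not the actual diff; the helper names are invented for the example.

```python
import torch

def global_norm_python(norms):
    # Reference behaviour of a get_global_norm-style helper: accumulate
    # squared norms as Python floats, then take the square root.
    total = 0.0
    for n in norms:
        total += float(n) ** 2
    return total ** 0.5

def global_norm_torch(norms):
    # Tensor-level equivalent: stack the per-partition norms and reduce them
    # with torch.norm, keeping the computation on the tensors' device.
    return torch.norm(torch.stack([torch.as_tensor(n, dtype=torch.float32) for n in norms]))

if __name__ == "__main__":
    per_partition_norms = [torch.tensor(0.5), torch.tensor(1.2), torch.tensor(3.0)]
    print(global_norm_python(per_partition_norms))        # ~3.2696
    print(global_norm_torch(per_partition_norms).item())  # same value, computed on-device
```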
Changing this line has been associated with several bugs: https://github.com/microsoft/DeepSpeed/issues/5422, https://github.com/microsoft/DeepSpeed/issues/5538
@nelyahu - thoughts on this comment? It seems that the last time this line was modified, users ran into issues.
@loadams, yes - this optimization was already pushed once and reverted due to ds-chat failures in CPU-offload configurations. I did offline debugging of those failures and improved the code change so it will pass. Since then, ds-chat tests were added to the DeepSpeed repo CI, and it is now passing. Are there any other tests (full model training, for example) that do not exist in the CI and could be run manually?
@nelyahu, it's great that you narrowed this down. Do you think a unit test can be added for this case?
@nelyahu - we've stabilized the CI, thoughts on adding this test?
@loadams Oh, sorry - I missed the last comment. Sure, yes, we can add a UT that covers it, but I cannot address it immediately. I will update this PR once we have a unit test.
@loadams / @tjruwase as requested, I made sure the regression discussed here will be covered by a unit test. I used TestZeroPartialOffloadConfigSweep and added gradient_clipping to it so it goes through the problematic flow. I reproduced the issue with the UT and confirmed it is fixed.
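For reference, a rough sketch of the config shape that exercises this flow - a ZeRO-3 partial-offload setup with gradient_clipping enabled. The real fixture and sweep values live in TestZeroPartialOffloadConfigSweep; the dict below is only illustrative, and it assumes the partial-offload ratio is set via offload_optimizer.ratio as in recent DeepSpeed releases.

```python
# Illustrative DeepSpeed config only - not the actual test fixture.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_clipping": 1.0,  # enables the scaled grad-norm / clipping path under test
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "ratio": 0.5,      # partial offload; example value for the swept parameter
        },
    },
    "fp16": {"enabled": True},
}
```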
@loadams can you re-run the workflows? I suspect the failures are not related to this PR.