DeepSpeed
checking process_group before merging bucket ranges (#3521)
Recreating previous pull request: https://github.com/microsoft/DeepSpeed/pull/3522
Fixed #3521
@tjruwase Here's the new PR with the same patch as before. Sorry again.
@clumsy, thanks for recreating the PR. Your contribution is greatly appreciated.
Please let me know if more work is required for this change to get merged, @tjruwase.
Sorry, this dropped from my mind. No more work required. I have set auto-merge once CI completes. Thanks for the reminder.
Hi @tjruwase, it looks like the seemingly unrelated TestHybridEngineTextGen test keeps failing. Is this the reason why this change cannot be merged?
This test failure is preventing the auto-merge. But I don't think the test is related to your changes, so I have restarted the CI. Let's see what happens this time.
@tjruwase there's some weird CUDA OOM going on: a test (TestCompression.test_conv1d_convertion) that my changes don't even touch is failing. Is the testing instance being shared by multiple CI workflows?
Reducing the micro batch size to try and avoid the OOM. Please let me know if there's any concern with this, @tjruwase. Just trying to get this fix merged.
Still no luck, @tjruwase. Is there a known issue with CUDA OOM in DeepSpeed tests? I only added a few layers to SimpleMoEModel, which is not that widely used, yet unrelated tests fail too. Do test parameters need any tuning?
Apologies, CI seems quite unstable lately. We are taking a closer look.
Ok, looks like the checks ran fine this time, @tjruwase.