checking process_group before merging bucket ranges (#3521)

Open clumsy opened this issue 2 years ago • 7 comments

Recreating previous pull request: https://github.com/microsoft/DeepSpeed/pull/3522

Fixed #3521

clumsy avatar May 19 '23 14:05 clumsy

@tjruwase Here's the new PR with the same patch as before, sorry again.

clumsy avatar May 19 '23 18:05 clumsy

@clumsy, thanks for recreating the PR. It is a greatly appreciated contribution.

tjruwase avatar May 19 '23 18:05 tjruwase

Please let me know if more work is required for this change to get merged, @tjruwase.

clumsy avatar May 31 '23 20:05 clumsy

Sorry, this slipped my mind. No more work is required. I have set it to auto-merge once CI completes. Thanks for the reminder.

tjruwase avatar May 31 '23 21:05 tjruwase

Hi @tjruwase, it looks like the seemingly unrelated TestHybridEngineTextGen test keeps failing. Is this the reason why this change cannot be merged?

clumsy avatar Jun 08 '23 16:06 clumsy

This test failure is preventing the auto-merge. But I don't think the test is related to your changes, so I have restarted the CI. Let's see what happens this time.

tjruwase avatar Jun 08 '23 16:06 tjruwase

@tjruwase there's some weird CUDA OOM going on: a test that my changes don't even touch (TestCompression.test_conv1d_convertion) fails. Is the testing instance being shared by multiple CI workflows?

clumsy avatar Jun 16 '23 13:06 clumsy

Reducing the micro batch size to try and avoid the OOM. Please let me know if there's any concern with this, @tjruwase. Just trying to get this fix merged.

clumsy avatar Jun 28 '23 14:06 clumsy

Still no luck, @tjruwase. Is there a known issue with CUDA OOM in DeepSpeed tests? I only added a few layers to SimpleMoEModel, which is not that widely used, yet unrelated tests fail too. Do the test parameters need any tuning?

clumsy avatar Jun 28 '23 19:06 clumsy

Apologies, CI seems quite unstable lately. We are taking a closer look.

tjruwase avatar Jun 28 '23 20:06 tjruwase

Ok, looks like the checks ran fine this time, @tjruwase.

clumsy avatar Jun 29 '23 13:06 clumsy