DeepSpeed icon indicating copy to clipboard operation
DeepSpeed copied to clipboard

checking process_group before merging bucket ranges (#3521)

Open clumsy opened this issue 2 years ago • 4 comments

A simple check to make sure we compare partition_ids for the same process_group.

Fix #3521

clumsy avatar May 11 '23 20:05 clumsy

@clumsy, thanks for the PR. It looks good. But I am curious if this caused some failure? Also, can you add a unit test for this? Thanks!

tjruwase avatar May 11 '23 22:05 tjruwase

Never mind, I just saw the issue. Looking forward to the unit test. Thanks!

tjruwase avatar May 11 '23 22:05 tjruwase

Sure thing, @tjruwase, I was just able to reproduce this in a unit test with ZeRO2 world_size = ep_size = 4:

i=1 param_id=16 partition_id=0 prev_id=0 MoE:ep_size_4 pg_size=1 pg=0x7f20ac780530>
^-COALESCING-v
i=0 param_id=1 partition_id=0 prev_id=0 dense pg_size=4 pg=0x7f20ac780670>

Now I just need to put a proper assertion and I'll update the PR. Looks like in this case it's going to be a partially reduced gradients for dense parameters.

clumsy avatar May 11 '23 22:05 clumsy

Sorry not the prettiest of unit tests, @tjruwase. But it's the most robust one I was able to come up with: if we ever clamp together partitions for different process_groups using Tensor.narrow() inside average_tensor() this test will fail.

I also parametrized MoE test for different ZeRO stages to cover more ground.

Before fix:

FAILED tests/unit/moe/test_moe.py::TestMoE::test[True-2-2]
FAILED tests/unit/moe/test_moe.py::TestMoE::test[True-2-4]
============================================================================================= 2 failed, 10 passed, 2 deselected, 6 warnings in 162.20s (0:02:42) ==============================================================================================

After fix:

14.73s call     unit/moe/test_moe.py::TestMoE::test[True-2-4]
14.27s call     unit/moe/test_moe.py::TestMoE::test[False-2-4]
14.16s call     unit/moe/test_moe.py::TestMoE::test[False-2-2]
14.07s call     unit/moe/test_moe.py::TestMoE::test[False-1-4]
13.90s call     unit/moe/test_moe.py::TestMoE::test[True-0-2]
13.59s call     unit/moe/test_moe.py::TestMoE::test[True-0-4]
13.31s call     unit/moe/test_moe.py::TestMoE::test[False-1-2]
13.09s call     unit/moe/test_moe.py::TestMoE::test[True-1-4]
12.84s call     unit/moe/test_moe.py::TestMoE::test[False-0-4]
12.58s call     unit/moe/test_moe.py::TestMoE::test[True-2-2]
12.57s call     unit/moe/test_moe.py::TestMoE::test[True-1-2]
12.45s call     unit/moe/test_moe.py::TestMoE::test[False-0-2]

(24 durations < 1s hidden.  Use -vv to show these durations.)
================================================================================================== 12 passed, 2 deselected, 6 warnings in 162.79s (0:02:42) ===================================================================================================

clumsy avatar May 12 '23 20:05 clumsy

Closing since the fix has been merged to master branch. Please correct me if I'm wrong, @tjruwase.

clumsy avatar May 18 '23 18:05 clumsy

@clumsy, I don't believe this was merged. Can you please point me to merged PR? Thanks!

tjruwase avatar May 19 '23 01:05 tjruwase

@clumsy, I don't believe this was merged. Can you please point me to merged PR? Thanks!

My bad, @tjruwase, got confused with the previous PR, then we need to reopen this issue.

clumsy avatar May 19 '23 13:05 clumsy

Sorry, @tjruwase, my new PR ruined this one, recreated this one from scratch (with a new branch instead this time): https://github.com/microsoft/DeepSpeed/pull/3577

clumsy avatar May 19 '23 14:05 clumsy