DeepSpeed
DeepSpeed copied to clipboard
checking process_group before merging bucket ranges (#3521)
A simple check to make sure we compare partition_ids for the same process_group.
Fix #3521
@clumsy, thanks for the PR. It looks good. But I am curious if this caused some failure? Also, can you add a unit test for this? Thanks!
Never mind, I just saw the issue. Looking forward to the unit test. Thanks!
Sure thing, @tjruwase, I was just able to reproduce this in a unit test with ZeRO2 world_size = ep_size = 4:
i=1 param_id=16 partition_id=0 prev_id=0 MoE:ep_size_4 pg_size=1 pg=0x7f20ac780530>
^-COALESCING-v
i=0 param_id=1 partition_id=0 prev_id=0 dense pg_size=4 pg=0x7f20ac780670>
Now I just need to put a proper assertion and I'll update the PR. Looks like in this case it's going to be a partially reduced gradients for dense parameters.
Sorry not the prettiest of unit tests, @tjruwase. But it's the most robust one I was able to come up with: if we ever clamp together partitions for different process_groups using Tensor.narrow() inside average_tensor() this test will fail.
I also parametrized MoE test for different ZeRO stages to cover more ground.
Before fix:
FAILED tests/unit/moe/test_moe.py::TestMoE::test[True-2-2]
FAILED tests/unit/moe/test_moe.py::TestMoE::test[True-2-4]
============================================================================================= 2 failed, 10 passed, 2 deselected, 6 warnings in 162.20s (0:02:42) ==============================================================================================
After fix:
14.73s call unit/moe/test_moe.py::TestMoE::test[True-2-4]
14.27s call unit/moe/test_moe.py::TestMoE::test[False-2-4]
14.16s call unit/moe/test_moe.py::TestMoE::test[False-2-2]
14.07s call unit/moe/test_moe.py::TestMoE::test[False-1-4]
13.90s call unit/moe/test_moe.py::TestMoE::test[True-0-2]
13.59s call unit/moe/test_moe.py::TestMoE::test[True-0-4]
13.31s call unit/moe/test_moe.py::TestMoE::test[False-1-2]
13.09s call unit/moe/test_moe.py::TestMoE::test[True-1-4]
12.84s call unit/moe/test_moe.py::TestMoE::test[False-0-4]
12.58s call unit/moe/test_moe.py::TestMoE::test[True-2-2]
12.57s call unit/moe/test_moe.py::TestMoE::test[True-1-2]
12.45s call unit/moe/test_moe.py::TestMoE::test[False-0-2]
(24 durations < 1s hidden. Use -vv to show these durations.)
================================================================================================== 12 passed, 2 deselected, 6 warnings in 162.79s (0:02:42) ===================================================================================================
Closing since the fix has been merged to master branch. Please correct me if I'm wrong, @tjruwase.
@clumsy, I don't believe this was merged. Can you please point me to merged PR? Thanks!
@clumsy, I don't believe this was merged. Can you please point me to merged PR? Thanks!
My bad, @tjruwase, got confused with the previous PR, then we need to reopen this issue.
Sorry, @tjruwase, my new PR ruined this one, recreated this one from scratch (with a new branch instead this time): https://github.com/microsoft/DeepSpeed/pull/3577