
Fix CI hang in torch 2.7 & improve UT

inkcherry opened this issue 6 months ago • 2 comments

Fix the CI hang and improve the unit test.

inkcherry · May 30 '25 02:05

@inkcherry, thanks for the quick PR. I have a few questions

  1. It seems this PR is a workaround using reuse_dist_env=False rather than fixing autotp itself. Is this correct?
  2. Do you know why world_size affects the hang?
  3. Can you confirm that set_autotp_mode(training=False) did not affect the hang in your environment?

sfc-gh-truwase · May 30 '25 11:05

> @inkcherry, thanks for the quick PR. I have a few questions
>
>   1. It seems this PR is a workaround using reuse_dist_env=False rather than fixing autotp itself. Is this correct?
>   2. Do you know why world_size affects the hang?
>   3. Can you confirm that set_autotp_mode(training=False) did not affect the hang in your environment?

In my environment (torch==2.7.0), it seems the issue is related to the DistributedTest class, not AutoTP.

We can create a new test file to reproduce it:

```python
from unit.common import DistributedTest
import pytest


@pytest.mark.parametrize("tp_size", [2, 4])
class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 2
    reuse_dist_env = False  # world_size=4 and reuse_dist_env=True will hang.

    def test(self, tp_size: int):
        print("finished test")
```

With world_size=4 and reuse_dist_env=True, some process (chosen at random) will hang during teardown, in tests/unit/common.py -> _dist_destroy(self) -> dist.destroy_process_group().
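
For context, here is a minimal sketch of the teardown path mentioned above (the actual helper in tests/unit/common.py contains more bookkeeping, so treat this as an approximation rather than the real code):

```python
import torch.distributed as dist


class DistributedTest:
    # ... process-launching machinery omitted ...

    def _dist_destroy(self):
        # With torch 2.7, world_size=4, and reuse_dist_env=True, one or
        # more ranks can block indefinitely inside this call at teardown.
        if dist.is_initialized():
            dist.destroy_process_group()
```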

Notice that this is the only unit test using world_size > 2 together with reuse_dist_env=True, so we can temporarily work around the issue by avoiding this combination.
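
As a hedged illustration of that workaround (the class name below is hypothetical; the actual AutoTP test touched by this PR may look different), an affected test keeps its world size but stops reusing the distributed environment:

```python
# Hypothetical example of the workaround described above; the real test
# class and its contents in DeepSpeed's unit tests may differ.
from unit.common import DistributedTest


class TestAutoTPTraining(DistributedTest):
    world_size = 4
    # Avoid the problematic combination: world_size > 2 together with
    # reuse_dist_env=True hangs in dist.destroy_process_group() on torch 2.7.
    reuse_dist_env = False

    def test(self):
        ...
```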

inkcherry · May 30 '25 13:05

nv-torch-latest-v100 is currently broken.

sfc-gh-truwase · Jun 02 '25 19:06