
Fix CI hang in torch 2.7 & improve UT

inkcherry opened this issue 6 months ago • 2 comments

Fix the CI hang and improve the unit test.

inkcherry · May 30 '25 02:05

@inkcherry, thanks for the quick PR. I have a few questions

  1. It seems this PR is a workaround using reuse_dist_env=False rather than fixing autotp itself. Is this correct?
  2. Do you know why world_size affects the hang?
  3. Can you confirm that set_autotp_mode(training=False) did not affect the hang in your environment?

sfc-gh-truwase · May 30 '25 11:05

> @inkcherry, thanks for the quick PR. I have a few questions
>
>   1. It seems this PR is a workaround using reuse_dist_env=False rather than fixing autotp itself. Is this correct?
>   2. Do you know why world_size affects the hang?
>   3. Can you confirm that set_autotp_mode(training=False) did not affect the hang in your environment?

In my environment (torch==2.7.0), it seems the issue is related to the DistributedTest class, not AutoTP.

We can create a new test file to reproduce it:

```python
from unit.common import DistributedTest
import pytest


@pytest.mark.parametrize("tp_size", [2, 4])
class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 2
    reuse_dist_env = False  # world_size=4 and reuse_dist_env=True will hang.

    def test(self, tp_size: int):
        print("finished test")
```

With world_size=4 and reuse_dist_env=True, some process (chosen at random) will hang during teardown, in tests/unit/common.py -> _dist_destroy(self) -> dist.destroy_process_group().
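
For context, here is a minimal sketch of the teardown path mentioned above (the actual helper in tests/unit/common.py contains more bookkeeping, so treat this as an approximation rather than the real code):

```python
import torch.distributed as dist


class DistributedTest:
    # ... process-launching machinery omitted ...

    def _dist_destroy(self):
        # With torch 2.7, world_size=4, and reuse_dist_env=True, one or
        # more ranks can block indefinitely inside this call at teardown.
        if dist.is_initialized():
            dist.destroy_process_group()
```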

Notice that this is the only unit test using world_size > 2 together with reuse_dist_env=True, so we can temporarily work around the issue by avoiding this combination.
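
As a hedged illustration of that workaround (the class name below is hypothetical; the actual AutoTP test touched by this PR may look different), an affected test keeps its world size but stops reusing the distributed environment:

```python
# Hypothetical example of the workaround described above; the real test
# class and its contents in DeepSpeed's unit tests may differ.
from unit.common import DistributedTest


class TestAutoTPTraining(DistributedTest):
    world_size = 4
    # Avoid the problematic combination: world_size > 2 together with
    # reuse_dist_env=True hangs in dist.destroy_process_group() on torch 2.7.
    reuse_dist_env = False

    def test(self):
        ...
```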

inkcherry · May 30 '25 13:05

nv-torch-latest-v100 is currently broken.

sfc-gh-truwase · Jun 02 '25 19:06