Fix CI hang with torch 2.7 & improve UT
Fix the CI hang and improve the unit tests.
@inkcherry, thanks for the quick PR. I have a few questions:

- It seems this PR is a workaround using `reuse_dist_env=False` rather than fixing autotp itself. Is this correct?
- Do you know why `world_size` affects the hang?
- Can you confirm that `set_autotp_mode(training=False)` did not affect the hang in your environment?
In my environment (torch==2.7.0), the issue seems to be related to the `DistributedTest` class, not AutoTP. We can create a new test file to reproduce it:
```python
import pytest

from unit.common import DistributedTest


@pytest.mark.parametrize("tp_size", [2, 4])
class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 2
    reuse_dist_env = False  # world_size=4 and reuse_dist_env=True will hang.

    def test(self, tp_size: int):
        print("finished test")
        return
```
Some processes (at random) will hang during teardown with `world_size=4` and `reuse_dist_env=True`, at:

`tests/unit/common.py` -> `_dist_destroy(self)` -> `dist.destroy_process_group()`
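For reference, a simplified sketch of that teardown step (the real `_dist_destroy` in `tests/unit/common.py` may contain more logic; the `is_initialized` guard here is an assumption):

```python
import torch.distributed as dist


def _dist_destroy(self):
    # If ranks reach this point out of sync (e.g. in a reused environment with
    # world_size=4), some of them can block inside destroy_process_group(),
    # which matches the hang observed above.
    if dist.is_initialized():
        dist.destroy_process_group()
```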
Notice that this is the only unit test using `world_size > 2` together with `reuse_dist_env=True`, so we can temporarily work around the issue by avoiding this combination.
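To illustrate what the workaround means in practice, here is a minimal standalone sketch (not DeepSpeed's test harness; all names and values below are illustrative) of the per-test init/destroy cycle that `reuse_dist_env=False` implies, so no rank is left waiting in a pooled teardown:

```python
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def _one_test(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29515"
    # gloo backend so this sketch runs on CPU-only machines
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    dist.barrier()  # stand-in for the test body
    # Torn down immediately after the test; nothing is reused across tests.
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(_one_test, args=(4,), nprocs=4)
```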
The `nv-torch-latest-v100` CI workflow is currently broken.