
Test error: Distributed call failed in min-dep-os


Describe the bug

/Users/runner/work/MONAI/MONAI/monai/transforms/io/array.py:213: UserWarning: required package for reader PILReader is not installed, or the version doesn't match requirement.
  warnings.warn(
/Users/runner/work/MONAI/MONAI/monai/transforms/io/array.py:213: UserWarning: required package for reader ITKReader is not installed, or the version doesn't match requirement.
  warnings.warn(
/Users/runner/work/MONAI/MONAI/monai/transforms/io/array.py:213: UserWarning: required package for reader NrrdReader is not installed, or the version doesn't match requirement.
  warnings.warn(
/Users/runner/work/MONAI/MONAI/monai/transforms/io/array.py:213: UserWarning: required package for reader PydicomReader is not installed, or the version doesn't match requirement.
  warnings.warn(
/Users/runner/work/MONAI/MONAI/monai/transforms/utils.py:561: UserWarning: Num foregrounds 27, Num backgrounds 0, unable to generate class balanced samples, setting `pos_ratio` to 1.
  warnings.warn(
Traceback (most recent call last):
  File "/Users/runner/work/MONAI/MONAI/tests/utils.py", line 541, in _wrapper
    assert results.get(), "Distributed call failed."
AssertionError: Distributed call failed.

To Reproduce

https://github.com/Project-MONAI/MONAI/actions/runs/5455742504/jobs/9927617836?pr=6623
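
For local debugging, here is a minimal sketch of the pattern the test harness follows (hypothetical code, not the actual tests/utils.py; the port number is arbitrary): spawn workers that each call dist.init_process_group with the gloo backend, then assert that every worker succeeded.

```python
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int, results) -> None:
    # gloo's transport layer resolves the local hostname during init;
    # that resolution is what fails on the macOS runner in the log above
    try:
        dist.init_process_group(
            backend="gloo",
            init_method="tcp://127.0.0.1:12345",
            rank=rank,
            world_size=world_size,
        )
        dist.barrier()
        dist.destroy_process_group()
        results.put(True)
    except Exception:
        results.put(False)


if __name__ == "__main__":
    world_size = 2
    ctx = mp.get_context("spawn")
    results = ctx.Queue()
    workers = [ctx.Process(target=_worker, args=(r, world_size, results)) for r in range(world_size)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    # mirrors the failing assertion in the CI log
    assert all(results.get() for _ in workers), "Distributed call failed."
```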

Expected behavior

The test should pass.


mingxin-zheng avatar Jul 04 '23 14:07 mingxin-zheng

Root cause seems to be the GitHub CI runner:

test_even (tests.test_sampler_dist.DistributedSamplerTest) ... ok
Process SpawnProcess-80:
Traceback (most recent call last):
  File "/Users/runner/work/MONAI/MONAI/tests/utils.py", line 505, in run_process
    raise e
  File "/Users/runner/work/MONAI/MONAI/tests/utils.py", line 489, in run_process
    dist.init_process_group(
  File "/Users/runner/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "/Users/runner/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at /Users/runner/work/pytorch/pytorch/pytorch/third_party/gloo/gloo/transport/uv/device.cc:153] rp != nullptr. Unable to find address for: Mac-1688480011779.local
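
One possible workaround to try (a sketch, not verified on the hosted runner): pin gloo to the loopback interface via GLOO_SOCKET_IFNAME so it never has to resolve the runner's transient Mac-*.local hostname. The port number below is arbitrary.

```python
import os

import torch.distributed as dist

# "lo0" is the loopback interface name on macOS; gloo then binds to
# 127.0.0.1 directly instead of resolving the machine's hostname
os.environ.setdefault("GLOO_SOCKET_IFNAME", "lo0")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "12355")

dist.init_process_group(backend="gloo", rank=0, world_size=1)
dist.destroy_process_group()
```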

wyli avatar Jul 04 '23 15:07 wyli

Should we plan any next steps?

mingxin-zheng avatar Jul 05 '23 06:07 mingxin-zheng

Let's keep this open. Currently, manually rerunning the pipeline clears the error in most cases. If it becomes more frequent, we can remove the multiprocess tests on macOS.
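
If we do end up disabling them, a minimal sketch of what the skip could look like (the skip_if_darwin helper is hypothetical, not an existing MONAI test utility):

```python
import sys
import unittest


def skip_if_darwin(reason="gloo hostname resolution is flaky on macOS CI runners"):
    # hypothetical helper: skip the decorated test/case on macOS only
    return unittest.skipIf(sys.platform == "darwin", reason)


@skip_if_darwin()
class DistributedSamplerTest(unittest.TestCase):
    def test_even(self):
        ...
```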

wyli avatar Jul 05 '23 06:07 wyli