cugraph
`bench_louvain` Failing in `bench_algos.py` Due to `CommClosedError`
Version
24.08
Which installation method(s) does this occur on?
Conda, Source
Describe the bug.
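The Louvain benchmark (`bench_louvain`) in `cugraph/benchmarks/cugraph/pytest-based/bench_algos.py` fails with a `CommClosedError` caused by a `ConnectionRefusedError`. The Dask worker processes die unexpectedly, the fixture teardown (`stop_dask_client`) then times out trying to connect to `tcp://127.0.0.1:33921`, and the run is eventually killed by the 3600-second command timeout.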
Minimum reproducible example
pytest -v --import-mode=append bench_algos.py::bench_louvain
Relevant log output
07/17/24-11:32:55.328165306_UTC>>>> NODE 0: ******** STARTING BENCHMARK FROM: ./bench_algos.py::bench_louvain, using 8 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
rapids_pytest_benchmark: 0.0.15
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /root/cugraph/benchmarks
configfile: pytest.ini
plugins: rapids-pytest-benchmark-0.0.15, benchmark-4.0.0, cov-5.0.0
collecting ... collected 720 items / 719 deselected / 1 selected
bench_algos.py::bench_louvain[ds:rmat_mg_20_16-mm:False-pa:True] [1721215988.622395] [rno1-m02-c08-dgx1-048:3618260:0]            sock.c:470  UCX  ERROR bind(fd=141 addr=0.0.0.0:37111) failed: Address already in use
Dask client/cluster created using LocalCUDACluster
2024-07-17 05:32:55,418 - distributed.worker - WARNING - Scheduler was unaware of this worker; shutting down.
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Not saving anything, no benchmarks have been run!
  warner(PytestBenchmarkWarning(text))
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
2024-07-17 05:32:57,433 - distributed.nanny - ERROR - Worker process died unexpectedly
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
2024-07-17 05:32:57,436 - distributed.nanny - ERROR - Worker process died unexpectedly
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Process Dask Worker process (from Nanny):
2024-07-17 05:32:57,440 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Process Dask Worker process (from Nanny):
2024-07-17 05:32:57,444 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Process Dask Worker process (from Nanny):
2024-07-17 05:32:57,449 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
2024-07-17 05:32:57,456 - distributed.nanny - ERROR - Worker process died unexpectedly
2024-07-17 05:32:57,456 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/opt/conda/lib/python3.10/threading.py:324: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x14909852d9f0>: ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/opt/conda/bin/pytest", line 10, in <module>
    sys.exit(console_main())
  File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 206, in console_main
    code = main()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 178, in main
    ret: Union[ExitCode, int] = config.hook.pytest_cmdline_main(
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 332, in pytest_cmdline_main
    return wrap_session(config, _main)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 320, in wrap_session
    config.hook.pytest_sessionfinish(
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 182, in _multicall
    return outcome.get_result()
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_result.py", line 100, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/logging.py", line 872, in pytest_sessionfinish
    return (yield)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/terminal.py", line 867, in pytest_sessionfinish
    result = yield
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/warnings.py", line 140, in pytest_sessionfinish
    return (yield)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 110, in pytest_sessionfinish
    session._setupstate.teardown_exact(None)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 557, in teardown_exact
    raise exceptions[0]
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 546, in teardown_exact
    fin()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1023, in finish
    raise exceptions[0]
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1012, in finish
    fin()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 896, in _teardown_yield_fixture
    next(it)
  File "/root/cugraph/benchmarks/cugraph/pytest-based/bench_algos.py", line 230, in dataset
    mg_utils.stop_dask_client(client, cluster)
  File "/opt/conda/lib/python3.10/site-packages/cugraph/testing/mg_utils.py", line 178, in stop_dask_client
    Comms.destroy()
  File "/opt/conda/lib/python3.10/site-packages/cugraph/dask/comms/comms.py", line 216, in destroy
    __instance.destroy()
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
    self.client.run(
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3074, in run
    return self.sync(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 364, in sync
    return sync(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 440, in sync
    raise error
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 414, in f
    result = yield future
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 2979, in _run
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/distributed/scheduler.py", line 6527, in send_message
    comm = await self.rpc.connect(addr)
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1677, in connect
    return connect_attempt.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1567, in _connect
    comm = await connect(
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 368, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://127.0.0.1:33921 after 30 s
Exception ignored in: <function Comms.__del__ at 0x14920f451f30>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 135, in __del__
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3074, in run
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 364, in sync
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 431, in sync
  File "/opt/conda/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 227, in add_callback
AttributeError: 'NoneType' object has no attribute 'get_running_loop'
07/17/24-12:33:27.207025649_UTC>>>> ERROR: command timed out after 3600 seconds
07/17/24-12:33:27.208473673_UTC>>>> NODE 0: pytest exited with code: 124, run-py-tests.sh overall exit code is: 124
07/17/24-12:33:27.325919843_UTC>>>> NODE 0: remaining python processes: [ 3612421 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
07/17/24-12:33:27.350685725_UTC>>>> NODE 0: remaining dask processes: [  ]
Environment details
Being run inside the nightly cugraph MNMG testing containers on draco-rno
Other/Misc.
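The first error in the log is the UCX `bind(... addr=0.0.0.0:37111) failed: Address already in use` message. Whether that is related to the later `ConnectionRefusedError` is not established here. As a diagnostic aid, the following minimal sketch uses only the Python standard library to check whether a given TCP port can be bound on the node before the benchmark starts. The helper name `port_is_free` is hypothetical and is not part of cuGraph, Dask, or UCX.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical diagnostic helper; it only probes the port and releases
    it again when the socket is closed.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE when another process holds the port.
            return False
        return True


if __name__ == "__main__":
    # 37111 is the port from the UCX error line in the log above.
    print("port 37111 free:", port_is_free(37111))
```

A `False` result before the benchmark starts would suggest that a leftover process from an earlier run still holds the port. A `True` result does not rule out a race at startup.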
Code of Conduct
- [X] I agree to follow cuGraph's Code of Conduct
- [X] I have searched the open bugs and have found no duplicates for this bug report