[BUG]: `test_triangle_count_mg` Hangs on `test_triangles`
Version
24.08
Which installation method(s) does this occur on?
Docker, Conda, Pip, Source
Describe the bug.
If I run the entire test module with `pytest --import-mode=append test_triangle_count_mg.py`, the run hangs once it reaches `test_triangles`.
However, if I run just that test with `pytest --import-mode=append test_triangle_count_mg.py::test_triangles`, all 4 parametrized cases pass.
I have been able to reproduce this on dgx18 with both 2 GPUs and 8 GPUs.
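To help narrow this down, one thing to try (a diagnostic sketch, not something I have run exhaustively) is deselecting the `test_sg_triangles` cases and re-running the module, to see whether the hang depends on the earlier tests having run in the same session; `--deselect` is a standard pytest option and the test IDs are the ones shown in the log below:

```shell
# Run the module but skip the single-GPU cases that execute before test_triangles,
# to check whether the hang requires the preceding tests to have run first.
pytest --import-mode=append test_triangle_count_mg.py \
    --deselect "test_triangle_count_mg.py::test_sg_triangles"
```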
Minimum reproducible example
cd cugraph/python/cugraph/cugraph/tests/community/
pytest --import-mode=append test_triangle_count_mg.py
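For comparison, here is a minimal standalone (non-pytest) sketch of the same MG triangle count flow. This is not the exact code from the test or its `conftest.py` fixtures; it assumes the bundled karate dataset from `cugraph.datasets` and the usual LocalCUDACluster/Comms setup:

```python
# Minimal standalone MG triangle_count sketch (assumptions: karate dataset from
# cugraph.datasets and the standard LocalCUDACluster/Comms setup; not the exact
# code used by the test fixtures).
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

import cugraph
import cugraph.dask as dcg
import cugraph.dask.comms.comms as Comms
from cugraph.datasets import karate

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)
    Comms.initialize(p2p=True)

    # Distribute the karate edge list and build an undirected MG graph.
    edgelist = karate.get_edgelist()                  # cudf DataFrame: src, dst, wgt
    ddf = dask_cudf.from_cudf(edgelist, npartitions=2)
    G = cugraph.Graph(directed=False)
    G.from_dask_cudf_edgelist(ddf, source="src", destination="dst")

    # MG triangle counting; returns a dask_cudf DataFrame of per-vertex counts.
    counts = dcg.triangle_count(G)
    print(counts.compute())

    Comms.destroy()
    client.close()
    cluster.close()
```

If something like this runs cleanly on dgx18, that would point at per-session test state (the shared Dask client fixture reused across tests) rather than the algorithm itself.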
Relevant log output
06/25/24-09:36:30.353826845_UTC>>>> NODE 0: ******** STARTING TESTS FROM: tests/community/test_triangle_count_mg.py, using 2 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=1 min_time=0.000005 max_time=0 calibration_precision=10 warmup=False warmup_iterations=100000)
rapids_pytest_benchmark: 0.0.15
rootdir: /root/cugraph/python/cugraph
configfile: pytest.ini
plugins: benchmark-4.0.0, cov-5.0.0, rapids-pytest-benchmark-0.0.15
collecting ... collected 8 items
tests/community/test_triangle_count_mg.py::test_sg_triangles[True-dataset0] [1719308199.470541] [rno1-m02-d01-dgx1-054:3448418:0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719308199.470541] [rno1-m02-d01-dgx1-054:3448418:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719308199.470236] [rno1-m02-d01-dgx1-054:3448421:0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719308199.470236] [rno1-m02-d01-dgx1-054:3448421:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
Dask client/cluster created using LocalCUDACluster
PASSED
tests/community/test_triangle_count_mg.py::test_sg_triangles[True-dataset1] PASSED
tests/community/test_triangle_count_mg.py::test_sg_triangles[False-dataset0] PASSED
tests/community/test_triangle_count_mg.py::test_sg_triangles[False-dataset1] PASSED
tests/community/test_triangle_count_mg.py::test_triangles[True-dataset0] 2024-06-25 02:46:36,775 - distributed.scheduler - WARNING - Worker failed to heartbeat for 573s; attempting restart: <WorkerState 'tcp://127.0.0.1:33497', name: 0, status: running, memory: 1, processing: 1>
2024-06-25 02:46:36,776 - distributed.scheduler - WARNING - Worker failed to heartbeat for 573s; attempting restart: <WorkerState 'tcp://127.0.0.1:36725', name: 1, status: running, memory: 1, processing: 1>
2024-06-25 02:46:40,781 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-06-25 02:46:40,781 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-06-25 02:46:41,504 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:33497' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-1a8448be-f346-4fb3-8c1a-d30e8927ceea'} (stimulus_id='handle-worker-cleanup-1719308801.5040765')
2024-06-25 02:46:41,615 - distributed.nanny - WARNING - Restarting worker
2024-06-25 02:46:41,678 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:36725' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-fe0f6523-9050-4b79-a59d-27b05ae06c06'} (stimulus_id='handle-worker-cleanup-1719308801.6782262')
2024-06-25 02:46:41,771 - distributed.nanny - WARNING - Restarting worker
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
2024-06-25 03:36:32,399 - distributed.nanny - ERROR - Worker process died unexpectedly
2024-06-25 03:36:32,399 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
target(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
asyncio_run(run(), loop_factory=get_loop_factory())
File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
event_list = self._selector.select(timeout)
File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
fd_event_list = self._selector.poll(timeout, max_ev)
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
KeyboardInterrupt
File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
target(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
asyncio_run(run(), loop_factory=get_loop_factory())
File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/tornado/ioloop.py", line 759, in _run_callback
ret = gen.convert_yielded(ret)
File "/opt/conda/lib/python3.10/functools.py", line 884, in wrapper
def wrapper(*args, **kw):
KeyboardInterrupt
sys:1: RuntimeWarning: coroutine 'PeriodicCallback._run' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
2024-06-25 03:37:01,016 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:33497 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:33497 after 30 s
2024-06-25 03:37:01,017 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:36725 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:36725 after 30 s
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/opt/conda/lib/python3.10/threading.py:324: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 546, in connect
stream = await self.client.connect(
File "/opt/conda/lib/python3.10/site-packages/tornado/tcpclient.py", line 279, in connect
af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 342, in connect
comm = await wait_for(
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 1961, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/bin/pytest", line 10, in <module>
sys.exit(console_main())
File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 206, in console_main
code = main()
File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 178, in main
ret: Union[ExitCode, int] = config.hook.pytest_cmdline_main(
File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
raise exception.with_traceback(exception.__traceback__)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
res = hook_impl.function(*args)
File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 332, in pytest_cmdline_main
return wrap_session(config, _main)
File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 320, in wrap_session
config.hook.pytest_sessionfinish(
File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 182, in _multicall
return outcome.get_result()
File "/opt/conda/lib/python3.10/site-packages/pluggy/_result.py", line 100, in get_result
raise exc.with_traceback(exc.__traceback__)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
teardown.throw(outcome._exception)
File "/opt/conda/lib/python3.10/site-packages/_pytest/logging.py", line 872, in pytest_sessionfinish
return (yield)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
teardown.throw(outcome._exception)
File "/opt/conda/lib/python3.10/site-packages/_pytest/terminal.py", line 867, in pytest_sessionfinish
result = yield
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
teardown.throw(outcome._exception)
File "/opt/conda/lib/python3.10/site-packages/_pytest/warnings.py", line 140, in pytest_sessionfinish
return (yield)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
res = hook_impl.function(*args)
File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 110, in pytest_sessionfinish
session._setupstate.teardown_exact(None)
File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 557, in teardown_exact
raise exceptions[0]
File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 546, in teardown_exact
fin()
File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1023, in finish
raise exceptions[0]
File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1012, in finish
fin()
File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 896, in _teardown_yield_fixture
next(it)
File "/root/cugraph/python/cugraph/cugraph/tests/conftest.py", line 52, in dask_client
stop_dask_client(dask_client, dask_cluster)
File "/opt/conda/lib/python3.10/site-packages/cugraph/testing/mg_utils.py", line 178, in stop_dask_client
Comms.destroy()
File "/opt/conda/lib/python3.10/site-packages/cugraph/dask/comms/comms.py", line 216, in destroy
__instance.destroy()
File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
self.client.run(
File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 2991, in run
return self.sync(
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 358, in sync
return sync(
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 434, in sync
raise error
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 408, in f
result = yield future
File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
value = future.result()
File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 2896, in _run
raise exc
File "/opt/conda/lib/python3.10/site-packages/distributed/scheduler.py", line 6458, in send_message
comm = await self.rpc.connect(addr)
File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1674, in connect
return connect_attempt.result()
File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1564, in _connect
comm = await connect(
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 368, in connect
raise OSError(
OSError: Timed out trying to connect to tcp://127.0.0.1:33497 after 30 s
Exception ignored in: <function Comms.__del__ at 0x15089c0e4670>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 135, in __del__
File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 2991, in run
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 358, in sync
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 425, in sync
File "/opt/conda/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 227, in add_callback
AttributeError: 'NoneType' object has no attribute 'get_running_loop'
06/25/24-10:37:02.265345794_UTC>>>> ERROR: command timed out after 3600 seconds
06/25/24-10:37:02.266951355_UTC>>>> NODE 0: pytest exited with code: 124, run-py-tests.sh overall exit code is: 124
06/25/24-10:37:02.369564140_UTC>>>> NODE 0: remaining python processes: [ 3437613 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
06/25/24-10:37:02.403092539_UTC>>>> NODE 0: remaining dask processes: [ ]
Environment details
Using nightly containers from here:
gitlab-master.nvidia.com:5005/rratzel/cugraph-testing/cugraph_testing-24.08
Other/Misc.
No response
Code of Conduct
- [X] I agree to follow cuGraph's Code of Conduct
- [X] I have searched the open bugs and have found no duplicates for this bug report