cugraph
[BUG]: `test_k_truss_subgraph` Memory Error on 2-GPUs
Version
24.10
Which installation method(s) does this occur on?
Conda
Describe the bug.
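When running `tests/community/test_k_truss_subgraph_mg.py` on 2 GPUs on draco-rno, the test fails with a memory error: both Dask workers hit a segmentation fault (signal 11) during `test_mg_ktruss_subgraph[4-False-dataset2]`, and the run then hangs until it is killed by the 900-second timeout.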
Minimum reproducible example
pytest -v --import-mode=append test_k_truss_subgraph_mg.py
Relevant log output
08/19/24-09:31:10.033434924_UTC>>>> NODE 0: ******** STARTING TESTS FROM: tests/community/test_k_truss_subgraph_mg.py, using 2 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.3.2, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
rapids_pytest_benchmark: 0.0.15
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=1 min_time=0.000005 max_time=0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /root/cugraph/python/cugraph
configfile: pytest.ini
plugins: cov-5.0.0, rapids-pytest-benchmark-0.0.15, benchmark-4.0.0
collecting ... collected 18 items
tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-True-dataset0]
Dask client/cluster created using LocalCUDACluster
PASSED
tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-True-dataset1] PASSED
tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-True-dataset2] PASSED
tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-False-dataset0] PASSED
tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-False-dataset1] PASSED
tests/community/test_k_truss_subgraph_mg.py::test_mg_ktruss_subgraph[4-False-dataset2] [rno1-m02-f01-dgx1-116:3535017:0:3535158] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[rno1-m02-f01-dgx1-116:3535020:0:3535160] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:3535158) ====
0 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(ucs_handle_error+0x2fd) [0x14eafd95dcfd]
1 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2def4) [0x14eafd95def4]
2 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2e0ba) [0x14eafd95e0ba]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x14eb71edf520]
4 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x55c60) [0x14ead105dc60]
5 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3217e) [0x14ead103a17e]
6 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x345bf) [0x14ead103c5bf]
7 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x38e73) [0x14ead1040e73]
8 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3bfc5) [0x14ead1043fc5]
9 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3d183) [0x14ead1045183]
10 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(ncclGroupEnd+0x6a) [0x14ead104592a]
11 /opt/conda/lib/python3.10/site-packages/raft_dask/common/comms_utils.cpython-310-x86_64-linux-gnu.so(+0x32573) [0x14eafdb2b573]
12 /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(+0x348a069) [0x14e9e43ef069]
13 /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph6detail24edge_triangle_count_implIiiLb0ELb1EEENS_15edge_property_tINS_12graph_view_tIT_T0_Lb0EXT2_EvEES5_EERKN4raft8handle_tERKNS3_IS4_S5_XT1_EXT2_EvEE+0x86f) [0x14e9e43f6bef]
14 /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph19edge_triangle_countIiiLb1EEENS_15edge_property_tINS_12graph_view_tIT_T0_Lb0EXT1_EvEES4_EERKN4raft8handle_tERKS5_+0xa) [0x14e9e43f994a]
15 /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph7k_trussIiifLb1EEESt5tupleIJN3rmm14device_uvectorIT_EES5_St8optionalINS3_IT1_EEEEERKN4raft8handle_tERKNS_12graph_view_tIS4_T0_Lb0EXT2_EvEES6_INS_20edge_property_view_tISG_PKS7_N6thrust15iterator_traitsISM_E10value_typeEEEESG_b+0x10f9) [0x14e9e55d9bd9]
16 /opt/conda/lib/python3.10/site-packages/pylibcugraph/../../../libcugraph_c.so(+0x1e0b4f) [0x14e958121b4f]
17 /opt/conda/lib/python3.10/site-packages/pylibcugraph/../../../libcugraph_c.so(cugraph_k_truss_subgraph+0xde) [0x14e9581297ae]
18 /opt/conda/lib/python3.10/site-packages/pylibcugraph/k_truss_subgraph.cpython-310-x86_64-linux-gnu.so(+0x6cde) [0x14eaad202cde]
19 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x13ca) [0x560dda68c8fa]
20 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
21 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x560dda68e2b3]
22 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
23 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x320) [0x560dda68b850]
24 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
25 /opt/conda/bin/python3.10(+0x25f60c) [0x560dda7b660c]
26 /opt/conda/bin/python3.10(+0xfdd90) [0x560dda654d90]
27 /opt/conda/bin/python3.10(+0x13c2a3) [0x560dda6932a3]
28 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x5cd5) [0x560dda691205]
29 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
30 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x560dda68e2b3]
31 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
32 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x560dda68bc5c]
33 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
34 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x560dda68e2b3]
35 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
36 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x560dda68bc5c]
37 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560dda69ba2c]
38 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x560dda68bc5c]
39 /opt/conda/bin/python3.10(+0x150804) [0x560dda6a7804]
40 /opt/conda/bin/python3.10(+0x228372) [0x560dda77f372]
41 /opt/conda/bin/python3.10(+0x228324) [0x560dda77f324]
42 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x14eb71f31ac3]
43 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x14eb71fc3850]
=================================
==== backtrace (tid:3535160) ====
0 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(ucs_handle_error+0x2fd) [0x154101c86cfd]
1 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2def4) [0x154101c86ef4]
2 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2e0ba) [0x154101c870ba]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x15416c0f9520]
4 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x55c60) [0x1540c905dc60]
5 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3217e) [0x1540c903a17e]
6 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x345bf) [0x1540c903c5bf]
7 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x38e73) [0x1540c9040e73]
8 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3bfc5) [0x1540c9043fc5]
9 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(+0x3d183) [0x1540c9045183]
10 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(ncclGroupEnd+0x6a) [0x1540c904592a]
11 /opt/conda/lib/python3.10/site-packages/raft_dask/common/comms_utils.cpython-310-x86_64-linux-gnu.so(+0x32573) [0x154101e92573]
12 /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(+0x348a069) [0x153fdea7e069]
13 /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph6detail24edge_triangle_count_implIiiLb0ELb1EEENS_15edge_property_tINS_12graph_view_tIT_T0_Lb0EXT2_EvEES5_EERKN4raft8handle_tERKNS3_IS4_S5_XT1_EXT2_EvEE+0x86f) [0x153fdea85bef]
14 /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph19edge_triangle_countIiiLb1EEENS_15edge_property_tINS_12graph_view_tIT_T0_Lb0EXT1_EvEES4_EERKN4raft8handle_tERKS5_+0xa) [0x153fdea8894a]
15 /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph7k_trussIiifLb1EEESt5tupleIJN3rmm14device_uvectorIT_EES5_St8optionalINS3_IT1_EEEEERKN4raft8handle_tERKNS_12graph_view_tIS4_T0_Lb0EXT2_EvEES6_INS_20edge_property_view_tISG_PKS7_N6thrust15iterator_traitsISM_E10value_typeEEEESG_b+0x10f9) [0x153fdfc68bd9]
16 /opt/conda/lib/python3.10/site-packages/pylibcugraph/../../../libcugraph_c.so(+0x1e0b4f) [0x153f52220b4f]
17 /opt/conda/lib/python3.10/site-packages/pylibcugraph/../../../libcugraph_c.so(cugraph_k_truss_subgraph+0xde) [0x153f522287ae]
18 /opt/conda/lib/python3.10/site-packages/pylibcugraph/k_truss_subgraph.cpython-310-x86_64-linux-gnu.so(+0x6cde) [0x1540a2db9cde]
19 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x13ca) [0x56501c88e8fa]
20 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
21 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x56501c8902b3]
22 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
23 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x320) [0x56501c88d850]
24 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
25 /opt/conda/bin/python3.10(+0x25f60c) [0x56501c9b860c]
26 /opt/conda/bin/python3.10(+0xfdd90) [0x56501c856d90]
27 /opt/conda/bin/python3.10(+0x13c2a3) [0x56501c8952a3]
28 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x5cd5) [0x56501c893205]
29 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
30 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x56501c8902b3]
31 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
32 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x56501c88dc5c]
33 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
34 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x2d83) [0x56501c8902b3]
35 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
36 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x56501c88dc5c]
37 /opt/conda/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x56501c89da2c]
38 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x72c) [0x56501c88dc5c]
39 /opt/conda/bin/python3.10(+0x150804) [0x56501c8a9804]
40 /opt/conda/bin/python3.10(+0x228372) [0x56501c981372]
41 /opt/conda/bin/python3.10(+0x228324) [0x56501c981324]
42 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x15416c14bac3]
43 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x15416c1dd850]
=================================
2024-08-19 02:33:04,329 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:44047' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-9385f2a5-2e1f-4f4f-99c9-556a0d63fd42'} (stimulus_id='handle-worker-cleanup-1724059984.328924')
2024-08-19 02:33:04,424 - distributed.nanny - WARNING - Restarting worker
2024-08-19 02:33:04,487 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:46495' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-ea9ace46-435d-45b8-bbe5-7f0ba731f9de'} (stimulus_id='handle-worker-cleanup-1724059984.487717')
2024-08-19 02:33:04,584 - distributed.nanny - WARNING - Restarting worker
2024-08-19 02:46:12,072 - distributed.nanny - ERROR - Worker process died unexpectedly
2024-08-19 02:46:12,072 - distributed.nanny - ERROR - Worker process died unexpectedly
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
target(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1015, in _run
asyncio_run(run(), loop_factory=get_loop_factory())
File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
event_list = self._selector.select(timeout)
File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
target(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1015, in _run
asyncio_run(run(), loop_factory=get_loop_factory())
File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
event_list = self._selector.select(timeout)
File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
2024-08-19 02:46:40,513 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:44047 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:44047 after 30 s
2024-08-19 02:46:40,514 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:46495 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:46495 after 30 s
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/opt/conda/lib/python3.10/threading.py:324: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x145f18c168c0>: ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/bin/pytest", line 10, in <module>
sys.exit(console_main())
File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 201, in console_main
code = main()
File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 175, in main
ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
raise exception.with_traceback(exception.__traceback__)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
res = hook_impl.function(*args)
File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 330, in pytest_cmdline_main
return wrap_session(config, _main)
File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 318, in wrap_session
config.hook.pytest_sessionfinish(
File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 182, in _multicall
return outcome.get_result()
File "/opt/conda/lib/python3.10/site-packages/pluggy/_result.py", line 100, in get_result
raise exc.with_traceback(exc.__traceback__)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
teardown.throw(outcome._exception)
File "/opt/conda/lib/python3.10/site-packages/_pytest/logging.py", line 870, in pytest_sessionfinish
return (yield)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
teardown.throw(outcome._exception)
File "/opt/conda/lib/python3.10/site-packages/_pytest/terminal.py", line 893, in pytest_sessionfinish
result = yield
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
teardown.throw(outcome._exception)
File "/opt/conda/lib/python3.10/site-packages/_pytest/warnings.py", line 141, in pytest_sessionfinish
return (yield)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
res = hook_impl.function(*args)
File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 107, in pytest_sessionfinish
session._setupstate.teardown_exact(None)
File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 557, in teardown_exact
raise exceptions[0]
File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 546, in teardown_exact
fin()
File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1031, in finish
raise exceptions[0]
File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1020, in finish
fin()
File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 906, in _teardown_yield_fixture
next(it)
File "/root/cugraph/python/cugraph/cugraph/tests/conftest.py", line 52, in dask_client
stop_dask_client(dask_client, dask_cluster)
File "/opt/conda/lib/python3.10/site-packages/cugraph/testing/mg_utils.py", line 182, in stop_dask_client
Comms.destroy()
File "/opt/conda/lib/python3.10/site-packages/cugraph/dask/comms/comms.py", line 214, in destroy
__instance.destroy()
File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
self.client.run(
File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3192, in run
return self.sync(
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 363, in sync
return sync(
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 439, in sync
raise error
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 413, in f
result = yield future
File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
value = future.result()
File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3097, in _run
raise exc
File "/opt/conda/lib/python3.10/site-packages/distributed/scheduler.py", line 6653, in send_message
comm = await self.rpc.connect(addr)
File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1535, in connect
return connect_attempt.result()
File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1425, in _connect
comm = await connect(
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 368, in connect
raise OSError(
OSError: Timed out trying to connect to tcp://127.0.0.1:44047 after 30 s
Exception ignored in: <function Comms.__del__ at 0x146005005120>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 135, in __del__
File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3192, in run
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 363, in sync
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 430, in sync
File "/opt/conda/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 227, in add_callback
AttributeError: 'NoneType' object has no attribute 'get_running_loop'
08/19/24-09:46:41.556836520_UTC>>>> ERROR: command timed out after 900 seconds
08/19/24-09:46:41.557996984_UTC>>>> NODE 0: pytest exited with code: 124, run-py-tests.sh overall exit code is: 124
08/19/24-09:46:41.633951285_UTC>>>> NODE 0: remaining python processes: [ 3526387 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
08/19/24-09:46:41.657663379_UTC>>>> NODE 0: remaining dask processes: [ ]
Environment details
Running on 2 GPUs on 1 node on draco-rno, using LocalCUDACluster.
Other/Misc.
I was unable to reproduce this failure on the lab machines. The failure can also be reproduced inside an interactive Slurm session by running this test file alone, without running the entire suite of cuGraph MG tests.
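To make the crash site in the backtraces easier to read, here is a small sketch that keeps only the frames carrying a symbol name and drops the anonymous offsets and CPython interpreter frames. The helper names (`named_frames`, `qualified_name`) are hypothetical and not part of cuGraph or UCX. The name shortening is a rough heuristic that reads only the leading length-prefixed components of a mangled C++ name; it is not a full demangler. The sample lines are copied from frames 10, 11, 14, 17 and 19 of the log above.

```python
import re

# Matches backtrace frames that carry a symbol name, such as
#   "10 /path/libnccl.so.2(ncclGroupEnd+0x6a) [0x14ead104592a]"
# Frames with only an offset, e.g. "(+0x55c60)", do not match.
FRAME_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\(([A-Za-z_]\w*)\+0x[0-9a-f]+\)\s+\[0x[0-9a-f]+\]"
)


def qualified_name(symbol):
    """Heuristic: turn '_ZN7cugraph7k_truss...' into 'cugraph::k_truss'.

    Only the leading <length><identifier> components are read; template
    arguments and parameter types are ignored.
    """
    if not symbol.startswith("_ZN"):
        return symbol
    parts, pos = [], 3
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    return "::".join(parts) if parts else symbol


def named_frames(backtrace_text, skip_prefixes=("_Py",)):
    """Return (frame index, library file name, readable symbol) tuples."""
    frames = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # anonymous frame or non-frame line
        index, path, symbol = match.groups()
        if symbol.startswith(skip_prefixes):
            continue  # CPython interpreter frames add no information here
        frames.append((int(index), path.rsplit("/", 1)[-1], qualified_name(symbol)))
    return frames


# Frames 10, 11, 14, 17 and 19 from the first backtrace in this report.
SAMPLE = """\
10 /opt/conda/lib/python3.10/site-packages/raft_dask/common/../../../../libnccl.so.2(ncclGroupEnd+0x6a) [0x14ead104592a]
11 /opt/conda/lib/python3.10/site-packages/raft_dask/common/comms_utils.cpython-310-x86_64-linux-gnu.so(+0x32573) [0x14eafdb2b573]
14 /opt/conda/lib/python3.10/site-packages/cugraph/structure/../../../../libcugraph.so(_ZN7cugraph19edge_triangle_countIiiLb1EEENS_15edge_property_tINS_12graph_view_tIT_T0_Lb0EXT1_EvEES4_EERKN4raft8handle_tERKS5_+0xa) [0x14e9e43f994a]
17 /opt/conda/lib/python3.10/site-packages/pylibcugraph/../../../libcugraph_c.so(cugraph_k_truss_subgraph+0xde) [0x14e9581297ae]
19 /opt/conda/bin/python3.10(_PyEval_EvalFrameDefault+0x13ca) [0x560dda68c8fa]
"""

for index, library, symbol in named_frames(SAMPLE):
    print(f"{index:>2}  {library:<16} {symbol}")
```

Applied to the full backtraces, this shows the same chain on both workers: `cugraph_k_truss_subgraph` calls `cugraph::k_truss`, which calls `cugraph::edge_triangle_count`, and the fault occurs inside `ncclGroupEnd`.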
Code of Conduct
- [X] I agree to follow cuGraph's Code of Conduct
- [X] I have searched the open bugs and have found no duplicates for this bug report