
[BUG]: Segmentation fault when initializing the comms

Open jnke2016 opened this issue 5 months ago • 1 comments

Version

24.10

Which installation method(s) does this occur on?

No response

Describe the bug.

A user reported a segmentation fault while initializing the comms with one of our latest nightlies. None of our nightly tests currently reproduces this bug.

Minimum reproducible example

Not reproducible yet

Relevant log output

stcomp>():222] - 2024-08-25 11:15:35,660 - distributed.core - INFO - Starting established connection to tcp://10.174.164.228:43037
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,812 - distributed.worker - INFO - Run out-of-band function '_get_nvml_device_index'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,812 - distributed.worker - INFO - Run out-of-band function '_get_nvml_device_index'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,835 - distributed.worker - INFO - Run out-of-band function '_func_ucp_listener_port'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,835 - distributed.worker - INFO - Run out-of-band function '_func_ucp_listener_port'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,881 - distributed.worker - INFO - Run out-of-band function '_func_init_all'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,881 - distributed.worker - INFO - Run out-of-band function '_func_init_all'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:30,551 - distributed.worker - INFO - Run out-of-band function '_subcomm_init'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:30,585 - distributed.worker - INFO - Run out-of-band function '_subcomm_init'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - [1724584582.860222] [cgjben2-a6wj25tiqsi5u-w-10:8866 :0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - [1724584582.860222] [cgjben2-a6wj25tiqsi5u-w-10:8866 :0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - [cgjben2-a6wj25tiqsi5u-w-10:8866 :0:8866] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - ==== backtrace (tid: 8866) ====
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 0 /mnt/1/python_env/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(ucs_handle_error+0x2fd) [0x7f1228e5a06d]
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 1 /mnt/1/python_env/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2a264) [0x7f1228e5a264]
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2 /mnt/1/python_env/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2a42a) [0x7f1228e5a42a]
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f129dad7980]
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - =================================
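
Since there is no reproducer yet, it may help to triage user-submitted logs like the one above mechanically. The following is a minimal sketch, not part of cuGraph or raft_dask: it assumes every entry starts with the `YY/MM/DD HH:MM:SS WARN python:` prefix seen in this log, and the helper names (`split_entries`, `summarize`) are hypothetical. It lists the out-of-band functions in the order they were first run and picks out any caught signal.

```python
import re

# Assumption: entries begin with the wrapper prefix seen in the log above.
ENTRY_PREFIX = re.compile(r"(?=\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} WARN python:)")
OOB_FUNC = re.compile(r"Run out-of-band function '([^']+)'")
SIGNAL = re.compile(r"Caught signal (\d+) \(([^:]+):")


def split_entries(blob):
    """Split a flattened log into one string per prefixed entry."""
    return [part.strip() for part in ENTRY_PREFIX.split(blob) if part.strip()]


def summarize(blob):
    """Return (out-of-band functions in first-seen order, caught signals)."""
    functions, signals = [], []
    for entry in split_entries(blob):
        func = OOB_FUNC.search(entry)
        if func and func.group(1) not in functions:
            functions.append(func.group(1))
        sig = SIGNAL.search(entry)
        if sig:
            signals.append((int(sig.group(1)), sig.group(2)))
    return functions, signals


# Three entries copied from the log above, joined the way the report pasted them.
sample = (
    "24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,881 "
    "- distributed.worker - INFO - Run out-of-band function '_func_init_all' "
    "24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:30,551 "
    "- distributed.worker - INFO - Run out-of-band function '_subcomm_init' "
    "24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - "
    "[cgjben2-a6wj25tiqsi5u-w-10:8866 :0:8866] Caught signal 11 "
    "(Segmentation fault: address not mapped to object at address (nil))"
)

functions, signals = summarize(sample)
print(functions)
print(signals)
```

In this log the last out-of-band function recorded before the signal is `_subcomm_init`. That only orders the log entries; it does not by itself establish where the fault originates.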

Environment details

No response

Other/Misc.

No response

Code of Conduct

  • [x] I agree to follow cuGraph's Code of Conduct
  • [x] I have searched the open bugs and have found no duplicates for this bug report

jnke2016, Aug 27 '24 18:08