cugraph + pytorch does not always clean up cleanly
Describe the bug
When pytorch is imported before cugraph, we sometimes get a segfault during interpreter cleanup. We should fix that.
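The crash happens during interpreter teardown, which is why import order matters: cleanup hooks and destructors generally run in reverse order of registration, so whichever library is imported first is torn down last. The actual crash here is in C++ static destructors (see the backtrace below), but as a loose analogy, the same LIFO behavior can be observed with Python's `atexit` using only the stdlib, no torch or cugraph needed:

```python
import subprocess
import sys

# Run a child interpreter so we can observe atexit ordering at shutdown.
# atexit runs callbacks in LIFO order: the last-registered hook fires
# first, mirroring how the later-imported library's cleanup may run
# before the earlier-imported one's during teardown.
child = r"""
import atexit
atexit.register(lambda: print("hook A (registered first) runs last"))
atexit.register(lambda: print("hook B (registered second) runs first"))
"""
out = subprocess.run(
    [sys.executable, "-c", child], capture_output=True, text=True
).stdout
print(out, end="")
```

This is only an illustration of teardown ordering, not a claim about the exact mechanism of the segfault.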
Steps/Code to reproduce bug
import torch
import cugraph
import cudf
src_t = torch.IntTensor([0,1,2,3])
dst_t = torch.IntTensor([1,2,3,4])
src_dlpack = torch.utils.dlpack.to_dlpack(src_t)
dst_dlpack = torch.utils.dlpack.to_dlpack(dst_t)
src_ser = cudf.from_dlpack(src_dlpack)
dst_ser = cudf.from_dlpack(dst_dlpack)
df = cudf.DataFrame({'src':src_ser,'dst':dst_ser})
G = cugraph.Graph()
G.from_cudf_edgelist(df, source='src', destination='dst')
print(cugraph.katz_centrality(G))
del G
Run the script in a loop:
#!/bin/bash
for i in {0..10..1}
do
echo "Loop $i"
python3 minimal_example.py
done
Trace
............
Loop 1
katz_centrality vertex
0 0.480096 1
1 0.500100 2
2 0.480096 3
3 0.380076 0
4 0.380076 4
[92d8de5fb2b6:4503 :0:4503] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 4503) ====
0 /opt/conda/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x2fd) [0x7f2b5df69b1d]
1 /opt/conda/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2bd24) [0x7f2b5df69d24]
2 /opt/conda/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2beea) [0x7f2b5df69eea]
3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f2cb2672980]
4 /opt/conda/lib/python3.9/site-packages/cugraph/community/../../../../libcugraph.so(_ZNSt10_HashtableINSt6thread2idESt4pairIKS1_St8weak_ptrIN4raft13interruptibleEEESaIS8_ENSt8__detail10_Select1stESt8equal_toIS1_ESt4hashIS1_ENSA_18_Mod_range_hashingENSA_20_Default_ranged_hashENSA_20_Prime_rehash_policyENSA_17_Hashtable_traitsILb0ELb0ELb1EEEE4findERS3_+0x3c) [0x7f2bd15d62fc]
5 /opt/conda/lib/python3.9/site-packages/cugraph/community/../../../../libcugraph.so(_ZNSt19_Sp_counted_deleterIPN4raft13interruptibleEZNS1_14get_token_implILb1EEESt10shared_ptrIS1_ENSt6thread2idEEUlT_E_SaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x41) [0x7f2bd15d76d1]
6 /opt/conda/lib/python3.9/site-packages/cugraph/community/../../../../libcugraph.so(_ZNSt10shared_ptrIN4raft13interruptibleEED1Ev+0x50) [0x7f2bd15c9620]
7 /opt/conda/lib/python3.9/site-packages/torch/lib/../../../../libstdc++.so.6(+0xaef3c) [0x7f2c7c5bcf3c]
8 /lib/x86_64-linux-gnu/libc.so.6(+0x43031) [0x7f2cb1905031]
9 /lib/x86_64-linux-gnu/libc.so.6(+0x4312a) [0x7f2cb190512a]
10 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7f2cb18e3c8e]
11 python3(+0x1d9a81) [0x5605e5852a81]
=================================
Additional context
See discussion here: https://github.com/dmlc/dgl/issues/4324#issuecomment-1255682096
CC: @rlratzel , @ChuckHastings
Is this a fix for cugraph or pytorch?
Unknown, honestly.
Is this still reproducible given the latest updates?
Any updates on this? We noticed this as well in NeMo Curator here. Happy to help out in any way I can.
Sorry for letting this get pushed aside for so long. I just ran this in my environment and I'm not able to reproduce the segfault using the .py file and shell script above (modified to run for 100 iterations and exit immediately on error), with pytorch 2.0.1 and 24.04 nightlies of cugraph and cudf:
...
Loop 100
/opt/conda/lib/python3.10/site-packages/cugraph/structure/symmetrize.py:93: FutureWarning: Multi is deprecated and the removal of multi edges will no longer be supported from 'symmetrize'. Multi edges will be
removed upon creation of graph instance.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/cugraph/centrality/katz_centrality.py:121: UserWarning: Katz centrality expects the 'store_transposed' flag to be set to 'True' for optimal performance during the graph
creation
warnings.warn(warning_msg, UserWarning)
vertex katz_centrality
0 1 0.480096
1 2 0.500100
2 3 0.480096
3 0 0.380076
4 4 0.380076
user@machine:/# conda list pytorch
# packages in environment at /opt/conda:
#
# Name Version Build Channel
pytorch 2.0.1 py3.10_cuda11.8_cudnn8.7.0_0 pytorch
pytorch-cuda 11.8 h7e8668a_5 pytorch
pytorch-mutex 1.0 cuda pytorch
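For reference, the modified driver loop described above (100 iterations, exit immediately on error) could look like the following sketch. `CMD` defaults to a no-op so the sketch is runnable as-is; set `CMD="python3 minimal_example.py"` to run the actual reproducer:

```shell
#!/bin/bash
# Sketch of the modified reproducer loop: 100 iterations, stop on the
# first failure. CMD defaults to the no-op `true` so this runs standalone.
CMD="${CMD:-true}"
for i in {1..100}; do
    echo "Loop $i"
    $CMD || { echo "Failed on loop $i" >&2; exit 1; }
done
echo "All loops passed"
```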
@alexbarghi-nv are you able to reproduce the segfault?
@ryantwolf , Can you confirm the Pytorch/cugraph version you were seeing for this issue ?
Yeah sure. I installed NeMo-Curator in the Nvidia PyTorch container nvcr.io/nvidia/pytorch:24.03-py3.
Then I am able to reproduce by running
import torch
import cugraph
exit()
in the Python console. Note: this error only occurred after installing NeMo Curator, so a version mismatch somewhere is likely the cause.
Doing pip freeze | grep cugraph gives me this:
cugraph @ file:///rapids/cugraph-24.2.0-cp310-cp310-manylinux_2_35_x86_64.whl#sha256=0e9fe553af604d6386eb741787e4c0025394c29b167cbc2c91935046b8df0a31
cugraph-cu12==24.2.0
cugraph-dgl @ file:///rapids/cugraph_dgl-24.2.0-py3-none-any.whl#sha256=b2250d51c6a26d7e3ec9c67bc464359eec0bd67fece78490b6796bae945acdd4
cugraph-service-client @ file:///rapids/cugraph_service_client-24.2.0-py3-none-any.whl#sha256=5db9834b551245fe0754a46ad400031c50ae5117d5614b8cb138bd4359e98288
cugraph-service-server @ file:///rapids/cugraph_service_server-24.2.0-py3-none-any.whl#sha256=2d7bd8647bb2a8bb4a882d6d4bbc83dc536337900fd020dcd4fc9be25cf5dae6
pylibcugraph @ file:///rapids/pylibcugraph-24.2.0-cp310-cp310-manylinux_2_35_x86_64.whl#sha256=1c515e62130352cb1a0f6cfd45451722024ba503174c6ac81901bf8ec884610c
pylibcugraph-cu12==24.2.0
pylibcugraphops @ file:///rapids/pylibcugraphops-24.2.0-cp310-cp310-linux_x86_64.whl#sha256=4e7f7ced3e276f9c6c11483d3cfa849a247e6d52184c287d9c3bebec9a2dd497
And doing pip freeze | grep torch gives me this:
apex @ file:///opt/pytorch/apex
cudnn @ file:///opt/pytorch/pytorch/third_party/cudnn_frontend
onnx @ file:///opt/pytorch/pytorch/third_party/onnx
pytorch-lightning==2.0.7
pytorch-quantization==2.1.2
pytorch-triton @ file:///tmp/dist/pytorch_triton-2.2.0%2Be28a256d7-cp310-cp310-linux_x86_64.whl#sha256=6af05ee3a40681a8e1cd45f69eb3653eca7c0d03c07d406065f7ae8e1c38f7d6
torch @ file:///opt/transfer/torch-2.3.0a0%2B40ec155e58.nv24.3-cp310-cp310-linux_x86_64.whl#sha256=5230e3e37d2347e82daa6e1ccc2a5eeb9d8d673206f72d811339ad410d1596ad
torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/dist/torch_tensorrt-2.3.0a0-cp310-cp310-linux_x86_64.whl#sha256=905a9df8a2a360ac719ed8ff72c14fb6e3f807e60f5652025ad382ef08d009b7
torchdata @ file:///opt/pytorch/data
torchmetrics==1.3.2
torchtext @ file:///opt/pytorch/text
torchvision @ file:///opt/pytorch/vision
Let me know if you want me to check the versions of other packages.
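To make cross-checking environments like this easier, here is a small stdlib-only sketch that prints the installed versions of the packages discussed in this thread (the package list is just an example; adjust it as needed):

```python
import importlib.metadata as md

# Print installed versions of the packages relevant to this report.
# Missing packages are reported rather than raising, so the sketch
# runs in any environment.
packages = ["torch", "cugraph", "cudf", "pylibcugraph"]
for name in packages:
    try:
        print(f"{name}: {md.version(name)}")
    except md.PackageNotFoundError:
        print(f"{name}: not installed")
```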