cugraph + pytorch does not always clean up cleanly

Open VibhuJawa opened this issue 3 years ago • 7 comments

Describe the bug

When pytorch is imported before cugraph, the process sometimes segfaults during cleanup at exit. We should fix that.

Steps/Code to reproduce bug

import torch
import cugraph
import cudf


src_t = torch.IntTensor([0,1,2,3])
dst_t = torch.IntTensor([1,2,3,4])

src_dlpack = torch.utils.dlpack.to_dlpack(src_t)
dst_dlpack = torch.utils.dlpack.to_dlpack(dst_t)

src_ser = cudf.from_dlpack(src_dlpack)
dst_ser = cudf.from_dlpack(dst_dlpack)

df = cudf.DataFrame({'src':src_ser,'dst':dst_ser})
G = cugraph.Graph()
G.from_cudf_edgelist(df, source='src', destination='dst')

print(cugraph.katz_centrality(G))

del G

Save the script above as minimal_example.py and run it in a loop:

#!/bin/bash
for i in {0..10..1}
do
  echo "Loop $i"
  python3 minimal_example.py
done

Trace

............
Loop 1
   katz_centrality  vertex
0         0.480096       1
1         0.500100       2
2         0.480096       3
3         0.380076       0
4         0.380076       4
[92d8de5fb2b6:4503 :0:4503] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:   4503) ====
 0  /opt/conda/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x2fd) [0x7f2b5df69b1d]
 1  /opt/conda/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2bd24) [0x7f2b5df69d24]
 2  /opt/conda/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2beea) [0x7f2b5df69eea]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f2cb2672980]
 4  /opt/conda/lib/python3.9/site-packages/cugraph/community/../../../../libcugraph.so(_ZNSt10_HashtableINSt6thread2idESt4pairIKS1_St8weak_ptrIN4raft13interruptibleEEESaIS8_ENSt8__detail10_Select1stESt8equal_toIS1_ESt4hashIS1_ENSA_18_Mod_range_hashingENSA_20_Default_ranged_hashENSA_20_Prime_rehash_policyENSA_17_Hashtable_traitsILb0ELb0ELb1EEEE4findERS3_+0x3c) [0x7f2bd15d62fc]
 5  /opt/conda/lib/python3.9/site-packages/cugraph/community/../../../../libcugraph.so(_ZNSt19_Sp_counted_deleterIPN4raft13interruptibleEZNS1_14get_token_implILb1EEESt10shared_ptrIS1_ENSt6thread2idEEUlT_E_SaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x41) [0x7f2bd15d76d1]
 6  /opt/conda/lib/python3.9/site-packages/cugraph/community/../../../../libcugraph.so(_ZNSt10shared_ptrIN4raft13interruptibleEED1Ev+0x50) [0x7f2bd15c9620]
 7  /opt/conda/lib/python3.9/site-packages/torch/lib/../../../../libstdc++.so.6(+0xaef3c) [0x7f2c7c5bcf3c]
 8  /lib/x86_64-linux-gnu/libc.so.6(+0x43031) [0x7f2cb1905031]
 9  /lib/x86_64-linux-gnu/libc.so.6(+0x4312a) [0x7f2cb190512a]
10  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7f2cb18e3c8e]
11  python3(+0x1d9a81) [0x5605e5852a81]
=================================

Additional context

See discussion here: https://github.com/dmlc/dgl/issues/4324#issuecomment-1255682096

CC: @rlratzel, @ChuckHastings

VibhuJawa avatar Sep 23 '22 01:09 VibhuJawa

Is this a fix for cugraph or pytorch?

kingmesal avatar Mar 08 '23 18:03 kingmesal

> Is this a fix for cugraph or pytorch?

Unknown, honestly.

VibhuJawa avatar Mar 08 '23 19:03 VibhuJawa

Is this still reproducible with the latest updates?

kingmesal avatar Mar 08 '23 19:03 kingmesal

Any updates on this? We noticed this as well in NeMo Curator. Happy to help out in any way I can.

ryantwolf avatar Apr 03 '24 19:04 ryantwolf

Sorry for letting this get pushed aside for so long. I just ran this in my environment and I'm not able to reproduce the segfault using the .py file and shell script above (modified to run for 100 iterations and exit immediately on error), with pytorch 2.0.1 and 24.04 nightlies of cugraph and cudf:

...
Loop 100
/opt/conda/lib/python3.10/site-packages/cugraph/structure/symmetrize.py:93: FutureWarning: Multi is deprecated and the removal of multi edges will no longer be supported from 'symmetrize'. Multi edges will be removed upon creation of graph instance.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/cugraph/centrality/katz_centrality.py:121: UserWarning: Katz centrality expects the 'store_transposed' flag to be set to 'True' for optimal performance during the graph creation
  warnings.warn(warning_msg, UserWarning)
   vertex  katz_centrality
0       1         0.480096
1       2         0.500100
2       3         0.480096
3       0         0.380076
4       4         0.380076
user@machine:/# conda list pytorch
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
pytorch                   2.0.1           py3.10_cuda11.8_cudnn8.7.0_0    pytorch
pytorch-cuda              11.8                 h7e8668a_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
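
For anyone who wants to repeat this check, the modified loop described above (100 iterations, stop at the first error) could be sketched in Python roughly as follows. This is not the exact script that was run; the helper names `describe_exit` and `run_until_failure` are made up for illustration, and the sketch assumes the reproducer is saved as minimal_example.py.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Return a readable description of a child's exit status."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exit code {returncode}"


def run_until_failure(cmd, iterations=100):
    """Run cmd repeatedly; return (iteration, returncode) of the first failure, or None."""
    for i in range(1, iterations + 1):
        print(f"Loop {i}", flush=True)
        returncode = subprocess.run(cmd).returncode
        if returncode != 0:
            print(f"Failed on loop {i}: {describe_exit(returncode)}")
            return i, returncode
    return None


# Example usage (assumes minimal_example.py is in the current directory):
# run_until_failure([sys.executable, "minimal_example.py"], iterations=100)
```

Each iteration runs in a fresh interpreter, since the crash is reported during process cleanup rather than while the script body runs.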

@alexbarghi-nv are you able to reproduce the segfault?

rlratzel avatar Apr 03 '24 20:04 rlratzel

@ryantwolf, can you confirm the PyTorch and cugraph versions with which you saw this issue?

VibhuJawa avatar Apr 03 '24 20:04 VibhuJawa

Yeah sure. I installed NeMo-Curator in the Nvidia PyTorch container nvcr.io/nvidia/pytorch:24.03-py3. I am then able to reproduce it by running

import torch
import cugraph
exit()

in the Python console. Note: This error only occurred after installing NeMo Curator, so a version mismatch among the installed packages is likely the cause.
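
To check whether import order matters in a given environment, one option is to run each order in a fresh interpreter and compare exit statuses. The sketch below is only an illustration: the function names are hypothetical, and it assumes a crash at shutdown shows up as a nonzero (or, on POSIX, negative) return code.

```python
import subprocess
import sys


def exit_status_for_imports(modules):
    """Import the modules in order in a fresh interpreter and return its exit status."""
    code = "; ".join(f"import {name}" for name in modules)
    # A fresh process is used because the reported crash happens during cleanup at exit.
    return subprocess.run([sys.executable, "-c", code]).returncode


def compare_import_orders(first, second):
    """Return the exit status for both import orders, keyed by the order used."""
    return {
        (first, second): exit_status_for_imports([first, second]),
        (second, first): exit_status_for_imports([second, first]),
    }


# Example usage (requires torch and cugraph to be installed):
# print(compare_import_orders("torch", "cugraph"))
```

Since the segfault is described as intermittent, a single run per order may not be conclusive; repeating the comparison several times would give a better picture.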

Doing pip freeze | grep cugraph gives me this:

cugraph @ file:///rapids/cugraph-24.2.0-cp310-cp310-manylinux_2_35_x86_64.whl#sha256=0e9fe553af604d6386eb741787e4c0025394c29b167cbc2c91935046b8df0a31
cugraph-cu12==24.2.0
cugraph-dgl @ file:///rapids/cugraph_dgl-24.2.0-py3-none-any.whl#sha256=b2250d51c6a26d7e3ec9c67bc464359eec0bd67fece78490b6796bae945acdd4
cugraph-service-client @ file:///rapids/cugraph_service_client-24.2.0-py3-none-any.whl#sha256=5db9834b551245fe0754a46ad400031c50ae5117d5614b8cb138bd4359e98288
cugraph-service-server @ file:///rapids/cugraph_service_server-24.2.0-py3-none-any.whl#sha256=2d7bd8647bb2a8bb4a882d6d4bbc83dc536337900fd020dcd4fc9be25cf5dae6
pylibcugraph @ file:///rapids/pylibcugraph-24.2.0-cp310-cp310-manylinux_2_35_x86_64.whl#sha256=1c515e62130352cb1a0f6cfd45451722024ba503174c6ac81901bf8ec884610c
pylibcugraph-cu12==24.2.0
pylibcugraphops @ file:///rapids/pylibcugraphops-24.2.0-cp310-cp310-linux_x86_64.whl#sha256=4e7f7ced3e276f9c6c11483d3cfa849a247e6d52184c287d9c3bebec9a2dd497

And doing pip freeze | grep torch gives me this:

apex @ file:///opt/pytorch/apex
cudnn @ file:///opt/pytorch/pytorch/third_party/cudnn_frontend
onnx @ file:///opt/pytorch/pytorch/third_party/onnx
pytorch-lightning==2.0.7
pytorch-quantization==2.1.2
pytorch-triton @ file:///tmp/dist/pytorch_triton-2.2.0%2Be28a256d7-cp310-cp310-linux_x86_64.whl#sha256=6af05ee3a40681a8e1cd45f69eb3653eca7c0d03c07d406065f7ae8e1c38f7d6
torch @ file:///opt/transfer/torch-2.3.0a0%2B40ec155e58.nv24.3-cp310-cp310-linux_x86_64.whl#sha256=5230e3e37d2347e82daa6e1ccc2a5eeb9d8d673206f72d811339ad410d1596ad
torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/dist/torch_tensorrt-2.3.0a0-cp310-cp310-linux_x86_64.whl#sha256=905a9df8a2a360ac719ed8ff72c14fb6e3f807e60f5652025ad382ef08d009b7
torchdata @ file:///opt/pytorch/data
torchmetrics==1.3.2
torchtext @ file:///opt/pytorch/text
torchvision @ file:///opt/pytorch/vision

Let me know if you want me to check the versions of other packages.
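
If it helps to collect versions of other packages in one go, something like the following sketch could stand in for the `pip freeze | grep ...` calls above. It uses the standard-library `importlib.metadata`; the function name `installed_versions` is made up here, and unlike `grep` on `pip freeze` output it matches on distribution names only, not on install paths.

```python
from importlib import metadata


def installed_versions(*substrings):
    """Return sorted 'name==version' strings for distributions whose name contains any substring."""
    needles = [s.lower() for s in substrings]
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with missing name metadata.
        if name and any(needle in name.lower() for needle in needles):
            found.add(f"{name}=={dist.version}")
    return sorted(found)


# Example usage:
# print("\n".join(installed_versions("cugraph", "torch", "cudf", "raft", "ucx")))
```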

ryantwolf avatar Apr 03 '24 21:04 ryantwolf