[GraphBolt][Bug] SEGV when preprocessing `OnDiskDataset`
🐛 Bug
To Reproduce
When trying to construct an `OnDiskDataset` with the UK-Union graph, I get a segmentation fault during preprocessing. The error message is either `munmap_chunk(): invalid pointer` or `double free or corruption (out)`. I have traced the error to the following line:
https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L97
Steps to reproduce the behavior:
Execute the code:
```python
import dgl.graphbolt as gb

dataset = gb.OnDiskDataset("path/to/dataset")
```
Expected behavior
Preprocessing completes without a segmentation fault.
Environment
- DGL Version (e.g., 1.0): 2.1.0+cu121
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.1.2+cu121
- OS (e.g., Linux): Linux
- How you installed DGL (`conda`, `pip`, source): pip
- Build command you used (if compiling from source):
- Python version: 3.11
- CUDA/cuDNN version (if applicable):
- GPU models and configuration (e.g. V100):
- Any other relevant information:
Additional context
Could you make sure the `num_nodes` you specified exactly matches the node IDs read from the edge file?
https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L92C21-L97
Yes, the node IDs in the edge file are consecutive from 0 to `num_nodes - 1`. I can also construct the COO and CSC matrices using `scipy.sparse`.
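For reference, a minimal sketch of the checks I ran (the path and the `(num_edges, 2)` layout of `edges.npy` are assumptions on my side):

```python
import numpy as np
import scipy.sparse as sp

# Assumed layout: edges.npy holds a (num_edges, 2) array of (src, dst) IDs.
edges = np.load("path/to/edges.npy", mmap_mode="r")
num_nodes = int(edges.max()) + 1

# Node IDs are consecutive from 0 to num_nodes - 1.
assert edges.min() == 0
assert np.unique(edges).size == num_nodes

# Building the COO matrix and converting to CSC works with scipy.sparse.
vals = np.ones(edges.shape[0], dtype=np.int8)
coo = sp.coo_matrix((vals, (edges[:, 0], edges[:, 1])),
                    shape=(num_nodes, num_nodes))
csc = coo.tocsc()
```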
How large is your dataset (`num_nodes`, `num_edges`)? And could you try commenting out the line below? https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L96C13-L96C23
`num_nodes = 131814559` and `num_edges = 5507679822`. Commenting it out doesn't help.
Oh, it's a large graph with more than 5B edges. What instance are you running this on, and how much RAM does it have?
I'm running on an Aliyun server with over 700 GB of RAM.
@yxy235 could you try to reproduce this error on `r6i.metal` with a random graph?
OK
I have tried to reproduce this, but I didn't get any errors with a random graph of the same size.
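For reference, a sketch of how I generated the random edge list (output path, seed, and chunk size are arbitrary choices on my side):

```python
import numpy as np
from numpy.lib.format import open_memmap

num_nodes = 131_814_559
num_edges = 5_507_679_822

# Write the random edges in chunks so all ~5.5B of them never
# have to sit in memory at once.
out = open_memmap("random_edges.npy", mode="w+",
                  dtype=np.int32, shape=(num_edges, 2))
rng = np.random.default_rng(0)
chunk = 100_000_000
for start in range(0, num_edges, chunk):
    stop = min(start + chunk, num_edges)
    out[start:stop] = rng.integers(0, num_nodes,
                                   size=(stop - start, 2), dtype=np.int32)
out.flush()
```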
@yxy235 Could you try using this data? https://mega.nz/folder/OWBwEQQL#nfkbhC35N4aLavIpCS2Cig (the sha256 is of the decompressed `edges.npy`, which is about 42GB)
OK, I have reproduced the error; I'm trying to debug it now.
@easypickings Could you try changing the dtype of your `edges.npy` to `int64`? I think that will resolve the problem. It is caused by the edge count exceeding the `int32` range, which causes an error while constructing the `SparseMatrix` (converting from COO to CSC). The dtype change is a temporary workaround. FYI, it may double memory consumption.
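A minimal sketch of this workaround (file names and chunk size are assumptions; as noted above, the `int64` copy roughly doubles the memory footprint):

```python
import numpy as np
from numpy.lib.format import open_memmap

# Rewrite the edge list with int64 IDs, chunk by chunk, so both
# copies never have to be fully resident in memory.
src = np.load("edges.npy", mmap_mode="r")
dst = open_memmap("edges_int64.npy", mode="w+",
                  dtype=np.int64, shape=src.shape)
chunk = 100_000_000
for start in range(0, src.shape[0], chunk):
    stop = min(start + chunk, src.shape[0])
    dst[start:stop] = src[start:stop]  # upcast to int64 on assignment
dst.flush()
```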
TBD: the functions used in https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L749 should be checked, especially https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L538. We should determine the dtype of the CSR from `coo.row->shape[0]` rather than from `coo.row->dtype`: if the shape is bigger than `MAX_INT32`, we should use `int64` no matter whether `coo.row->dtype` is `int32` or `int64`.
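To illustrate the overflow in plain Python (the edge count is the one from this issue): the last index-pointer entry of a CSC/CSR matrix must equal the number of edges, which no longer fits in `int32` here.

```python
import numpy as np

num_edges = 5_507_679_822
print(num_edges > np.iinfo(np.int32).max)  # True: exceeds 2**31 - 1

# A wrapping cast to int32 silently yields a bogus index pointer,
# which is how out-of-range offsets end up corrupting memory.
print(np.int64(num_edges).astype(np.int32))  # 1212712526, not 5507679822
```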
@Skeleton003 please help work on this.