[GraphBolt][Bug] SEGV when preprocessing `OnDiskDataset`

Open easypickings opened this issue 10 months ago • 14 comments

🐛 Bug

To Reproduce

When trying to construct an `OnDiskDataset` with the UK-Union graph, I get a segmentation fault during preprocessing. The error message is either `munmap_chunk(): invalid pointer` or `double free or corruption (out)`. I traced the error to the following line:

https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L97

Steps to reproduce the behavior:

execute the code:

import dgl.graphbolt as gb
dataset = gb.OnDiskDataset("path/to/dataset")

Expected behavior

Environment

  • DGL Version (e.g., 1.0): 2.1.0+cu121
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.1.2+cu121
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.11
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

easypickings avatar Apr 27 '24 06:04 easypickings

Could you make sure the num_nodes specified exactly matches the node IDs read from the edge file? https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L92C21-L97

Rhett-Ying avatar Apr 28 '24 01:04 Rhett-Ying

Yes, the node IDs in the edge file are consecutive from 0 to num_nodes - 1. Also, I can construct the COO and CSC matrices using scipy.sparse.
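A check of this kind can be sketched with NumPy. This is only an illustration, not code from DGL: the helper name `check_node_ids` is hypothetical, and it assumes the edge array has shape (2, num_edges), which may differ from the actual layout of the edge file.

```python
import numpy as np


def check_node_ids(edges: np.ndarray, num_nodes: int) -> None:
    """Raise ValueError if any node ID falls outside [0, num_nodes).

    Assumption: `edges` has shape (2, num_edges), with source IDs in
    row 0 and destination IDs in row 1.
    """
    if edges.ndim != 2 or edges.shape[0] != 2:
        raise ValueError("expected an array of shape (2, num_edges)")
    if edges.size == 0:
        return  # an empty edge list trivially satisfies the bound
    smallest, largest = int(edges.min()), int(edges.max())
    if smallest < 0 or largest >= num_nodes:
        raise ValueError(
            f"node IDs span [{smallest}, {largest}], "
            f"expected them within [0, {num_nodes - 1}]"
        )
```

For a file on disk, the array could be opened with `np.load(path, mmap_mode="r")` so that the scan does not need to hold the whole file in memory.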

easypickings avatar Apr 28 '24 04:04 easypickings

How large is your dataset? What are num_nodes and num_edges?

And could you try commenting out the line below? https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L96C13-L96C23

Rhett-Ying avatar Apr 28 '24 04:04 Rhett-Ying

num_nodes = 131814559 and num_edges = 5507679822. Commenting out that line does not help.

easypickings avatar Apr 28 '24 04:04 easypickings

Oh, it's a large graph with more than 5B edges. What instance are you running this on? How much RAM does it have?

Rhett-Ying avatar Apr 28 '24 04:04 Rhett-Ying

I'm running on an aliyun server with over 700GB RAM

easypickings avatar Apr 28 '24 07:04 easypickings

@yxy235 could you try to reproduce this error on r6i.metal with a random graph?

Rhett-Ying avatar Apr 28 '24 07:04 Rhett-Ying

OK

yxy235 avatar Apr 28 '24 07:04 yxy235

I tried to reproduce this, but I didn't get any errors with a random graph of the same size.

yxy235 avatar Apr 29 '24 07:04 yxy235

@yxy235 Could you try using this data? https://mega.nz/folder/OWBwEQQL#nfkbhC35N4aLavIpCS2Cig (the sha256 is of the decompressed edges.npy, which is about 42GB)

easypickings avatar Apr 29 '24 17:04 easypickings

OK. I have reproduced the error, I'm trying to debug now.

yxy235 avatar Apr 30 '24 06:04 yxy235

@easypickings Could you try changing the dtype of your edges.npy to int64? I think that will resolve the problem. It is caused by the number of edges exceeding the int32 range, which triggers an error when the SparseMatrix is converted from COO to CSC. The dtype change is a temporary workaround. FYI, it may double memory consumption.
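The workaround can be sketched as follows. This is a minimal illustration, not part of DGL: the helper names `exceeds_int32` and `convert_edges_to_int64`, the file paths, and the chunk size are hypothetical, and the code assumes the edge array is 2-D with shape (2, num_edges). It converts in chunks through memory-mapped files so that the int32 and int64 copies do not both have to sit in RAM.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2**31 - 1


def exceeds_int32(num_edges: int) -> bool:
    """True when an edge count cannot be represented as an int32."""
    return num_edges > INT32_MAX


def convert_edges_to_int64(src_path: str, dst_path: str,
                           chunk_size: int = 1 << 24) -> None:
    """Write an int64 copy of a .npy edge file, one chunk at a time.

    Assumption: the source array has shape (2, num_edges).
    """
    src = np.load(src_path, mmap_mode="r")  # read-only memory map
    dst = np.lib.format.open_memmap(
        dst_path, mode="w+", dtype=np.int64, shape=src.shape
    )
    num_edges = src.shape[1]
    for start in range(0, num_edges, chunk_size):
        stop = min(start + chunk_size, num_edges)
        # Assignment casts each int32 chunk to int64 as it is copied.
        dst[:, start:stop] = src[:, start:stop]
    dst.flush()
    # Drop the memory maps so the underlying files are released.
    del src, dst
```

With the numbers reported in this thread, `exceeds_int32(5507679822)` is true, which is consistent with the diagnosis. The int64 file is twice the size of the int32 one, matching the note about doubled memory consumption.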

yxy235 avatar May 06 '24 07:05 yxy235

TBD: The functions used in https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L749 should be checked, especially https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L538. We should determine the dtype of the CSR from coo.row->shape[0] rather than coo.row->dtype. If the shape exceeds MAX_INT32, we should use int64 regardless of whether coo.row->dtype is int32 or int64.
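The proposed rule can be illustrated in plain Python. This is a sketch of the selection logic only, not the C++ code in spmat_op_impl_coo.cc; the function names are hypothetical. The reasoning is that the last entry of a CSR index pointer array equals the number of edges, so that array needs a type wide enough for the edge count even when every node ID fits in int32.

```python
INT32_MAX = 2**31 - 1


def choose_csr_index_bits(num_edges: int, coo_row_bits: int) -> int:
    """Pick the CSR index width from the edge count, not only the COO dtype."""
    if coo_row_bits not in (32, 64):
        raise ValueError("index width must be 32 or 64 bits")
    if num_edges > INT32_MAX:
        return 64  # the edge count itself does not fit in int32
    return coo_row_bits  # otherwise keep the width of the input


def wrap_to_int32(value: int) -> int:
    """Value an int32 would hold after two's-complement wraparound."""
    return (value + 2**31) % 2**32 - 2**31
```

For the graph in this thread, `choose_csr_index_bits(5507679822, 32)` returns 64, while `wrap_to_int32(5507679822)` differs from the true edge count, showing the kind of silently wrong offset an int32 index pointer could end up holding.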

yxy235 avatar May 16 '24 04:05 yxy235

@Skeleton003 please help work on this.

Rhett-Ying avatar Jun 11 '24 08:06 Rhett-Ying