
OnDiskDataset preprocessing crashes with graphs of more than 2B edges

Open byingyang opened this issue 11 months ago • 5 comments

🐛 Bug

When I created all the edge files for an OnDiskDataset and cast all the src and dst IDs to int32 (since we do not have billions of nodes yet), the preprocessing stage crashed with an int32 overflow error:

The on-disk dataset is re-preprocessing, so the existing preprocessed dataset has been removed.
Start to preprocess the on-disk dataset.

RuntimeError: [20:25:19] /opt/dgl/src/array/cpu/spmat_op_impl_coo.cc:749: Check failed: (coo.row->shape[0]) <= 0x7FFFFFFFL (2283022784 vs. 2147483647) : int32 overflow for argument coo.row->shape[0].
Stack trace:
  [bt] (0) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(+0x61fbc4) [0x7f34bc81fbc4]
  [bt] (1) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::CSRMatrix dgl::aten::impl::COOToCSR<(DGLDeviceType)1, int>(dgl::aten::COOMatrix)+0x121) [0x7f34bc82ac81]
  [bt] (2) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::COOToCSR(dgl::aten::COOMatrix)+0x451) [0x7f34bc5b43a1]
  [bt] (3) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::COOToCSC(std::shared_ptr<dgl::sparse::COO> const&)+0x17d) [0x7f3394a77f2d]
  [bt] (4) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::_CreateCSC()+0x14d) [0x7f3394a7c14d]
  [bt] (5) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::CSCPtr()+0x5d) [0x7f3394a7c24d]
  [bt] (6) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::CSCTensors()+0x13) [0x7f3394a7ce63]
  [bt] (7) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(std::_Function_handler<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::class_<dgl::sparse::SparseMatrix>::defineMethod<torch::detail::WrapMethod<std::tuple<at::Tensor, at::Tensor, std::optional<at::Tensor> > (dgl::sparse::SparseMatrix::*)()> >(std::string, torch::detail::WrapMethod<std::tuple<at::Tensor, at::Tensor, std::optional<at::Tensor> > (dgl::sparse::SparseMatrix::*)()>, std::string, std::initializer_list<torch::arg>)::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&)+0x82) [0x7f3394a65802]
  [bt] (8) /databricks/python/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0xa80f7e) [0x7f357f678f7e]

----> 2 dataset = gb.OnDiskDataset(base_dir, force_preprocess=True).load()
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:688, in OnDiskDataset.__init__(self, path, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
    678 def __init__(
    679     self,
    680     path: str,
   (...)
    685     # Always call the preprocess function first. If already preprocessed,
    686     # the function will return the original path directly.
    687     self._dataset_dir = path
--> 688     yaml_path = preprocess_ondisk_dataset(
    689         path,
    690         include_original_edge_id,
    691         force_preprocess,
    692         auto_cast_to_optimal_dtype,
    693     )
    694     with open(yaml_path) as f:
    695         self._yaml_data = yaml.load(f, Loader=yaml.loader.SafeLoader)
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:407, in preprocess_ondisk_dataset(dataset_dir, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
    404 if "graph" not in input_config:
    405     raise RuntimeError("Invalid config: does not contain graph field.")
--> 407 sampling_graph = _graph_data_to_fused_csc_sampling_graph(
    408     dataset_dir,
    409     input_config["graph"],
    410     include_original_edge_id,
    411     auto_cast_to_optimal_dtype,
    412 )
    414 # 3. Record value of include_original_edge_id.
    415 output_config["include_original_edge_id"] = include_original_edge_id
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:166, in _graph_data_to_fused_csc_sampling_graph(dataset_dir, graph_data, include_original_edge_id, auto_cast_to_optimal_dtype)
    161 sparse_matrix = spmatrix(
    162     indices=torch.stack((coo_src, coo_dst), dim=0),
    163     shape=(total_num_nodes, total_num_nodes),
    164 )
    165 del coo_src, coo_dst
--> 166 indptr, indices, edge_ids = sparse_matrix.csc()
    167 del sparse_matrix
    169 if auto_cast_to_optimal_dtype:
File /databricks/python/lib/python3.11/site-packages/dgl/sparse/sparse_matrix.py:201, in SparseMatrix.csc(self)
    172 def csc(self) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    173     r"""Returns the compressed sparse column (CSC) representation of the
    174     sparse matrix.
    175 
   (...)
    199     (tensor([0, 0, 0, 1, 2, 3]), tensor([1, 1, 2]), tensor([0, 2, 1]))
    200     """
--> 201     return self.c_sparse_matrix.csc()
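The check that fires here lives in DGL's COO→CSR conversion: when the indices are int32, the number of edges itself must fit in int32, regardless of how small the node IDs are. A quick sanity check with the numbers from the error message (a restatement in Python, not DGL code):

```python
# The failing check from spmat_op_impl_coo.cc:749, restated: with int32
# indices, the edge count (coo.row->shape[0]) must itself fit in int32.
INT32_MAX = 0x7FFFFFFF            # 2_147_483_647
num_edges = 2_283_022_784         # edge count from the error message
print(num_edges <= INT32_MAX)     # False -> "int32 overflow" RuntimeError
```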

To Reproduce

Steps to reproduce the behavior:

  1. Create an OnDiskDataset with edges in .npy files whose src/dst IDs are all cast to int32, and with a number of edges greater than the int32 maximum (2^31 - 1).
  2. Load the dataset and preprocess it.
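Step 1 can be sketched with numpy. The file names and the tiny edge count here are placeholders (materializing more than 2^31 edges would take roughly 18 GB per array pair), but the int32 cast is the part that matters:

```python
import numpy as np

# Hypothetical file names; in the real dataset num_edges exceeds 2**31 - 1,
# which is what triggers the overflow during preprocessing.
num_edges = 1_000  # stand-in for the ~2.28B edges in the actual report
src = np.random.randint(0, 100, size=num_edges).astype(np.int32)
dst = np.random.randint(0, 100, size=num_edges).astype(np.int32)
np.save("edges_src.npy", src)
np.save("edges_dst.npy", dst)
```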

Expected behavior

To work around this issue, I have to double my CPU memory usage by leaving the IDs as int64. That negates the memory savings we expected when we switched to GraphBolt.
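Rough arithmetic behind the memory concern, using the edge count from the error message (src and dst arrays at 4 bytes per ID versus 8):

```python
num_edges = 2_283_022_784
int32_bytes = num_edges * 2 * 4   # src + dst arrays, 4 bytes per ID
int64_bytes = num_edges * 2 * 8   # same arrays widened to int64
print(int32_bytes / 2**30)        # ~17 GiB
print(int64_bytes / 2**30)        # ~34 GiB, exactly double
```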

Environment

  • DGL Version (e.g., 1.0):
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

byingyang avatar Dec 31 '24 20:12 byingyang

This is not expected; we are successfully using int32 for the ogbn-papers100M dataset, which has over 3B edges. @Rhett-Ying what do you think is the core issue here?

mfbalin avatar Jan 25 '25 06:01 mfbalin

Since there is a preprocessing step, cast your data to int64, then let the preprocessing run. After preprocessing, when you load the gb.CSCSamplingGraph, the dtype of the edges should be back to int32.
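The suggested workaround, sketched with numpy on a hypothetical edge file (the tiny array stands in for billions of IDs): widen the on-disk arrays to int64 before preprocessing, then rely on the loaded graph coming back with int32 IDs.

```python
import numpy as np

# Hypothetical edge file; the real one holds billions of int32 IDs.
np.save("src_ids.npy", np.arange(10, dtype=np.int32))

# Workaround: widen to int64 so DGL's COO->CSC preprocessing can index
# more than 2**31 - 1 edges without tripping the overflow check.
arr = np.load("src_ids.npy")
np.save("src_ids.npy", arr.astype(np.int64))
```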

mfbalin avatar Jan 25 '25 06:01 mfbalin

The preprocessing steps use DGL underneath, which does not support mixed dtypes for the indptr and indices tensors.
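The mixed-dtype point can be made concrete: in a CSC representation, indptr entries index into indices, so they run up to the edge count, while indices hold node IDs bounded by the node count. With the numbers from this issue, node IDs fit int32 but indptr values do not; that is exactly the combination plain DGL cannot express. A numpy sketch:

```python
import numpy as np

# indptr must hold values in [0, num_edges]; indices hold node IDs in
# [0, num_nodes). Edge count taken from the error message above.
num_nodes = 100_000_000           # hypothetical; "not billions of nodes yet"
num_edges = 2_283_022_784

assert num_nodes - 1 <= np.iinfo(np.int32).max   # node IDs fit int32
assert num_edges > np.iinfo(np.int32).max        # indptr needs int64
```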

mfbalin avatar Jan 25 '25 06:01 mfbalin

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] avatar Feb 25 '25 01:02 github-actions[bot]

I encountered a similar issue with mixed data types — specifically, using int32 for node IDs and int64 for edge IDs. I chose this approach due to scalability concerns, as the data wouldn't fit into 500GB of RAM otherwise.

I only found out later that DGL GraphBolt already supports this use case.

For anyone else struggling with mixed data types, please refer to the following release note: https://www.dgl.ai/release/2024/03/06/release.html

Hope this helps someone avoid the same confusion I had.

BJohn-dev avatar Jul 10 '25 02:07 BJohn-dev