OnDiskDataset Preprocessing crashes with graph more than 2B edges
🐛 Bug
When I created all the edge files for an OnDiskDataset and cast all the src and dst node IDs to int32 (since we do not have billions of nodes yet), the preprocessing stage crashed with an int32 overflow error:
The on-disk dataset is re-preprocessing, so the existing preprocessed dataset has been removed.
Start to preprocess the on-disk dataset.
RuntimeError: [20:25:19] /opt/dgl/src/array/cpu/spmat_op_impl_coo.cc:749: Check failed: (coo.row->shape[0]) <= 0x7FFFFFFFL (2283022784 vs. 2147483647) : int32 overflow for argument coo.row->shape[0].
Stack trace:
[bt] (0) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(+0x61fbc4) [0x7f34bc81fbc4]
[bt] (1) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::CSRMatrix dgl::aten::impl::COOToCSR<(DGLDeviceType)1, int>(dgl::aten::COOMatrix)+0x121) [0x7f34bc82ac81]
[bt] (2) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::COOToCSR(dgl::aten::COOMatrix)+0x451) [0x7f34bc5b43a1]
[bt] (3) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::COOToCSC(std::shared_ptr<dgl::sparse::COO> const&)+0x17d) [0x7f3394a77f2d]
[bt] (4) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::_CreateCSC()+0x14d) [0x7f3394a7c14d]
[bt] (5) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::CSCPtr()+0x5d) [0x7f3394a7c24d]
[bt] (6) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::CSCTensors()+0x13) [0x7f3394a7ce63]
[bt] (7) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(std::_Function_handler<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::class_<dgl::sparse::SparseMatrix>::defineMethod<torch::detail::WrapMethod<std::tuple<at::Tensor, at::Tensor, std::optional<at::Tensor> > (dgl::sparse::SparseMatrix::*)()> >(std::string, torch::detail::WrapMethod<std::tuple<at::Tensor, at::Tensor, std::optional<at::Tensor> > (dgl::sparse::SparseMatrix::*)()>, std::string, std::initializer_list<torch::arg>)::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&)+0x82) [0x7f3394a65802]
[bt] (8) /databricks/python/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0xa80f7e) [0x7f357f678f7e]
----> 2 dataset = gb.OnDiskDataset(base_dir, force_preprocess=True).load()
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:688, in OnDiskDataset.__init__(self, path, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
678 def __init__(
679 self,
680 path: str,
(...)
685 # Always call the preprocess function first. If already preprocessed,
686 # the function will return the original path directly.
687 self._dataset_dir = path
--> 688 yaml_path = preprocess_ondisk_dataset(
689 path,
690 include_original_edge_id,
691 force_preprocess,
692 auto_cast_to_optimal_dtype,
693 )
694 with open(yaml_path) as f:
695 self._yaml_data = yaml.load(f, Loader=yaml.loader.SafeLoader)
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:407, in preprocess_ondisk_dataset(dataset_dir, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
404 if "graph" not in input_config:
405 raise RuntimeError("Invalid config: does not contain graph field.")
--> 407 sampling_graph = _graph_data_to_fused_csc_sampling_graph(
408 dataset_dir,
409 input_config["graph"],
410 include_original_edge_id,
411 auto_cast_to_optimal_dtype,
412 )
414 # 3. Record value of include_original_edge_id.
415 output_config["include_original_edge_id"] = include_original_edge_id
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:166, in _graph_data_to_fused_csc_sampling_graph(dataset_dir, graph_data, include_original_edge_id, auto_cast_to_optimal_dtype)
161 sparse_matrix = spmatrix(
162 indices=torch.stack((coo_src, coo_dst), dim=0),
163 shape=(total_num_nodes, total_num_nodes),
164 )
165 del coo_src, coo_dst
--> 166 indptr, indices, edge_ids = sparse_matrix.csc()
167 del sparse_matrix
169 if auto_cast_to_optimal_dtype:
File /databricks/python/lib/python3.11/site-packages/dgl/sparse/sparse_matrix.py:201, in SparseMatrix.csc(self)
172 def csc(self) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
173 r"""Returns the compressed sparse column (CSC) representation of the
174 sparse matrix.
175
(...)
199 (tensor([0, 0, 0, 1, 2, 3]), tensor([1, 1, 2]), tensor([0, 2, 1]))
200 """
--> 201 return self.c_sparse_matrix.csc()
To Reproduce
Steps to reproduce the behavior:
- Create an OnDiskDataset with edges stored in .npy files whose integers are all cast to int32, and with a number of edges greater than the int32 maximum (2,147,483,647). A sketch of this setup is given after these steps.
- Load the dataset and run preprocessing.
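A minimal sketch of the setup that triggers the crash. The paths, node count, and the (2, num_edges) numpy layout are assumptions for illustration; the real dataset is built from our own edge files and a matching metadata.yaml.

```python
import numpy as np
import dgl.graphbolt as gb

# Illustrative sizes only: materializing 2.28B int32 edges needs roughly 18 GB,
# so use a much smaller num_edges for a quick local check.
num_edges = 2_283_022_784   # > 2**31 - 1, the count from the error message above
num_nodes = 100_000_000     # node IDs fit comfortably in int32

# Edge list stored as int32 to halve the on-disk/in-memory footprint.
# Assumed (2, num_edges) layout for numpy-format edges; check the OnDiskDataset
# docs for the exact on-disk format.
edges = np.empty((2, num_edges), dtype=np.int32)
edges[0] = np.random.randint(0, num_nodes, size=num_edges, dtype=np.int32)
edges[1] = np.random.randint(0, num_nodes, size=num_edges, dtype=np.int32)
np.save("my_dataset/edges/edges.npy", edges)

# metadata.yaml (not shown) declares the graph with format "numpy" pointing at
# edges.npy. Preprocessing then fails inside COOToCSR with the int32 overflow
# shown above, because the *edge count*, not the node IDs, exceeds int32.
dataset = gb.OnDiskDataset("my_dataset", force_preprocess=True).load()
```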
Expected behavior
Preprocessing should succeed with int32 src/dst node IDs even when the number of edges exceeds the int32 range, since the node IDs themselves still fit in int32. To get around this issue, I currently have to double my CPU memory usage by not casting the integers, which erases the memory savings we expected when switching to GraphBolt.
Environment
- DGL Version (e.g., 1.0):
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
- OS (e.g., Linux):
- How you installed DGL (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version (if applicable):
- GPU models and configuration (e.g. V100):
- Any other relevant information:
Additional context
This is not expected; we are successfully using int32 for the ogbn-papers100M dataset, which has over 3B edges. @Rhett-Ying, what do you think is the core issue here?
Since there is a preprocessing step, cast your data to int64 first and let the preprocessing run. After preprocessing, when you load the gb.CSCSamplingGraph, the dtype of the edge indices should be back to int32.
The preprocessing steps use DGL underneath, which does not support mixed dtypes for the indptr and indices tensors.
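A hedged sketch of that workaround. The file paths are placeholders, and auto_cast_to_optimal_dtype is the parameter visible in the traceback above; it may already default to True in your DGL version, so verify against your release.

```python
import numpy as np
import dgl.graphbolt as gb

# Re-save the raw edge files as int64 so DGL's COOToCSR never has to index
# more than 2**31 - 1 entries with an int32 dtype.
edges = np.load("my_dataset/edges/edges.npy")               # currently int32
np.save("my_dataset/edges/edges.npy", edges.astype(np.int64))

# Re-run preprocessing; auto_cast_to_optimal_dtype lets GraphBolt shrink the
# node-ID tensors back down once the (int64) conversion to CSC has succeeded.
dataset = gb.OnDiskDataset(
    "my_dataset",
    force_preprocess=True,
    auto_cast_to_optimal_dtype=True,
).load()
```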
I encountered a similar issue with mixed data types, specifically using int32 for node IDs and int64 for edge IDs. I chose this approach due to scalability concerns, as the data would not fit into 500 GB of RAM otherwise.
I only found out later that DGL GraphBolt already supports this use case.
For anyone else struggling with mixed data types, please refer to the following release note: https://www.dgl.ai/release/2024/03/06/release.html
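As a quick sanity check, something like the following can confirm that the preprocessed graph really ended up with mixed dtypes. The path is a placeholder and the attribute names are taken from the current FusedCSCSamplingGraph API; verify them against your installed version.

```python
import dgl.graphbolt as gb

dataset = gb.OnDiskDataset("my_dataset").load()
graph = dataset.graph

# With int32 node IDs and more than 2**31 - 1 edges, the expectation is:
#   indices    (node ID per edge)   -> torch.int32
#   csc_indptr (per-node edge span) -> torch.int64
print(graph.indices.dtype, graph.csc_indptr.dtype)
```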
Hope this helps someone avoid the same confusion I had.