
Random errors when loading OGB datasets

Open CongWeilin opened this issue 3 years ago • 1 comments

🐛 Describe the bug

I receive a random IndexError: index 242823520 is out of bounds for dimension 0 with size 123718280 when loading ogbn-products using

dataset = PygNodePropPredDataset(name='ogbn-products', transform=T.ToSparseTensor())

This error doesn't happen every time ... sometimes I hit it, sometimes I don't. It started after I recently updated my PyTorch and PyTorch Geometric versions ...

I reinstalled using pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cu113.html --upgrade --force-reinstall, but it still didn't work.

Environment

  • PyG version: 2.0.4
  • PyTorch version: 1.11.0+cu113
  • OS: Ubuntu 18.04
  • Python version: 3.8.5
  • CUDA/cuDNN version:
  • How you installed PyTorch and PyG (conda, pip, source): pip
  • Any other relevant information (e.g., version of torch-scatter):

Installed by command pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.11.0+cu113.html

CongWeilin avatar Aug 04 '22 04:08 CongWeilin

Can you try to remove the processed/ data of the OGB datasets when using a different PyG version? Hopefully, this will already resolve your issues.
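For reference, deleting the processed files can be sketched as follows (a minimal sketch; the root directory 'dataset/ogbn_products' is an assumption based on OGB's default layout, so adjust the path to wherever the data was actually downloaded):

```python
import os
import shutil

# Hypothetical path: OGB datasets default to a 'dataset/<name>' root;
# adjust to wherever the data was actually downloaded.
root = 'dataset/ogbn_products'
processed_dir = os.path.join(root, 'processed')

if os.path.isdir(processed_dir):
    shutil.rmtree(processed_dir)  # the dataset is re-processed on next load
```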

rusty1s avatar Aug 04 '22 06:08 rusty1s

Sorry for the late reply. The issue is still not fixed. I tried another version, pytorch=1.12.1+cu113, but it still doesn't work.

I found this issue happens whenever the dataset is large. For example, when I am using the Reddit dataset from PyG, it gives me the following error when creating a torch_sparse.SparseTensor:

~/GNN_models.py in get_feats(self, data, y_pred)
     73     def get_feats(self, data, y_pred=None):
     74 
---> 75         adj_t = torch_sparse.SparseTensor(
     76             row = data.edge_index[0].long(),
     77             col = data.edge_index[1].long(),

~/anaconda3/lib/python3.8/site-packages/torch_sparse/tensor.py in __init__(self, row, rowptr, col, value, sparse_sizes, is_sorted, trust_data)
     24         trust_data: bool = False,
     25     ):
---> 26         self.storage = SparseStorage(
     27             row=row,
     28             rowptr=rowptr,

~/anaconda3/lib/python3.8/site-packages/torch_sparse/storage.py in __init__(self, row, rowptr, col, value, sparse_sizes, rowcount, colptr, colcount, csr2csc, csc2csr, is_sorted, trust_data)
     66                 assert rowptr.numel() - 1 == M
     67             elif row is not None and row.numel() > 0:
---> 68                 assert trust_data or int(row.max()) < M
     69 
     70         N: int = 0

AssertionError: 

I have also seen the above error before when using OGB, but it pops up randomly ...

CongWeilin avatar Aug 18 '22 18:08 CongWeilin

Do you have a reproducible example on reddit?

rusty1s avatar Aug 18 '22 20:08 rusty1s

Yes, the following is a very simple example. The error happens randomly ... I would say there is a 50% chance I see it ...

import os.path as osp
from torch_geometric.datasets import Reddit

path = osp.join('./data', 'Reddit')
dataset = Reddit(root=path)
data = dataset[0]

from torch_geometric.loader import ShaDowKHopSampler
train_loader = ShaDowKHopSampler(data, depth=2, num_neighbors=5,
                                 batch_size=256, num_workers=10, shuffle=True)

Since ShaDowKHopSampler calls SparseTensor internally, it sometimes gives me this error ...

CongWeilin avatar Aug 18 '22 20:08 CongWeilin

I also get these kinds of random errors with the Reddit dataset ...

~/anaconda3/lib/python3.8/site-packages/torch_geometric/loader/shadow.py in __init__(self, data, depth, num_neighbors, node_idx, replace, **kwargs)
     49             self.is_sparse_tensor = False
     50             row, col = data.edge_index.cpu()
---> 51             self.adj_t = SparseTensor(
     52                 row=row, col=col, value=torch.arange(col.size(0)),
     53                 sparse_sizes=(data.num_nodes, data.num_nodes)).t()

~/anaconda3/lib/python3.8/site-packages/torch_sparse/transpose.py in <lambda>(self)
     32 
     33 
---> 34 SparseTensor.t = lambda self: t(self)
     35 
     36 ###############################################################################

~/anaconda3/lib/python3.8/site-packages/torch_sparse/transpose.py in t(src)
     11 
     12     if value is not None:
---> 13         value = value[csr2csc]
     14 
     15     sparse_sizes = src.storage.sparse_sizes()

IndexError: index 36028797059905686 is out of bounds for dimension 0 with size 114615892
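Out-of-range indices like the one above can be caught early, before SparseTensor's internal assertion fires, with a bounds check on the edge endpoints. A minimal sketch (`check_edge_index` is a hypothetical helper, not part of PyG; it works on plain Python lists or 1-D tensors):

```python
def check_edge_index(row, col, num_nodes):
    """Fail fast with a clear message if any edge endpoint is out of range,
    a typical symptom of a corrupted edge_index."""
    for name, idx in (('row', row), ('col', col)):
        lo, hi = int(min(idx)), int(max(idx))
        if lo < 0 or hi >= num_nodes:
            raise ValueError(
                f'{name} indices span [{lo}, {hi}] but num_nodes={num_nodes}')
```

Calling this on `data.edge_index[0]`, `data.edge_index[1]`, and `data.num_nodes` right after loading would distinguish a corrupted dataset on disk from a failure inside torch-sparse itself.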

CongWeilin avatar Aug 18 '22 23:08 CongWeilin

Thank you. I tried to reproduce this but failed :( I guess this has something to do with a broken torch-sparse installation. Can you try to remove the dependency and install it from source via pip install --verbose torch-sparse? Sorry for the inconvenience!

rusty1s avatar Aug 19 '22 12:08 rusty1s

Problem solved. It turns out my RAM is broken ... I never thought it could be due to a hardware issue. Thank you for your time.

CongWeilin avatar Sep 10 '22 03:09 CongWeilin