Error when trying to use NeighborLoader with HeteroData from networkx

Open drewdrewdrewdrew opened this issue 2 years ago • 5 comments

🐛 Describe the bug

Hi, I'm trying to follow along with the Heterogeneous Graph Learning tutorial on some actual data, and when I get to the Graph Sampling step I run into an error I hope you can help me with. Here's my workflow:

  1. get data from BQ
  2. load it into a networkx MultiGraph
  3. load that graph into a HeteroData object
  4. attempt to send that object into a NeighborLoader for batching

Steps 1-3 work well (though I'd love some feedback on this process because it feels... suboptimal), but step 4 crashes with a "Process finished with exit code 134 (interrupted by signal 6: SIGABRT)" error. During debugging, I traced that error to this line but then got stuck.

I've recreated the error with a minimal dataset below. I did my best during inspection to make sure that my data object and the one from the tutorial matched but so far no luck. Any ideas? Thanks so much in advance!

import networkx as nx
import numpy as np
import torch
from torch_geometric.data import HeteroData
from torch_geometric.utils.convert import from_networkx
from torch_geometric.loader import NeighborLoader


G = nx.MultiGraph()

nodes = [
    (0, {'node_type':'person', 'node_feature': np.random.rand(5)}),
    (1, {'node_type':'person', 'node_feature': np.random.rand(5)}),
    (2, {'node_type':'person', 'node_feature': np.random.rand(5)}),

    (3, {'node_type':'company', 'node_feature': np.random.rand(5)}),
    (4, {'node_type':'company', 'node_feature': np.random.rand(5)}),
    (5, {'node_type':'company', 'node_feature': np.random.rand(5)}),
]
G.add_nodes_from(nodes)

edges = [
    (0, 1, {'edge_type':'is_friends_with'}),
    (0, 2, {'edge_type':'is_friends_with'}),
    (1, 2, {'edge_type':'is_friends_with'}),

    (0, 2, {'edge_type':'has_worked_with'}),
    (1, 2, {'edge_type':'has_worked_with'}),

    (0, 3, {'edge_type':'has_worked_at'}),
    (0, 4, {'edge_type':'has_worked_at'}),
    (1, 4, {'edge_type':'has_worked_at'}),
    (1, 5, {'edge_type':'has_worked_at'}),
    (2, 5, {'edge_type':'has_worked_at'}),

    (3, 5, {'edge_type':'same_industry_as'}),
]
G.add_edges_from(edges)

data = HeteroData(from_networkx(G))

# declare node features for each node type
for node_type in ['person','company']:
    idx = torch.BoolTensor([x == node_type for x in data['node_type']])
    data[node_type].x = data['node_feature'][idx]

# split out distinct edge types
idx = torch.BoolTensor([x == 'is_friends_with' for x in data['edge_type']])
data['person', 'is_friends_with', 'person'].edge_index = data['edge_index'][:, idx]

idx = torch.BoolTensor([x == 'has_worked_with' for x in data['edge_type']])
data['person', 'has_worked_with', 'person'].edge_index = data['edge_index'][:, idx]

idx = torch.BoolTensor([x == 'has_worked_at' for x in data['edge_type']])
data['person', 'has_worked_at', 'company'].edge_index = data['edge_index'][:, idx]

idx = torch.BoolTensor([x == 'same_industry_as' for x in data['edge_type']])
data['company', 'same_industry_as', 'company'].edge_index = data['edge_index'][:, idx]

# clean up data object
del data.edge_index
del data.node_id
del data.node_type
del data.node_feature
del data.edge_type
del data.num_nodes


print(data)

train_loader = NeighborLoader(
    data,
    # Sample up to 15 neighbors for each node and each edge type for 1 iteration:
    num_neighbors=[15] * 1,
    # Use a batch size of 128:
    batch_size=128,
)

Environment

torch==1.12.0 torch-geometric==2.0.4 torch-scatter==2.0.9 torch-sparse==0.6.14

  • OS: macOS 11.6.3
  • Python version: 3.8.12
  • CUDA/cuDNN version: so far working on CPU :)
  • How you installed PyTorch and PyG: pip
  • Any other relevant information (e.g., version of torch-scatter):

drewdrewdrewdrew avatar Jul 22 '22 14:07 drewdrewdrewdrew

I haven't run the code yet, but keep in mind that edge indices between two node types are always local to each node type (each type's indices start at 0). If you want to use your original edges, you will need to decrement them accordingly. IMO, it is a good idea not to use networkx for graph creation in the first place, since the two formats are a bit different.
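A minimal sketch of what such a decrement could look like, using plain torch tensors. The `build_local_ids` helper and the variable names are hypothetical, not part of PyG, and the sketch assumes each edge is stored once, from source to target:

```python
import torch

# Node type of every global node id, mirroring the toy graph in the report.
node_types = ['person', 'person', 'person', 'company', 'company', 'company']

# Global edges for the ('person', 'has_worked_at', 'company') relation.
global_edge_index = torch.tensor([
    [0, 0, 1, 1, 2],  # source: person (global ids)
    [3, 4, 4, 5, 5],  # target: company (global ids)
])


def build_local_ids(node_types):
    """Map each global node id to its position within its own node type."""
    counters = {}
    local_id = torch.empty(len(node_types), dtype=torch.long)
    for global_id, node_type in enumerate(node_types):
        local_id[global_id] = counters.get(node_type, 0)
        counters[node_type] = counters.get(node_type, 0) + 1
    return local_id


local_id = build_local_ids(node_types)

# Indexing with the 2 x E tensor remaps both rows at once: person ids are
# looked up in the person numbering, company ids in the company numbering.
local_edge_index = local_id[global_edge_index]
print(local_edge_index)
```

In this toy graph the companies have global ids 3 to 5, so their local ids become 0 to 2, while the person ids are unchanged because they already start at 0.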

rusty1s avatar Jul 23 '22 07:07 rusty1s

Interesting! Would you be willing to provide an example of how to achieve the decrement? And in terms of your advice against using networkx, can you provide any insight on a more recommended way to load a graph from real world data? This was definitely something I struggled with in many of the tutorials I found... they used nicely processed and packaged datasets and so didn't provide much insight in this regard :)

drewdrewdrewdrew avatar Jul 23 '22 19:07 drewdrewdrewdrew

Please take a look at https://pytorch-geometric.readthedocs.io/en/latest/notes/load_csv.html. Let me know if anything is unclear.

rusty1s avatar Jul 24 '22 04:07 rusty1s

Thanks! I feel a bit silly for having missed that.

Any thoughts on that error though? It silently kills my notebook kernel... even if it's the wrong approach for loading data it'd be nice to get a bit more information :)

drewdrewdrewdrew avatar Jul 24 '22 13:07 drewdrewdrewdrew

Yeah, the error is likely a segfault caused by accessing memory regions outside your graph memory (due to the use of out-of-bounds indices). Unfortunately, we do not error out gracefully here.
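One way to catch this before handing the data to a sampler is a manual bounds check. The sketch below is only an illustration: `check_edge_index` is a hypothetical helper, not a PyG function, and the node counts are assumed to be known per node type:

```python
import torch


def check_edge_index(edge_index, num_src_nodes, num_dst_nodes):
    """Raise ValueError if an index falls outside its node type's range."""
    if edge_index.numel() == 0:
        return  # nothing to check
    for name, row, num_nodes in (
        ('source', edge_index[0], num_src_nodes),
        ('target', edge_index[1], num_dst_nodes),
    ):
        lo, hi = int(row.min()), int(row.max())
        if lo < 0 or hi >= num_nodes:
            raise ValueError(
                f'{name} indices span [{lo}, {hi}] but only '
                f'{num_nodes} nodes exist for that type'
            )


# Three people and three companies, as in the toy graph.
local_edges = torch.tensor([[0, 0, 1], [0, 1, 1]])
check_edge_index(local_edges, num_src_nodes=3, num_dst_nodes=3)  # passes

# Global company ids (3, 4) exceed the three local company slots.
global_edges = torch.tensor([[0, 0, 1], [3, 4, 4]])
try:
    check_edge_index(global_edges, num_src_nodes=3, num_dst_nodes=3)
except ValueError as err:
    print(f'Rejected: {err}')
```

Running a check like this on every edge type turns the silent crash into a readable Python exception.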

rusty1s avatar Jul 24 '22 17:07 rusty1s

Understood. Thanks, Matthias!

drewdrewdrewdrew avatar Aug 23 '22 08:08 drewdrewdrewdrew