pytorch_geometric
Error when trying to use NeighborLoader with HeteroData from networkx
🐛 Describe the bug
Hi, I'm trying to follow along with the Heterogeneous Graph Learning tutorial using some real data, and when I get to the Graph Sampling step I run into an error I hope you can help me with. Here's my workflow:
- get data from BQ
- load it into a networkx MultiGraph
- load that graph into a HeteroData object
- attempt to send that object into a NeighborLoader for batching
Steps 1-3 work well (though I'd love some feedback on this process because it feels... suboptimal), but step 4 crashes with the error "Process finished with exit code 134 (interrupted by signal 6: SIGABRT)". During debugging, I traced that error to this line but then got stuck.
I've recreated the error with a minimal dataset below. I did my best during inspection to make sure that my data object and the one from the tutorial matched, but so far no luck. Any ideas? Thanks so much in advance!
import networkx as nx
import numpy as np
import torch
from torch_geometric.data import HeteroData
from torch_geometric.utils.convert import from_networkx
from torch_geometric.loader import NeighborLoader
G = nx.MultiGraph()
nodes = [
    (0, {'node_type': 'person', 'node_feature': np.random.rand(5)}),
    (1, {'node_type': 'person', 'node_feature': np.random.rand(5)}),
    (2, {'node_type': 'person', 'node_feature': np.random.rand(5)}),
    (3, {'node_type': 'company', 'node_feature': np.random.rand(5)}),
    (4, {'node_type': 'company', 'node_feature': np.random.rand(5)}),
    (5, {'node_type': 'company', 'node_feature': np.random.rand(5)}),
]
G.add_nodes_from(nodes)
edges = [
    (0, 1, {'edge_type': 'is_friends_with'}),
    (0, 2, {'edge_type': 'is_friends_with'}),
    (1, 2, {'edge_type': 'is_friends_with'}),
    (0, 2, {'edge_type': 'has_worked_with'}),
    (1, 2, {'edge_type': 'has_worked_with'}),
    (0, 3, {'edge_type': 'has_worked_at'}),
    (0, 4, {'edge_type': 'has_worked_at'}),
    (1, 4, {'edge_type': 'has_worked_at'}),
    (1, 5, {'edge_type': 'has_worked_at'}),
    (2, 5, {'edge_type': 'has_worked_at'}),
    (3, 5, {'edge_type': 'same_industry_as'}),
]
G.add_edges_from(edges)
data = HeteroData(from_networkx(G))
# declare node features for each node type
for node_type in ['person', 'company']:
    idx = torch.BoolTensor([x == node_type for x in data['node_type']])
    data[node_type].x = data['node_feature'][idx]
# split out distinct edge types
idx = torch.BoolTensor([x == 'is_friends_with' for x in data['edge_type']])
data['person', 'is_friends_with', 'person'].edge_index = data['edge_index'][:, idx]
idx = torch.BoolTensor([x == 'has_worked_with' for x in data['edge_type']])
data['person', 'has_worked_with', 'person'].edge_index = data['edge_index'][:, idx]
idx = torch.BoolTensor([x == 'has_worked_at' for x in data['edge_type']])
data['person', 'has_worked_at', 'company'].edge_index = data['edge_index'][:, idx]
idx = torch.BoolTensor([x == 'same_industry_as' for x in data['edge_type']])
data['company', 'same_industry_as', 'company'].edge_index = data['edge_index'][:, idx]
# clean up data object
del data.edge_index
del data.node_id
del data.node_type
del data.node_feature
del data.edge_type
del data.num_nodes
print(data)
train_loader = NeighborLoader(
    data,
    # Sample 15 neighbors for each node and each edge type for 1 iteration:
    num_neighbors=[15] * 1,
    # Use a batch size of 128 for sampling training nodes:
    batch_size=128,
)
Environment
torch==1.12.0 torch-geometric==2.0.4 torch-scatter==2.0.9 torch-sparse==0.6.14
- OS: macOS 11.6.3
- Python version: 3.8.12
- CUDA/cuDNN version: so far working on CPU :)
- How you installed PyTorch and PyG: pip
- Any other relevant information (e.g., version of torch-scatter):
I haven't run the code yet, but keep in mind that edge indices between two node types are always local to those node types. If you want to use your original edges, you will need to decrement them accordingly. IMO, it is a good idea not to use networkx for graph creation in the first place, since the two formats are a bit different.
Interesting! Would you be willing to provide an example of how to achieve the decrement? And in terms of your advice against using networkx, can you provide any insight on a more recommended way to load a graph from real world data? This was definitely something I struggled with in many of the tutorials I found... they used nicely processed and packaged datasets and so didn't provide much insight in this regard :)
Please take a look at https://pytorch-geometric.readthedocs.io/en/latest/notes/load_csv.html. Let me know if anything is unclear.
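As a rough illustration of the decrement discussed above (a sketch, not taken from the linked guide), the snippet below maps each global node id to its position within its own node type and groups edges by (source type, relation, destination type). The helper names build_local_ids and split_edges are hypothetical, plain Python lists stand in for tensors, and each edge is assumed to be listed in its intended source-to-destination direction.

```python
from collections import defaultdict

# Global node id -> node type, mirroring the minimal example in the report.
node_types = {0: "person", 1: "person", 2: "person",
              3: "company", 4: "company", 5: "company"}

# (source, destination, relation) triples that still use global node ids.
edges = [
    (0, 1, "is_friends_with"),
    (0, 3, "has_worked_at"),
    (0, 4, "has_worked_at"),
    (1, 4, "has_worked_at"),
    (1, 5, "has_worked_at"),
    (2, 5, "has_worked_at"),
    (3, 5, "same_industry_as"),
]


def build_local_ids(node_types):
    """Map each global node id to its index within its own node type."""
    counters = defaultdict(int)
    local_id = {}
    for global_id in sorted(node_types):
        ntype = node_types[global_id]
        local_id[global_id] = counters[ntype]
        counters[ntype] += 1
    return local_id, dict(counters)


def split_edges(edges, node_types, local_id):
    """Group edges by (src_type, relation, dst_type) using local indices."""
    grouped = defaultdict(lambda: ([], []))
    for src, dst, rel in edges:
        key = (node_types[src], rel, node_types[dst])
        grouped[key][0].append(local_id[src])
        grouped[key][1].append(local_id[dst])
    return {key: [rows, cols] for key, (rows, cols) in grouped.items()}


local_id, num_nodes = build_local_ids(node_types)
edge_index = split_edges(edges, node_types, local_id)
print(edge_index[("person", "has_worked_at", "company")])
```

Each resulting pair of lists could then be converted with torch.tensor and assigned to data[src_type, relation, dst_type].edge_index, so that company ids 3, 4, 5 become 0, 1, 2.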
Thanks! I feel a bit silly for having missed that.
Any thoughts on that error though? It silently kills my notebook kernel... even if it's the wrong approach for loading data it'd be nice to get a bit more information :)
Yeah, the error is likely a segfault caused by accessing memory regions outside your graph memory (due to the use of out-of-bounds indices). Unfortunately, we do not error out gracefully here.
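Since out-of-bounds indices are the suspected cause, one way to get a readable message instead of a crash is to validate the edges before building the loader. The following is a minimal sketch under assumptions: find_out_of_bounds_edges is a hypothetical helper, not part of PyG, and it works on plain lists of per-type edge indices plus a dictionary of node counts.

```python
def find_out_of_bounds_edges(edge_index, num_nodes):
    """Return, per edge type, the positions of edges with an invalid endpoint."""
    problems = {}
    for (src_type, rel, dst_type), (rows, cols) in edge_index.items():
        bad = [
            i for i, (s, d) in enumerate(zip(rows, cols))
            if not (0 <= s < num_nodes[src_type] and 0 <= d < num_nodes[dst_type])
        ]
        if bad:
            problems[(src_type, rel, dst_type)] = bad
    return problems


num_nodes = {"person": 3, "company": 3}

# Edges that still carry global ids: company ids 3..5 exceed the 3 company nodes.
edge_index = {
    ("person", "has_worked_at", "company"): [[0, 0, 1, 1, 2], [3, 4, 4, 5, 5]],
    ("person", "is_friends_with", "person"): [[0, 0, 1], [1, 2, 2]],
}

problems = find_out_of_bounds_edges(edge_index, num_nodes)
for edge_type, positions in problems.items():
    print(f"{edge_type}: {len(positions)} edge(s) reference missing nodes")
```

An empty result means every endpoint fits within the node count of its type; any reported edge type would need its indices remapped before sampling.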
Understood. Thanks Mattias!