deepsnap
Using the ogbl-biokg with DeepSNAP
Hi there!
I was trying to use the ogbl-biokg graph with DeepSNAP, more precisely as input to link_prediction.py for heterogeneous graphs. Since DeepSNAP requires a networkx or PyTorch Geometric object, I tried to convert the ogbl-biokg graph into a PyTorch Geometric object and then transform it into a HeteroGraph, as you point out in the tutorial here.
However, this raised an error because the graph object has no 'edge_index' attribute:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/storage.py in __getattr__(self, key)
47 try:
---> 48 return self[key]
49 except KeyError:
~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/storage.py in __getitem__(self, key)
67 def __getitem__(self, key: str) -> Any:
---> 68 return self._mapping[key]
69
KeyError: 'edge_index'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
/tmp/ipykernel_42299/3068006995.py in <module>
----> 1 graph = Graph.pyg_to_graph(ogbl_biokg_dataset[0])
~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/deepsnap/graph.py in pyg_to_graph(data, verbose, fixed_split, tensor_backend, netlib)
1991 if netlib is not None:
1992 deepsnap._netlib = netlib
-> 1993 if data.is_directed():
1994 G = deepsnap._netlib.DiGraph()
1995 else:
~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/data.py in is_directed(self)
184 def is_directed(self) -> bool:
185 r"""Returns :obj:`True` if graph edges are directed."""
--> 186 return not self.is_undirected()
187
188 def clone(self):
~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/data.py in is_undirected(self)
180 def is_undirected(self) -> bool:
181 r"""Returns :obj:`True` if graph edges are undirected."""
--> 182 return all([store.is_undirected() for store in self.edge_stores])
183
184 def is_directed(self) -> bool:
~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/data.py in <listcomp>(.0)
180 def is_undirected(self) -> bool:
181 r"""Returns :obj:`True` if graph edges are undirected."""
--> 182 return all([store.is_undirected() for store in self.edge_stores])
183
184 def is_directed(self) -> bool:
~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/storage.py in is_undirected(self)
395 return value.is_symmetric()
396
--> 397 edge_index = self.edge_index
398 edge_attr = self.edge_attr if 'edge_attr' in self else None
399 return is_undirected(edge_index, edge_attr, num_nodes=self.size(0))
~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/storage.py in __getattr__(self, key)
48 return self[key]
49 except KeyError:
---> 50 raise AttributeError(
51 f"'{self.__class__.__name__}' object has no attribute '{key}'")
52
AttributeError: 'GlobalStorage' object has no attribute 'edge_index'
How can I convert the ogbl-biokg graph into an object that can be used with deepSNAP?
I would very much appreciate any help!
Hello! I'm not a maintainer of the deepsnap package but I would be happy to try and help. Would you mind posting the code that you used, that resulted in the error message, so it is easier to spot what went wrong?
I wonder whether you created a heterogeneous graph object with v2 of pytorch-geometric, which might not yet be supported by deepsnap. In a pytorch-geometric HeteroData graph object, the edges of the different types are stored in a dictionary, so the object has no edge_index entry (but rather an edge_index_dict), and that might be what the error message is complaining about.
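As a quick way to check which of the two layouts an object uses before handing it to pyg_to_graph, here is a minimal sketch. The helper name edge_layout is hypothetical, and the two objects are plain stand-ins built with SimpleNamespace rather than real pytorch-geometric or OGB objects.

```python
from types import SimpleNamespace


def edge_layout(data):
    """Report how a graph-like object stores its edges.

    Hypothetical helper: returns "edge_index", "edge_index_dict" or "none".
    """
    # getattr with a default also absorbs the AttributeError from the traceback
    if getattr(data, "edge_index", None) is not None:
        return "edge_index"
    if getattr(data, "edge_index_dict", None) is not None:
        return "edge_index_dict"
    return "none"


# Stand-ins for illustration only; real objects would come from PyG or OGB.
homogeneous = SimpleNamespace(edge_index=[[0, 1], [1, 2]])
heterogeneous = SimpleNamespace(
    edge_index_dict={("drug", "drug-disease", "disease"): [[0, 1], [3, 4]]}
)

print(edge_layout(homogeneous))    # edge_index
print(edge_layout(heterogeneous))  # edge_index_dict
```

If the check reports edge_index_dict, the object stores one edge index per edge type, and code that reads a single edge_index would be expected to fail as in the traceback.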
A question for the developers of deepsnap: are you planning to update deepsnap to be compatible with HeteroData objects from pytorch-geometric >=v2? Can you share something about the roadmap for deepsnap in general?
Hi @anniekmyatt ! Thanks for chipping in on this. There are only a few lines of code I used for this:
from ogb.linkproppred import PygLinkPropPredDataset
ogbl_biokg_dataset = PygLinkPropPredDataset(name = "ogbl-biokg")
graph = Graph.pyg_to_graph(ogbl_biokg_dataset[0])
I just ran this and I'm getting the same error (I added the line from deepsnap.graph import Graph though).
This error occurs because the ogbl_biokg_dataset has an edge_index_dict rather than a single edge_index attribute. OGB uses this edge_index_dict dictionary to specify the edges for the different edge types. If you are keen to use deepsnap, rather than pytorch-geometric directly, it seems you need to create the hetero graph object manually, like here. However, the ogbl_biokg_dataset consists only of triplets and has no node features, so you'll have to create some appropriate (or placeholder) features for the node_feature input. To create the deepsnap heterograph object, your code would look something like this:
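from deepsnap.hetero_graph import HeteroGraph
from ogb.linkproppred import PygLinkPropPredDataset
dataset = PygLinkPropPredDataset(name="ogbl-biokg")
graph = dataset[0]
# node_feature is not provided by the dataset; you have to define it yourself (see below)
hetero_graph = HeteroGraph(
    node_feature=node_feature,  # insert your node features here
    edge_index=graph.edge_index_dict,  # note that this is a dictionary with an edge index for each edge type
    directed=True,
)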
About the node features: this should be a dictionary keyed by node type (e.g. disease, drug...) whose values are torch tensors of shape (number_of_nodes, number_of_features_per_node).
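To make the expected shapes concrete, here is a minimal sketch that infers a (number_of_nodes, number_of_features) shape per node type from a toy edge dictionary. The helper name placeholder_feature_shapes and the toy data are hypothetical, and plain Python lists stand in for tensors. It assumes that keys are (head_type, relation, tail_type) tuples and that node IDs count from 0 within each type.

```python
def placeholder_feature_shapes(edge_index_dict, num_features=1):
    """Infer a (number_of_nodes, number_of_features) shape per node type.

    Assumes keys are (head_type, relation, tail_type) and values are
    [head_ids, tail_ids] with ids counted from 0 within each node type.
    """
    max_id = {}
    for (head_type, _relation, tail_type), (heads, tails) in edge_index_dict.items():
        for node_type, ids in ((head_type, heads), (tail_type, tails)):
            for node_id in ids:
                max_id[node_type] = max(max_id.get(node_type, -1), node_id)
    # Highest id + 1 gives the node count; isolated nodes would be missed.
    return {t: (m + 1, num_features) for t, m in max_id.items()}


# Toy data for illustration only (plain lists instead of tensors).
toy_edges = {
    ("drug", "drug-disease", "disease"): [[0, 1, 2], [0, 0, 1]],
    ("drug", "drug-drug", "drug"): [[0, 3], [1, 2]],
}
shapes = placeholder_feature_shapes(toy_edges)
print(shapes)  # {'drug': (4, 1), 'disease': (2, 1)}
```

Each shape could then be turned into a placeholder tensor, for example with torch.ones(shape). If the dataset object already exposes per-type node counts, those would be more reliable than counts inferred from the edges.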
I am curious which deepsnap functionality specifically you would like to use? For an RDF graph like this (without node features), wouldn't a package like DGL-KE be more helpful as it has lots of embedding functionality that doesn't rely on message passing of node features?
Hi @anniekmyatt !
Thanks for your reply. I tried to create a deepsnap HeteroGraph object from scratch for the ogbl-biokg graph, following the deepsnap tutorial for heterogeneous graphs.
One important step is to relabel the nodes of ogbl-biokg: labels start at 0 for every node type, so the same label occurs under several types, whereas the networkx graph needs one unique label per node (consecutive labels are used here).
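As a minimal illustration of the relabelling idea, separate from the full script that follows, the sketch below gives each node type a contiguous block of global IDs through per-type offsets. This is a different scheme from the first-seen numbering used in the script; the function names and the toy node counts are hypothetical.

```python
def build_global_ids(num_nodes_by_type):
    """Give each node type a contiguous block of global ids."""
    offsets, total = {}, 0
    # Sort the type names so the assignment is reproducible.
    for node_type in sorted(num_nodes_by_type):
        offsets[node_type] = total
        total += num_nodes_by_type[node_type]
    return offsets, total


def to_global(offsets, node_type, local_id):
    """Map a (node_type, local_id) pair to its global id."""
    return offsets[node_type] + local_id


# Toy node counts for illustration only.
toy_counts = {"drug": 4, "disease": 2}
offsets, total = build_global_ids(toy_counts)
print(offsets)                        # {'disease': 0, 'drug': 2}
print(to_global(offsets, "drug", 3))  # 5
```

With offsets, the original label can be recovered by subtracting the offset of the node's type, so no inverse lookup table is needed.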
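import torch
import numpy as np
import networkx as nx
from collections import defaultdict
from tqdm import tqdm  # the loops below call tqdm(...) directly
from deepsnap.hetero_graph import HeteroGraph
from ogb.linkproppred import PygLinkPropPredDataset
ogbl_biokg_dataset = PygLinkPropPredDataset(name="ogbl-biokg")
# edge_split is used below but was not defined in the snippet; OGB's get_edge_split() is assumed here
edge_split = ogbl_biokg_dataset.get_edge_split()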
# =====================
# Relabel nodes
# =====================
## convert to array for speed
edge_split_array = dict()
for dataset in ['train', 'valid', 'test']:
    edge_split_array[dataset] = dict()
    for key in edge_split[dataset]:
        if not isinstance(edge_split[dataset][key], list):
            edge_split_array[dataset][key] = edge_split[dataset][key].numpy()
        else:
            edge_split_array[dataset][key] = np.array(edge_split[dataset][key])
# new node label
current_node_label = 0
# track nodes that have been seen
seen = set()
new_label_mapping = defaultdict(dict)
new_label_mapping_inv = defaultdict(dict)
for dataset in ['train', 'valid', 'test']:
    for i in tqdm(range(len(edge_split_array[dataset]['head']))):
        tmp_head_node = (edge_split_array[dataset]['head'][i], edge_split_array[dataset]['head_type'][i])
        tmp_tail_node = (edge_split_array[dataset]['tail'][i], edge_split_array[dataset]['tail_type'][i])
        if tmp_head_node not in seen:
            seen.add(tmp_head_node)
            new_label_mapping[current_node_label]['original_node_label'] = int(edge_split_array[dataset]['head'][i])
            new_label_mapping[current_node_label]['node_type'] = edge_split_array[dataset]['head_type'][i]
            new_label_mapping_inv[tmp_head_node] = current_node_label
            current_node_label += 1
        if tmp_tail_node not in seen:
            seen.add(tmp_tail_node)
            new_label_mapping[current_node_label]['original_node_label'] = int(edge_split_array[dataset]['tail'][i])
            new_label_mapping[current_node_label]['node_type'] = edge_split_array[dataset]['tail_type'][i]
            new_label_mapping_inv[tmp_tail_node] = current_node_label
            current_node_label += 1
# =====================
# Create HeteroGraph
# =====================
G = nx.DiGraph()
for dataset in ['train', 'valid', 'test']:
    for i in tqdm(range(len(edge_split_array[dataset]['head']))):
        # head node
        head_node_id = edge_split_array[dataset]['head'][i].item()
        head_node_type = edge_split_array[dataset]['head_type'][i]
        new_head_node_id = new_label_mapping_inv[(head_node_id, head_node_type)]
        # tail node
        tail_node_id = edge_split_array[dataset]['tail'][i].item()
        tail_node_type = edge_split_array[dataset]['tail_type'][i]
        new_tail_node_id = new_label_mapping_inv[(tail_node_id, tail_node_type)]
        # edge type
        edge_type_id = edge_split_array[dataset]['relation'][i].item()
        # edge_index_to_type_mapping is not defined in this snippet and edge_type_label is unused below
        # edge_type_label = edge_index_to_type_mapping[edge_type_id]
        G.add_node(new_head_node_id, node_type=head_node_type, node_label=head_node_type)
        G.add_node(new_tail_node_id, node_type=tail_node_type, node_label=tail_node_type)
        G.add_edge(new_head_node_id, new_tail_node_id, edge_type=str(edge_type_id))
When I run this code, it creates a networkx graph, as shown in the tutorial. I can also convert it into a HeteroGraph object from deepsnap with this:
# Transform to a heterograph object that is recognised by deepSNAP
hetero = HeteroGraph(G)
But the resulting object has no edges attribute:
>>> hetero
HeteroGraph(G=[], edge_feature=[], edge_index=[], edge_label_index=[], edge_to_graph_mapping=[], edge_to_tensor_mapping=[3540567], edge_type=[], node_feature=[], node_label_index=[], node_to_graph_mapping=[], node_to_tensor_mapping=[93773], node_type=[])
>>> hetero.edges()
Traceback (most recent call last):
File "<string>", line 1, in <module>
AttributeError: 'HeteroGraph' object has no attribute 'edges'
I am wondering why this is, since I followed the tutorial of the authors. Do you have any idea? Would be great if any of the authors could comment on this @farzaank @JiaxuanYou @RexYing @jmilldotdev ?
P.S. The reason why I would like to use deepsnap is exactly that it can use node features, which I would add for another graph later on.