
Using the ogbl-biokg with DeepSNAP

Open sophiakrix opened this issue 3 years ago • 4 comments

Hi there!

I was trying to use the ogbl-biokg graph with DeepSNAP, more precisely as input to link_prediction.py for heterogeneous graphs. Since DeepSNAP requires a networkx or PyTorch Geometric object, I tried to convert the ogbl-biokg graph into a PyTorch Geometric object and then transform it into a HeteroGraph, as you point out in the tutorial here.

However, this raised an error because the graph has no 'edge_index' attribute:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/storage.py in __getattr__(self, key)
     47         try:
---> 48             return self[key]
     49         except KeyError:

~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/storage.py in __getitem__(self, key)
     67     def __getitem__(self, key: str) -> Any:
---> 68         return self._mapping[key]
     69 

KeyError: 'edge_index'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_42299/3068006995.py in <module>
----> 1 graph = Graph.pyg_to_graph(ogbl_biokg_dataset[0])

~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/deepsnap/graph.py in pyg_to_graph(data, verbose, fixed_split, tensor_backend, netlib)
   1991             if netlib is not None:
   1992                 deepsnap._netlib = netlib
-> 1993             if data.is_directed():
   1994                 G = deepsnap._netlib.DiGraph()
   1995             else:

~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/data.py in is_directed(self)
    184     def is_directed(self) -> bool:
    185         r"""Returns :obj:`True` if graph edges are directed."""
--> 186         return not self.is_undirected()
    187 
    188     def clone(self):

~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/data.py in is_undirected(self)
    180     def is_undirected(self) -> bool:
    181         r"""Returns :obj:`True` if graph edges are undirected."""
--> 182         return all([store.is_undirected() for store in self.edge_stores])
    183 
    184     def is_directed(self) -> bool:

~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/data.py in <listcomp>(.0)
    180     def is_undirected(self) -> bool:
    181         r"""Returns :obj:`True` if graph edges are undirected."""
--> 182         return all([store.is_undirected() for store in self.edge_stores])
    183 
    184     def is_directed(self) -> bool:

~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/storage.py in is_undirected(self)
    395             return value.is_symmetric()
    396 
--> 397         edge_index = self.edge_index
    398         edge_attr = self.edge_attr if 'edge_attr' in self else None
    399         return is_undirected(edge_index, edge_attr, num_nodes=self.size(0))

~/.conda/envs/env_ogbl_gpu/lib/python3.8/site-packages/torch_geometric/data/storage.py in __getattr__(self, key)
     48             return self[key]
     49         except KeyError:
---> 50             raise AttributeError(
     51                 f"'{self.__class__.__name__}' object has no attribute '{key}'")
     52 

AttributeError: 'GlobalStorage' object has no attribute 'edge_index'

How can I convert the ogbl-biokg graph into an object that can be used with deepSNAP?

I would very much appreciate any help!

sophiakrix, Dec 09 '21 15:12

Hello! I'm not a maintainer of the deepsnap package, but I would be happy to try to help. Would you mind posting the code that produced the error message, so it is easier to spot what went wrong?

I wonder whether you created a heterogeneous graph object with v2 of pytorch-geometric, which might not yet be supported by deepsnap. If you have a pytorch-geometric HeteroData graph object, the edges of the different types are already stored in a dictionary, so it wouldn't have an edge_index entry (but rather an edge_index_dict), and that might be what the error message is complaining about.
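One way to check this before calling Graph.pyg_to_graph is to look at which edge attribute the object exposes. The sketch below is only an illustration: describe_edge_storage is a hypothetical helper (not part of deepsnap or pytorch-geometric), and it relies on plain attribute lookup, so it is demonstrated here on stand-in objects rather than on a real dataset.

```python
from types import SimpleNamespace


def describe_edge_storage(data):
    """Report how `data` stores its edges, using attribute lookup only.

    Hypothetical helper for illustration; it assumes the object exposes
    either `edge_index_dict` (one edge index per edge type) or a single
    `edge_index`, as discussed above.
    """
    if getattr(data, "edge_index_dict", None) is not None:
        return "heterogeneous"
    if getattr(data, "edge_index", None) is not None:
        return "homogeneous"
    return "unknown"


# Stand-in objects; a real graph would come from the dataset, e.g. dataset[0].
hetero_like = SimpleNamespace(
    edge_index_dict={("drug", "drug-disease", "disease"): [[0, 1], [0, 0]]}
)
homo_like = SimpleNamespace(edge_index=[[0, 1], [1, 0]])

print(describe_edge_storage(hetero_like))
print(describe_edge_storage(homo_like))
```

If the result is "heterogeneous", a converter that expects a single edge_index would be expected to fail in the way the traceback shows.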

A question for the developers of deepsnap: are you planning to update deepsnap to be compatible with HeteroData objects from pytorch-geometric >=v2? Can you share something about the roadmap for deepsnap in general?

anniekmyatt, Dec 12 '21 21:12

Hi @anniekmyatt ! Thanks for chipping in on this. There are only a few lines of code I used for this:

from ogb.linkproppred import PygLinkPropPredDataset

ogbl_biokg_dataset = PygLinkPropPredDataset(name = "ogbl-biokg")
graph = Graph.pyg_to_graph(ogbl_biokg_dataset[0])

sophiakrix, Dec 14 '21 15:12

I just ran this and I'm getting the same error (I added the line from deepsnap.graph import Graph though).

This error occurs because the ogbl_biokg_dataset has an edge_index_dict rather than a single edge_index attribute. OGB uses this edge_index_dict dictionary to specify the edges for the different edge types. If you are keen to use deepsnap, rather than pytorch-geometric directly, it seems you need to create the hetero graph object manually, like here. However, the ogbl_biokg_dataset consists only of triplets and has no node features, so you'll have to create some appropriate (or placeholder) features for the node_feature input. To create the deepsnap heterograph object, your code would look something like this:

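from deepsnap.hetero_graph import HeteroGraph
from ogb.linkproppred import PygLinkPropPredDataset

dataset = PygLinkPropPredDataset(name="ogbl-biokg")
graph = dataset[0]
hetero_graph = HeteroGraph(
    node_feature=<insert your node features here>,  # dict: node type -> feature tensor
    edge_index=graph.edge_index_dict,  # Note that this is a dictionary with an edge index for each edge type
    directed=True)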

About the node features: this should be a dictionary keyed by node type (e.g. disease, drug...), with each value a torch tensor of dimension (number_of_nodes, number_of_features_per_node).
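As a rough sketch of what such a placeholder dictionary could look like: make_placeholder_features is a hypothetical helper, the node counts below are made up for illustration (they are not the real ogbl-biokg sizes), and constant all-ones features are just one possible placeholder choice.

```python
import torch


def make_placeholder_features(num_nodes_by_type, num_features=1):
    """Return {node_type: tensor of ones with shape (num_nodes, num_features)}.

    Hypothetical helper: constant features carry no information and only
    serve to satisfy an API that expects a feature tensor per node type.
    """
    return {
        node_type: torch.ones(num_nodes, num_features)
        for node_type, num_nodes in num_nodes_by_type.items()
    }


# Made-up node counts, for illustration only.
node_feature = make_placeholder_features({"drug": 3, "disease": 2}, num_features=4)
for node_type, features in node_feature.items():
    print(node_type, tuple(features.shape))
```

For the real dataset, the per-type node counts would have to come from the graph itself; the OGB graph object may expose them (for example as a num_nodes_dict attribute), but that is an assumption worth verifying.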

I am curious which deepsnap functionality specifically you would like to use? For an RDF graph like this (without node features), wouldn't a package like DGL-KE be more helpful as it has lots of embedding functionality that doesn't rely on message passing of node features?

anniekmyatt, Dec 14 '21 22:12

Hi @anniekmyatt !

Thanks for your reply. I tried to create a deepsnap HeteroGraph object from scratch here for the ogbl-biokg graph. I followed the deepsnap tutorial for heterogeneous graphs to create the object.

One important step here is to relabel the nodes from ogbl-biokg, since its node IDs start at 0 for every node type, whereas the networkx graph needs a unique label per node (I use consecutive integers).
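In miniature, the relabeling idea looks like the following sketch. It uses made-up triples and a hypothetical relabel_nodes helper, and it only shows the mapping step: each distinct (node id, node type) pair gets the next free integer.

```python
def relabel_nodes(triples):
    """Map each distinct (node_id, node_type) pair to a consecutive integer.

    `triples` is an iterable of (head_id, head_type, tail_id, tail_type).
    Labels are assigned in order of first appearance, starting at 0.
    """
    mapping = {}
    for head_id, head_type, tail_id, tail_type in triples:
        for key in ((head_id, head_type), (tail_id, tail_type)):
            if key not in mapping:
                mapping[key] = len(mapping)
    return mapping


# Made-up triples: ID 0 is reused across three node types.
triples = [
    (0, "drug", 0, "disease"),
    (1, "drug", 0, "disease"),
    (0, "drug", 0, "protein"),
]
mapping = relabel_nodes(triples)
for (node_id, node_type), new_label in mapping.items():
    print(node_type, node_id, "->", new_label)
```

Keying the mapping on the (id, type) pair is what keeps drug 0, disease 0 and protein 0 apart. The full version for the dataset follows.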

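import torch
import numpy as np
import networkx as nx
from collections import defaultdict
from tqdm import tqdm  # import the function, not the module, so tqdm(...) is callable
from deepsnap.hetero_graph import HeteroGraph
from ogb.linkproppred import PygLinkPropPredDataset


ogbl_biokg_dataset = PygLinkPropPredDataset(name = "ogbl-biokg")
# edge_split was not defined in the original snippet; assumed to come from OGB's get_edge_split()
edge_split = ogbl_biokg_dataset.get_edge_split()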

# =====================
# Relabel nodes
# =====================

## convert to array for speed
edge_split_array = dict()
for dataset in ['train', 'valid', 'test']:
    edge_split_array[dataset] = dict()
    for key in edge_split[dataset]:
        if type(edge_split[dataset][key]) != list:
            edge_split_array[dataset][key] = edge_split[dataset][key].numpy()
        else: 
            edge_split_array[dataset][key] = np.array(edge_split[dataset][key])

# new node label
current_node_label = 0
# track nodes that have been seen
seen = set()

new_label_mapping = defaultdict(dict)
new_label_mapping_inv = defaultdict(dict)

for dataset in ['train', 'valid', 'test']:
    for i in tqdm(range(len(edge_split_array[dataset]['head']))):

        tmp_head_node = (edge_split_array[dataset]['head'][i], edge_split_array[dataset]['head_type'][i])
        tmp_tail_node = (edge_split_array[dataset]['tail'][i], edge_split_array[dataset]['tail_type'][i])

        if tmp_head_node not in seen:

            seen.add(tmp_head_node)
            new_label_mapping[current_node_label]['original_node_label'] = int(edge_split_array[dataset]['head'][i])
            new_label_mapping[current_node_label]['node_type'] = edge_split_array[dataset]['head_type'][i]
            new_label_mapping_inv[tmp_head_node] = current_node_label
            current_node_label += 1

        if tmp_tail_node not in seen:

            seen.add(tmp_tail_node)
            new_label_mapping[current_node_label]['original_node_label'] = int(edge_split_array[dataset]['tail'][i])
            new_label_mapping[current_node_label]['node_type'] = edge_split_array[dataset]['tail_type'][i]
            new_label_mapping_inv[tmp_tail_node] = current_node_label
            current_node_label += 1


# =====================
# Create HeteroGraph
# =====================
G = nx.DiGraph()

for dataset in ['train', 'valid', 'test']:
    for i in tqdm(range(len(edge_split_array[dataset]['head']))):
        
        # head node
        head_node_id = edge_split_array[dataset]['head'][i].item()
        head_node_type = edge_split_array[dataset]['head_type'][i]
        new_head_node_id = new_label_mapping_inv[(head_node_id, head_node_type)]

        # tail node
        tail_node_id = edge_split_array[dataset]['tail'][i].item()
        tail_node_type = edge_split_array[dataset]['tail_type'][i]
        new_tail_node_id = new_label_mapping_inv[(tail_node_id, tail_node_type)]

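        # edge type
        edge_type_id = edge_split_array[dataset]['relation'][i].item()
        # NOTE: edge_index_to_type_mapping (relation id -> relation name) is not defined in this
        # snippet and must be built beforehand; edge_type_label is not used below.
        edge_type_label = edge_index_to_type_mapping[edge_type_id]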

        G.add_node(new_head_node_id, node_type=head_node_type, node_label=head_node_type)
        G.add_node(new_tail_node_id, node_type=tail_node_type, node_label=tail_node_type)
        G.add_edge(new_head_node_id, new_tail_node_id, edge_type=str(edge_type_id))
        

When I run this code, it creates a networkx graph, as shown in the tutorial. I can also convert it into a HeteroGraph object from deepsnap with this:

# Transform to a heterograph object that is recognised by deepSNAP
hetero = HeteroGraph(G)

But the object does not have an edges attribute:

>>> hetero
HeteroGraph(G=[], edge_feature=[], edge_index=[], edge_label_index=[], edge_to_graph_mapping=[], edge_to_tensor_mapping=[3540567], edge_type=[], node_feature=[], node_label_index=[], node_to_graph_mapping=[], node_to_tensor_mapping=[93773], node_type=[])

>>> hetero.edges()
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: 'HeteroGraph' object has no attribute 'edges'

I am wondering why this is, since I followed the authors' tutorial. Do you have any idea? It would be great if any of the authors could comment on this: @farzaank @JiaxuanYou @RexYing @jmilldotdev?

P.S. The reason why I would like to use deepsnap is exactly that it can use node features, which I would add for another graph later on.

sophiakrix, Dec 21 '21 16:12