
Generating graph from data & Questions

Open amit-pande-2018 opened this issue 7 years ago • 5 comments

I am quite new to graphs and am trying to translate my datasets (stored in HDFS, which I can read using Scala/Python/Hive) into networkx graphs in the -G.json format. The datasets are not in graph format; they are guest transactions, which I can translate into a graph.

First, is there any utility to do that? Second, what is the use of the features and label in the graph description? How are these features different from the -feat.npy features for nodes? Is the label only for supervised learning?

Third, is there any talk or detailed set of slides about the implementation? I got some hints from the paper, but a talk would make it easier to follow, I guess. (I have seen Jure's recent talks on this, but they are overview talks; I am looking for detailed ones.)

amit-pande-2018 avatar Feb 23 '18 22:02 amit-pande-2018

  1. Networkx has a very easy-to-use interface for constructing graphs from data, so you could certainly do it relatively fast. Note, however, that networkx (like naive Python implementations in general) is not fast on large datasets, so I would recommend converting the data to smaller, re-indexed adjacency matrices built per minibatch at training time. Let me know if you need that and want more explanation.

  2. The label is just for supervised learning. Features for nodes typically describe the entity that each node in the network represents. E.g., in a PPI network, the features are characteristics of proteins; in a social network, the features carry information about users. feat.npy stores these features as a num_nodes x feature_dim matrix in the numpy format.

  3. Maybe Will has a bit more slides. For details related to the paper, I'd also recommend the graphsage-simple repo (https://github.com/williamleif/graphsage-simple), which re-writes the simplest variant of the algorithm in PyTorch, and is very easy to understand compared to this TF full version. Hope that would help!
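To make points 1 and 2 concrete, here is a minimal sketch of turning transactions into a networkx graph plus a node feature matrix. The transaction tuples, the guest/item node scheme, the two toy features, and the helper names (build_graph, build_features, save) are all assumptions for illustration; they are not part of GraphSAGE, and the exact attributes the training code expects on nodes and edges are not covered here.

```python
import json

import networkx as nx
import numpy as np
from networkx.readwrite import json_graph

# Hypothetical guest transactions: (guest_id, item_id, amount).
transactions = [
    ("guest_1", "item_a", 12.0),
    ("guest_1", "item_b", 5.5),
    ("guest_2", "item_a", 7.25),
]


def build_graph(rows):
    """Build an undirected guest-item graph with one edge per (guest, item) pair."""
    G = nx.Graph()
    for guest, item, amount in rows:
        G.add_node(guest, kind="guest")
        G.add_node(item, kind="item")
        if G.has_edge(guest, item):
            # Repeated purchases accumulate on the existing edge.
            G[guest][item]["amount"] += amount
        else:
            G.add_edge(guest, item, amount=amount)
    return G


def build_features(G):
    """Return (id_map, feats); row i of feats describes the node mapped to index i."""
    id_map = {node: i for i, node in enumerate(G.nodes())}
    feats = np.zeros((G.number_of_nodes(), 2), dtype=np.float32)
    for node, i in id_map.items():
        feats[i, 0] = G.degree(node)  # toy feature: number of neighbors
        feats[i, 1] = sum(d["amount"] for _, _, d in G.edges(node, data=True))
    return id_map, feats


def save(prefix, G, feats):
    """Write the graph as node-link JSON and the features as a .npy file."""
    with open(prefix + "-G.json", "w") as f:
        json.dump(json_graph.node_link_data(G), f)
    np.save(prefix + "-feat.npy", feats)


G = build_graph(transactions)
id_map, feats = build_features(G)
graph_json = json.dumps(json_graph.node_link_data(G))
```

The id_map is what ties a node in the JSON graph to its row in the feature matrix, so it has to be saved or rebuilt in the same order whenever the features are loaded.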

Rex

RexYing avatar Mar 15 '18 23:03 RexYing

Rex thanks for the comments. Figured out the rest.

amitpande123 avatar Mar 19 '18 13:03 amitpande123

Hi Rex Ying! I'm also a newbie in networkx and I'm facing the same problem as @amitpande123. Could you please explain this sentence in more detail: "I would recommend converting data to re-indexed smaller adjacency matrices based on minibatches at training time." I'm quite confused about this. Thanks in advance!

binh-ml avatar Jun 04 '18 05:06 binh-ml

Sorry for the late response. Suppose that (v1, v2, ...) is your minibatch of nodes. Find all nodes in the neighborhood of each of (v1, v2, ...), and let V' be the set of all nodes in these neighborhoods. Now construct a subgraph G' of the original graph G that contains only the nodes in V' and the edges between them. Obtain the adjacency matrix of G', then re-index each node in V' according to its order in the new adjacency matrix.

Using the new index, build the feature matrix: row i holds the features of the i-th node in G'.
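The steps above can be sketched roughly as follows. This is an illustration, not the code used in the repo: the function name minibatch_subgraph is hypothetical, it assumes 1-hop neighborhoods, it assumes the minibatch nodes themselves are kept in V', and it assumes node ids are sortable so that a fixed order can define the new index.

```python
import networkx as nx
import numpy as np


def minibatch_subgraph(G, feats, id_map, batch):
    """Return (adj, sub_feats, new_index) for the 1-hop subgraph around a minibatch."""
    # V': the minibatch nodes plus every neighbor of a minibatch node.
    nodes = set(batch)
    for v in batch:
        nodes.update(G.neighbors(v))

    # A fixed ordering of V' defines the new, compact node index.
    order = sorted(nodes)
    new_index = {v: i for i, v in enumerate(order)}

    # G' keeps only the nodes in V' and the edges between them.
    sub = G.subgraph(order)
    adj = nx.to_numpy_array(sub, nodelist=order)

    # Row i of sub_feats holds the features of the i-th node of G'.
    sub_feats = feats[[id_map[v] for v in order]]
    return adj, sub_feats, new_index


# Toy data: a path graph 0-1-2-3-4-5 with two features per node.
G = nx.path_graph(6)
feats = np.arange(12).reshape(6, 2)
id_map = {n: n for n in G.nodes()}

adj, sub_feats, new_index = minibatch_subgraph(G, feats, id_map, batch=[4])
```

For the minibatch [4], V' is {3, 4, 5}, so the adjacency matrix is 3 x 3 and the original node 3 becomes row 0. A model that aggregates over more than one hop would need the neighborhood expansion repeated once per hop.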

Let me know if you have more questions.

RexYing avatar Nov 08 '18 22:11 RexYing

Hi @RexYing, I am trying to train GraphSAGE on my own dataset and have created the graph using networkx. However, the minibatch code requires a dataset dict that has links and nodes entries, with train_removed as one of the fields. I could not find this information anywhere else. Could you please help me understand how to create the dataset in the required format, or share some insight into how I can do so with GraphSAGE? Any help would be appreciated!

Thanks, Nidhi

NidhiSultan avatar Sep 01 '20 03:09 NidhiSultan