neural-subgraph-learning-GNN
How to train the encoder for our own data? (A Knowledge graph and sample query)
Hi,
I have a target graph in the form of a directed networkx graph with 14M nodes and 54M edges. I want to know how I can use this target graph, along with a query graph (30 nodes, 33 edges), to train the encoder.
I can only see options for using the built-in datasets in PyTorch Geometric. Is there a simpler way to use my own datasets?
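For concreteness, the graphs in question are ordinary networkx objects. A small-scale stand-in for the target and query graphs (the construction below is purely illustrative; the real target graph would be loaded from disk) might look like:

```python
import networkx as nx

# Hypothetical stand-in for the real target graph, which has ~14M nodes
# and ~54M edges and would typically be loaded from an edge list, e.g.
# nx.read_edgelist("edges.txt", create_using=nx.DiGraph).
target = nx.gnp_random_graph(1000, 0.01, directed=True, seed=0)

# A small query graph with 30 nodes and 33 edges.
query = nx.DiGraph()
query.add_edges_from((i, i + 1) for i in range(29))            # 30-node path, 29 edges
query.add_edges_from([(0, 10), (5, 20), (3, 15), (7, 25)])     # 4 extra edges -> 33 total

print(query.number_of_nodes(), query.number_of_edges())  # 30 33
```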
I have the same question.
+1
Thanks for the question and sorry for the late reply. There is currently no user-facing mechanism for incorporating custom datasets, due to the need to define things like the train/test split and subgraph sampling -- in general, one can create a new DataSource (see common/data.py) to handle new datasets. Note that a pretrained model (such as the one provided in the repo) may be able to handle testing on new datasets, in which case subgraph_matching/alignment.py can load in new graphs to evaluate on.
If the goal is to train on new datasets, as a bit of a hack, one could append an `elif` after this line: https://github.com/snap-stanford/neural-subgraph-learning-GNN/blob/4d074cbc0fa9d81defef746302e62b1b9a97791d/common/data.py#L55

with a spec for a new dataset:

```python
elif name == 'newdataset':
    dataset = [...]  # list of networkx or PyTorch Geometric graphs
```

then train using the command line option `--dataset=newdataset-balanced` and test with `--dataset=newdataset-imbalanced`.
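As a self-contained sketch, the list of graphs that the new `elif` branch would assign to `dataset` can be built like this (the dataset name and graph-generation code below are placeholders; in practice the graphs would be loaded from your own files):

```python
import networkx as nx

def load_custom_graphs():
    """Build the list of networkx graphs that the new `elif` branch in
    common/data.py would assign to `dataset`. The random graphs below
    are placeholders for real data loaded from disk."""
    return [nx.gnp_random_graph(40, 0.15, seed=seed) for seed in range(8)]

# The hack in common/data.py would then be roughly:
#
#     elif name == 'newdataset':
#         dataset = load_custom_graphs()
#
# after which training uses --dataset=newdataset-balanced and testing
# uses --dataset=newdataset-imbalanced.

dataset = load_custom_graphs()
print(len(dataset), dataset[0].number_of_nodes())  # 8 40
```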
Thanks @qema, I was able to train the network on my custom datasets; however, I get only around 70% validation accuracy. Any suggestions for improving the model accuracy or fine-tuning it? I am using all default model parameters. The second plot shows the validation metrics.
Hi @rd27995, please see the new experimental branch, which supports node features and harder negative sampling. For now, the above procedure for adding new datasets is still needed. However, one can now train with `--dataset=newdataset-basis` and test with `--dataset=newdataset-imbalanced` (`-basis` being the new data source with harder negative examples). Also, note that testing on the imbalanced dataset (which samples random pairs of graphs) may give a more realistic picture of model performance than validation, which uses an artificial 50-50 label split as well as artificially generated negative examples.