
Add Python API

Open • zheng-da opened this issue 4 years ago • 15 comments

The Python API is convenient for many use cases. It allows more customization and is very friendly for Jupyter Notebook users.

zheng-da avatar Apr 23 '20 18:04 zheng-da

One thing that would be especially helpful for a Python API would be a model class that, once trained, can do entity and edge prediction (e.g., https://graphvite.io/docs/latest/api/application.html#graphvite.application.KnowledgeGraphApplication). For example, given a list of entity nodes and relational edges, I may want to know either 1) the most likely (or top-k) destination nodes for a set of source nodes, or 2) the probability that a certain type of edge exists between a source and a destination node. Right now I plan to borrow code for evaluating pre-trained knowledge graph embeddings (https://aws-dglke.readthedocs.io/en/latest/hyper_param.html#evaluation-on-pre-trained-embeddings --> https://github.com/awslabs/dgl-ke/blob/master/python/dglke/eval.py) to do this on my own; however, it seems like this would be helpful for downstream tasks for many users. Please let me know if you think this would be useful; if I develop such a script, I can share it with you or develop it in a way that works within the dgl-ke library and can be imported. Thanks for your consideration.
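
Roughly, here's the kind of thing I have in mind, using the entity and relation embeddings that dgl-ke saves as .npy files (the file paths and the TransE-l2 scoring function below are just my assumptions for illustration):

# Sketch only: paths are hypothetical, and a TransE-l2 score
# (gamma - ||h + r - t||_2) is assumed.
import numpy as np

entity_emb = np.load('ckpts/entity.npy')      # hypothetical path
relation_emb = np.load('ckpts/relation.npy')  # hypothetical path

def score_triples(heads, rels, tails, gamma=12.0):
    """Higher score = more plausible triple under TransE-l2."""
    h = entity_emb[heads]
    r = relation_emb[rels]
    t = entity_emb[tails]
    return gamma - np.linalg.norm(h + r - t, axis=-1)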

AlexMRuch avatar May 03 '20 16:05 AlexMRuch

This is one of our motivations for creating a Python API. It'll be great if you could contribute to this. Could you share such a script with us once you have it? I think we should totally work together on this.

We'll share our previous design of the Python API. It'll be great if you can give us feedback.

zheng-da avatar May 03 '20 17:05 zheng-da

Happily! Thanks so much for your interest! Should we open another issue for the entity/link prediction ticket? Would you like me to fork dgl-ke, or do you want this to be in a feature branch? Also, if you have any suggestions for how we should go about developing this to make it most functional for the library, please let me know. Thanks in advance!

AlexMRuch avatar May 03 '20 17:05 AlexMRuch

My understanding is that you'd like to contribute a model class that evaluates pre-trained embeddings on various tasks: entity classification and link prediction. Is that right?

We can create another ticket to have more focused discussions. As for development, I think you can fork the repo and make a PR for us. Before that, can we start with a discussion of the API definition? We'd like the API to be stable, so it would be great to finalize the API design before we move to actual code.

zheng-da avatar May 03 '20 17:05 zheng-da

Yes, that is correct. It would be wonderful to have a model class that can ingest pre-trained embeddings and then perform entity classification and link prediction similar to what graphvite applications do.

Thanks for opening another ticket. I'll likely have questions throughout the process, so that'll help keep this issue ticket cleaner in the event others have ideas or wish to contribute to the Python API. I will fork the repo and can begin work after you all have discussed the API and settled on a stable definition, as you requested. Thanks for the guidance!

AlexMRuch avatar May 03 '20 17:05 AlexMRuch

The Python API is mainly defined for users to invoke KGE training in the Notebook environment. It doesn’t support distributed training.

Load Data

# Load builtin datasets
kg = dglke.dataset.FB15k()
# Load users' own data (raw or pre-formatted data)
kg = dglke.dataset.load(train=load_rdf('/path/to/train/file'),
                        valid=load_rdf('/path/to/valid/file'),
                        test=None,
                        format='htr')

Model load and creation

When a model is created, it has to be associated with a knowledge graph. Since KGE models are transductive, a model is only valid on the knowledge graph it was trained on.

model = dglke.TransE(dim=400)
model.attach_data(kg)

Model training

Training uses only the knowledge graph associated with the model, and the model is saved explicitly afterwards. When the model is saved to disk, we only save the model embeddings and configurations.

# When training a model, we need to provide the training data and
# specify all hyperparameters.
model.fit(num_epochs=10,
          gpus=[0, 1, 2, 3], batch_size=1000,
          neg_sample_size=400, lr=0.1,
          warm_start=False)
model.save('/path/to/save/model')

Restart model training from a checkpoint

Training knowledge graph embeddings may take a long time, so it's likely that people will want to save KGE models periodically and restart training. We should allow KGE training to resume from a checkpoint.

model = dglke.TransE(dim=400)
model.load('/path/to/trained/model')
model.attach_data(kg)
model.fit() # This raises an error if no kg is attached

Model evaluation

model.eval(kg.test, filter_edges=kg.train, neg_size=1000,
           neg_sample_strategy='...')

triplets = load_rdf('..', format='htr')
model.link_prediction(triplets)
model.entity_embed          # get the entity embeddings
model.relation_embed        # get the relation embeddings

zheng-da avatar May 03 '20 17:05 zheng-da

I shared the API we defined a few months ago, but we haven't had time to implement it. I would like to share it with the community and ask for feedback.

@AlexMRuch, as a user, do you find this kind of API intuitive? As for the evaluation API, is this what you had in mind? Feel free to propose your ideas and give us feedback on the other APIs. Thanks.

zheng-da avatar May 03 '20 17:05 zheng-da

@AlexMRuch please feel free to open another ticket to discuss the evaluation API.

zheng-da avatar May 03 '20 17:05 zheng-da

Wonderful. Thanks! Given the information you posted above, perhaps we can just continue the API setup and evaluation discussion here, as it seems like this will involve creating the objects you described.

The API seems pretty clear for me and is very similar to what I had in mind; however, a few things are unclear.

  1. model.save('/path/to/save/model') <-- What does this save if the entity.npy, relation.npy, and config.json files for the pre-trained model already exist? It seems like this should only be invoked if the model class is going to be trained and save the *.npy and config.json files; if that's the case, shouldn't the fit method come between model.attach_data(kg) and model.save('/path/to/save/model')?
  2. The specifics of running a warm-up for the model should be clarified. For example, how many steps does the model run for warming up? This may also be a good place to add a search function to find the best lr value (e.g., https://docs.fast.ai/callbacks.lr_finder.html).
  3. model.predict() should allow a matrix of canonical tuples as well as individual canonical tuples – correct? Also, does predict here refer to link prediction? If so, maybe it should be renamed to link_prediction so there can also be an entity_classification method. Alternatively, the method could take a second argument ("link", "source", or "destination") and expect triples in hrt format (or take the format as a third argument that defaults to hrt). I've sketched some hypothetical signatures below.
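
To make point 3 concrete, these names and arguments are purely illustrative, not the real API:

class KGEModel:
    def link_prediction(self, triplets):
        """Score a single (h, r, t) triple or a batch of them."""
        ...

    def entity_prediction(self, heads, rels, k=10):
        """Return the top-k most likely destination entities for each (h, r) pair."""
        ...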

Is there any interest in adding visualization (e.g., reduction to 2D or 3D with UMAP)? That could be another method; however, that would also add another installation requirement for users and may be hard on memory (vs. running it in a new kernel separately on the *.npy files).

Hope those suggestions help and that you get other useful feedback from the community!

Please let me know when you've heard back from others and when you'd like me to try and contribute some code to this effort. Thanks!

AlexMRuch avatar May 03 '20 18:05 AlexMRuch

  1. The Python API will require users to call model.save() to save models explicitly. I think that's what the confusion was. I have moved save() after fit(). Hopefully it's clearer now.

  2. Here the warmup is a little different from the warm-up strategy used in model training, although you can use it that way. Here we just want to give users an option to continue training a model from a previously saved checkpoint. I have changed it in the API definition.

  3. Thanks for your suggestion. I have updated the API and called it link_prediction. However, how do we do entity classification? Should we first train a classification model on top of the embeddings?

Yes, visualization is definitely desired. Do you have any suggestions for good visualization tools for large graphs?

zheng-da avatar May 04 '20 02:05 zheng-da

  1. The Python API will require users to call model.save() to save models explicitly. I think that's what the confusion was. I have moved save() after fit(). Hopefully it's clearer now.

Yes, much clearer! Thank you. I agree that calling model.save() explicitly is what people should do.

  2. Here the warmup is a little different from the warm-up strategy used in model training, although you can use it that way. Here we just want to give users an option to continue training a model from a previously saved checkpoint. I have changed it in the API definition.

Ah, yes, that's a great idea – restarting training from a checkpoint. This will be very useful for how I plan to use dgl-ke in my work.

If we wanted to add something like lr_finder to the model later, that's an option, but it definitely seems like less of a priority compared to the other things that need to be done first.
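
For reference, the core of an lr_finder-style range test is tiny; here's a bare-bones sketch (assuming a generic PyTorch-style model, loader, and loss function, nothing dgl-ke-specific):

import torch

def lr_range_test(model, loader, loss_fn, lr_lo=1e-6, lr_hi=1.0, num_iters=100):
    # Sweep the learning rate geometrically from lr_lo to lr_hi,
    # recording the loss at each step; plot loss vs. lr afterwards
    # and pick a value just before the loss blows up.
    opt = torch.optim.SGD(model.parameters(), lr=lr_lo)
    mult = (lr_hi / lr_lo) ** (1.0 / num_iters)
    lr, lrs, losses = lr_lo, [], []
    for i, batch in enumerate(loader):
        if i >= num_iters:
            break
        opt.param_groups[0]['lr'] = lr
        opt.zero_grad()
        loss = loss_fn(model, batch)
        loss.backward()
        opt.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= mult
    return lrs, losses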

  3. Thanks for your suggestion. I have updated the API and called it link_prediction. However, how do we do entity classification? Should we first train a classification model on top of the embeddings?

What I mean by link prediction is where you have a source entity and a destination entity and you want to predict whether a particular kind of relation edge exists between them (i.e., are two Twitter users connected by a Retweet edge?).

What I mean by entity prediction is where you have a source entity and a relation edge and you want to predict the most likely destination entity for that source-relation pair (i.e., who is a given Twitter user most likely to retweet, or what is the list of top-k most likely destination entities?). So this is not really a "classification" problem – my mistake. It should be called entity_prediction and would probably just be a KNN problem given the source entity and relational edge.

I hope that makes sense and that you agree with these ideas. I believe the idea of using KNN for entity prediction is why graphvite required installing the faiss library.
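
In code, the faiss version of entity prediction would look something like this sketch (the h + r query and the exact-L2 index are my assumptions based on TransE-style geometry):

import faiss
import numpy as np

entity_emb = np.load('entity.npy').astype('float32')      # hypothetical path
relation_emb = np.load('relation.npy').astype('float32')  # hypothetical path

index = faiss.IndexFlatL2(entity_emb.shape[1])  # exact L2 nearest-neighbor search
index.add(entity_emb)

def entity_prediction(head, rel, k=10):
    # Nearest entities to h + r are the most likely destinations under TransE.
    query = (entity_emb[head] + relation_emb[rel]).reshape(1, -1)
    _, ids = index.search(query, k)
    return ids[0]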

Yes, visualization is definitely desired. Do you have any suggestions for good visualization tools for large graphs?

In my work I usually use UMAP to knock the embedding dimensions down to 2 or 3 and then just use matplotlib or seaborn to visualize the results. I've done this on over 5 million nodes embedded with metapath2vec and it worked fine (see Figures 5 and 10 in https://arxiv.org/pdf/2001.01126.pdf).

I haven't visualized knowledge graphs yet, however, so I don't know what extra complexity their embeddings add compared to metapath2vec (e.g., do we need to do anything in particular to account for the relation embeddings if we want to jointly visualize entity and relation embeddings, or should we just map the two sets of embeddings separately?).

For what it's worth, I also plan on using HDBSCAN to cluster the entity embeddings in my work (I'll probably reduce the dimensions to < 64 with UMAP first, though, to improve the compute time of the HDBSCAN algorithm). I did that on the same network I'm running the KGE model on now (with 10M entities and 100M edges), and it returned very nice results in addition to nice UMAP 2D and 3D visualizations: https://www.graphika.com/posts/deep-learning-at-graphika-scaling-network-maps-with-heterogeneous-graph-embedding/.
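
Concretely, my pipeline is roughly the following sketch (using the umap-learn and hdbscan packages; the file path and parameter values are just placeholders):

import numpy as np
import umap
import hdbscan
import matplotlib.pyplot as plt

entity_emb = np.load('entity.npy')  # hypothetical path

# Reduce to a moderate dimensionality first so HDBSCAN runs quickly...
emb_low = umap.UMAP(n_components=32, metric='cosine').fit_transform(entity_emb)
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(emb_low)

# ...then reduce to 2D for plotting, colored by cluster.
emb_2d = umap.UMAP(n_components=2, metric='cosine').fit_transform(entity_emb)
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=labels, s=1, cmap='Spectral')
plt.show()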

AlexMRuch avatar May 04 '20 14:05 AlexMRuch

lr_finder seems useful. I think we'll include it in a future release. I was reading about how to pick the right learning rate (for a very different purpose) and came across this function. I'm wondering how effective it is in your experience for KGE training?

Thanks for your suggestions on visualization. Your visualization looks very cool. The team will investigate visualization tools and try out the ones you suggested. I think we'll need your help.

zheng-da avatar May 04 '20 15:05 zheng-da

lr_finder seems useful. I think we'll include it in a future release. I was reading about how to pick the right learning rate (for a very different purpose) and came across this function. I'm wondering how effective it is in your experience for KGE training?

Sounds great! I haven't used it for KGE before but have used it for other tasks (e.g., multi-label NLP classification). I presume it should port over relatively easily to KG tasks, given that you can frame the problem the same way: which learning rate best minimizes the loss.

Thanks for your suggestions on visualization. Your visualization looks very cool. The team will investigate visualization tools and try out the ones you suggested. I think we'll need your help.

You're very welcome! Very happy to help where I can. Please don't hesitate to reach out!!

AlexMRuch avatar May 04 '20 17:05 AlexMRuch

I am no longer working at the same company where I used this library.

On Wed, Jul 14, 2021, 8:09 PM Nabila Abraham wrote:

hi @zheng-da and @AlexMRuch - is there any progress on this feature request?

AlexMRuch avatar Jul 15 '21 11:07 AlexMRuch

Hey guys,

I couldn't figure out whether the API has been released yet; I assume not. I really like how you defined the API above. I would also suggest that, for training on a user's own knowledge graph, the input could directly be an RDF graph via RDFLib. Would that be possible, and when can one expect the API to be released?

Best regards,
Chris

ChrisDelClea avatar Jul 27 '22 11:07 ChrisDelClea