
Graphs to be added as automatic retrieval

Open LucaCappelletti94 opened this issue 4 years ago • 13 comments

We would like to add some more graphs to the automatic retrieval mechanism.

Currently, we support only StringPPI (human version), CompleteStringPPI (cross-species) and KG-COVID-19.

Which graphs should we add to the list? The requirements for the graph are:

  1. Must be publicly available at a URL that can be fetched with wget.
  2. Must be a TSV/CSV/text file with separators.
  3. The server where it is hosted must be reasonably fast.
  4. Can be a zip, gzip, tar.gz or plain file.
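As a rough illustration of requirements 2 and 4, a candidate URL and a small sample of its contents could be pre-screened along these lines. This is only a sketch: `archive_kind` and `guess_delimiter` are hypothetical helper names, not part of ensmallen, and the candidate delimiter list is an assumption. Requirements 1 and 3 (reachability and server speed) are not checked here.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumed set of separators worth trying; not taken from the issue.
CANDIDATE_DELIMITERS = ("\t", ",", ";", " ")


def archive_kind(url: str) -> str:
    """Classify a URL by its suffix: 'tar.gz', 'zip', 'gzip' or 'plain'."""
    path = urlparse(url).path.lower()
    # Check '.tar.gz' before '.gz', since the former also ends with '.gz'.
    if path.endswith(".tar.gz"):
        return "tar.gz"
    if path.endswith(".zip"):
        return "zip"
    if path.endswith(".gz"):
        return "gzip"
    return "plain"


def guess_delimiter(sample: str) -> Optional[str]:
    """Return the first candidate separator that appears the same
    non-zero number of times on every non-empty line, else None."""
    lines = [line for line in sample.splitlines() if line.strip()]
    if not lines:
        return None
    for delimiter in CANDIDATE_DELIMITERS:
        counts = {line.count(delimiter) for line in lines}
        if len(counts) == 1 and next(iter(counts)) > 0:
            return delimiter
    return None
```

A file whose sample yields `None` from `guess_delimiter` would need manual inspection before being proposed.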

LucaCappelletti94 avatar Dec 29 '20 16:12 LucaCappelletti94

Karate

LucaCappelletti94 avatar Jan 07 '21 16:01 LucaCappelletti94

Finish adding graphs from Network Repository.

Most of these are now available. We still need to add support for timestamp graphs and graphs with multi-labelled nodes.

LucaCappelletti94 avatar Jan 14 '21 21:01 LucaCappelletti94

Added graphs from kghub with this pull request.

LucaCappelletti94 avatar Jan 14 '21 21:01 LucaCappelletti94

@LucaCappelletti94 - what about including any of the resources in this table (also copied below)? If you want, we could select a few to start with. I'd be happy to write some simple code that converts them from PyKEEN format into the spec you provide above. Let me know what you think!

I put a ➡️ next to the ones I think are worth starting with and a ⭐ next to others worth considering for future incorporation.

| Name | Reference | Description |
| --- | --- | --- |
| ⭐ ckg | pykeen.datasets.CKG | The Clinical Knowledge Graph (CKG) dataset from [santos2020]_. |
| ➡️ codexlarge | pykeen.datasets.CoDExLarge | The CoDEx large dataset. |
| codexmedium | pykeen.datasets.CoDExMedium | The CoDEx medium dataset. |
| codexsmall | pykeen.datasets.CoDExSmall | The CoDEx small dataset. |
| ➡️ conceptnet | pykeen.datasets.ConceptNet | The ConceptNet dataset from [speer2017]_. |
| ⭐ cskg | pykeen.datasets.CSKG | The CSKG dataset. |
| ⭐ drkg | pykeen.datasets.DRKG | The DRKG dataset. |
| fb15k | pykeen.datasets.FB15k | The FB15k dataset. |
| fb15k237 | pykeen.datasets.FB15k237 | The FB15k-237 dataset. |
| ⭐ hetionet | pykeen.datasets.Hetionet | The Hetionet dataset is a large biological network. |
| ➡️ kinships | pykeen.datasets.Kinships | The Kinships dataset. |
| nations | pykeen.datasets.Nations | The Nations dataset. |
| ogbbiokg | pykeen.datasets.OGBBioKG | The OGB BioKG dataset. |
| ogbwikikg | pykeen.datasets.OGBWikiKG | The OGB WikiKG dataset. |
| ➡️ openbiolink | pykeen.datasets.OpenBioLink | The OpenBioLink dataset. |
| openbiolinkf1 | pykeen.datasets.OpenBioLinkF1 | The PyKEEN First Filtered OpenBioLink 2020 Dataset. |
| openbiolinkf2 | pykeen.datasets.OpenBioLinkF2 | The PyKEEN Second Filtered OpenBioLink 2020 Dataset. |
| openbiolinklq | pykeen.datasets.OpenBioLinkLQ | The low-quality variant of the OpenBioLink dataset. |
| umls | pykeen.datasets.UMLS | The UMLS dataset. |
| wn18 | pykeen.datasets.WN18 | The WN18 dataset. |
| wn18rr | pykeen.datasets.WN18RR | The WN18-RR dataset. |
| yago310 | pykeen.datasets.YAGO310 | The YAGO3-10 dataset is a subset of YAGO3 that only contains entities with at least 10 relations. |
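
The conversion offered above could look roughly like the following. This is a minimal sketch, assuming the triples have already been extracted from a dataset as (head, relation, tail) label strings; it deliberately does not call PyKEEN itself. The function name `write_edge_list` and the column names `subject`, `predicate`, `object` are hypothetical choices, not something the issue specifies.

```python
import csv
from typing import Iterable, TextIO, Tuple

# A labelled triple: (head entity, relation, tail entity).
Triple = Tuple[str, str, str]


def write_edge_list(triples: Iterable[Triple], handle: TextIO) -> int:
    """Write triples as a tab-separated edge list with a header row.

    Returns the number of edges written (the header is not counted).
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Hypothetical header; adapt it to whatever the loader expects.
    writer.writerow(["subject", "predicate", "object"])
    count = 0
    for head, relation, tail in triples:
        writer.writerow([head, relation, tail])
        count += 1
    return count
```

Writing to any text handle (an open file or an `io.StringIO` buffer) keeps the helper easy to test, and the resulting file can then be gzipped to satisfy the hosting requirements.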

callahantiff avatar Jan 27 '21 14:01 callahantiff

Also, see the datasets listed in this KG Embedding Review on the bottom of page 22. These are datasets that are most frequently used by people developing new KG embedding methods:

[Image: Screen Shot 2021-01-27 at 11 51 46]

callahantiff avatar Jan 27 '21 18:01 callahantiff

Thank you, @callahantiff! We still need to add support for nodes with multiple classes (for example, a node that belongs to both class mammal and class cat). We plan to add support for these and other node and edge features, but we will work on them after finishing Grape. Do you know whether these graphs have multiple classes per node, or nodes of different classes with each node belonging to a single class? [UPDATE 2021/04/19] We have support for multi-labeled nodes in graphs now!

If it's the second option, then we can already support all of the graphs under consideration. How hard is it to convert them into a CSV-like format? And more importantly, where could we host them? Maybe on kg-hub? Would that be an option, @justaddcoffee?
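
To make the distinction between the two cases concrete, here is a small sketch that checks whether a node list contains any multi-labelled node. It assumes node types arrive as a single field with labels joined by `|`; that separator and the helper names `parse_node_types` and `has_multi_label_nodes` are illustrative assumptions, not ensmallen's API.

```python
from typing import Dict, Iterable, List, Tuple


def parse_node_types(
    rows: Iterable[Tuple[str, str]], separator: str = "|"
) -> Dict[str, List[str]]:
    """Map each node name to its list of labels, dropping empty labels."""
    return {
        name: [label for label in field.split(separator) if label]
        for name, field in rows
    }


def has_multi_label_nodes(node_types: Dict[str, List[str]]) -> bool:
    """True if at least one node carries more than one distinct label."""
    return any(len(set(labels)) > 1 for labels in node_types.values())


# Toy data, made up for illustration only.
single_class = parse_node_types([("felix", "cat"), ("rex", "dog")])
multi_class = parse_node_types([("felix", "mammal|cat"), ("rex", "dog")])
```

Here `single_class` is the "second option" (several classes overall, one per node), while `multi_class` contains a node that is both mammal and cat.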

LucaCappelletti94 avatar Jan 28 '21 13:01 LucaCappelletti94

My general feeling is that we can and should allow easy ingest of remote graphs as we are discussing here.

But I think we should avoid hosting other people's graphs on KG-hub unless they are transformed versions that we are incorporating into our own knowledge graphs (like our ChEMBL transform that we include in KG-COVID-19).

Glad to discuss, though.

justaddcoffee avatar Jan 28 '21 16:01 justaddcoffee

Import graphs, after adding support for time intervals, from http://www.sociopatterns.org/

LucaCappelletti94 avatar Feb 03 '21 21:02 LucaCappelletti94

Hey @LucaCappelletti94, as discussed earlier, here's the link for the kg-microbe graphs: https://kg-hub.berkeleybop.io/kg-microbe/20210422/kg-microbe.tar.gz

hrshdhgd avatar Apr 27 '21 13:04 hrshdhgd

Thank you @hrshdhgd!

LucaCappelletti94 avatar Apr 27 '21 14:04 LucaCappelletti94

Hi @hrshdhgd, sorry for the long wait. All versions of KG-Microbe and KG-COVID are now integrated in the automatic retrieval.

LucaCappelletti94 avatar Jul 30 '21 07:07 LucaCappelletti94

No problem, @LucaCappelletti94, thank you very much!

hrshdhgd avatar Jul 30 '21 14:07 hrshdhgd

I am now iterating once more on the graphs available through the automatic graph retrieval (we are now at over 80K downloadable graphs). Do you have more suggestions?

LucaCappelletti94 avatar Jun 14 '22 09:06 LucaCappelletti94