Add dataset descriptions to model

Open cmungall opened this issue 5 years ago • 3 comments

Following on from @cbizon's suggestion today

We can follow the HCLS dataset description model that @micheldumontier worked on.

We are currently refactoring the monarch dataset descriptions to conform to this: https://github.com/monarch-initiative/dipper/issues/792

The basic idea is that each dataset description sits at one of 3 levels {distribution, version, abstract} and is either {source, derived} (pre- or post-ETL). Ideally assertions are linked to the derived distribution, but we should allow linking to any level to make it easy for others in Translator.

For example, if you just want to link an assertion to an abstract source such as "DrugBank", you can do that.
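As a rough illustration of the levels described above, here is a minimal Python sketch. It is not Biolink or HCLS syntax: the class names, field names (`derived_from`, `provided_by`) and all `ex:` identifiers are hypothetical, chosen only to show that an assertion can point at a description of any level.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Level(Enum):
    ABSTRACT = "abstract"          # the resource as a whole, e.g. "DrugBank"
    VERSION = "version"            # one specific release
    DISTRIBUTION = "distribution"  # one concrete file or serialization


class Stage(Enum):
    SOURCE = "source"    # before ETL
    DERIVED = "derived"  # after ETL


@dataclass(frozen=True)
class DatasetDescription:
    id: str
    level: Level
    stage: Stage
    derived_from: Optional[str] = None  # id of the upstream description, if any


@dataclass(frozen=True)
class Assertion:
    subject: str
    predicate: str
    object: str
    provided_by: str  # id of a DatasetDescription at ANY level


# Hypothetical identifiers, for illustration only.
drugbank = DatasetDescription("ex:drugbank", Level.ABSTRACT, Stage.SOURCE)
kg_distribution = DatasetDescription(
    "ex:kg-drugbank-distribution",
    Level.DISTRIBUTION,
    Stage.DERIVED,
    derived_from=drugbank.id,
)
by_id = {d.id: d for d in (drugbank, kg_distribution)}

# Coarse link: the assertion only says "this came from DrugBank".
coarse_link = Assertion("ex:drug1", "bl:affects", "ex:entity1", provided_by=drugbank.id)
# Precise link: the assertion points at the derived distribution.
fine_link = Assertion("ex:drug1", "bl:affects", "ex:entity1", provided_by=kg_distribution.id)
```

Keeping `provided_by` as a plain identifier, rather than requiring a distribution, is what lets a contributor start with the coarse link and refine it later.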

Note that the Translator ecosystem has multiple levels of indirection, so it should be possible to traverse recursively through derivedFrom links.
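The recursive traversal could look something like the sketch below. It assumes the derivedFrom links are available as a plain mapping from a dataset id to the ids it was derived from; the function name and the `ex:` identifiers are made up for illustration.

```python
from collections import deque


def upstream_sources(dataset_id, derived_from):
    """Return every dataset reachable via derived_from links, nearest first.

    derived_from maps a dataset id to the list of ids it was derived from.
    Already-visited ids are skipped, so cycles do not cause infinite loops.
    """
    seen = {dataset_id}
    order = []
    queue = deque(derived_from.get(dataset_id, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(derived_from.get(current, ()))
    return order


# Hypothetical chain: one KG built from another KG, itself built from two sources.
DERIVED_FROM = {
    "ex:kg-b": ["ex:kg-a"],
    "ex:kg-a": ["ex:drugbank-derived", "ex:hgnc-derived"],
    "ex:drugbank-derived": ["ex:drugbank-source"],
    "ex:hgnc-derived": ["ex:hgnc-source"],
}

print(upstream_sources("ex:kg-b", DERIVED_FROM))
```

A breadth-first walk is used here so that the closest upstream datasets come first in the result; a depth-first walk would work equally well if ordering does not matter.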

In some cases, a Translator KG will use an API rather than a dataset as its source, so we should abstract sufficiently here.

cmungall Aug 13 '19 20:08

For comparison

https://schema.org/DataCatalog https://schema.org/Dataset https://schema.org/DataDownload

These clearly correspond to the HCLS levels.

cmungall Aug 13 '19 20:08

We are using HCLS descriptive metadata to describe our ncats-red-kg.

We have SPARQL queries that automatically generate descriptive statistics for a given graph (number of instances and properties, which relations hold between the entities...). Check out the README.md: https://github.com/MaastrichtU-IDS/data2services-transform-repository/tree/master/sparql/compute-hcls-stats

Note that, in our triplestore, each graph ("context" in GraphDB) corresponds to a different data source (drugbank, hgnc, pathwaycommons...).

In the README.md you can also find 2 queries that describe the content of each graph in a triplestore where the graphs have been properly described as HCLS datasets and the statistics have been computed. One covers the generic content of each graph (number of statements, entities, properties, classes) and the other gives insight into which relations can be found between which entities in each graph (e.g. drugbank 100,000 bl:Drug bl:affects 1,000 bl:OrganismalEntity).
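To make the two kinds of statistics concrete, here is a small in-memory Python sketch. It is not the SPARQL from the README: it works on a hand-written list of (subject, predicate, object, graph) tuples, and it assumes that "entities" means typed subjects and that the two numbers in a relation summary are distinct subject and object counts. The function name and the `ex:` identifiers are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def graph_stats(quads):
    """Compute simple per-graph statistics from (s, p, o, graph) tuples."""
    by_graph = defaultdict(list)
    for s, p, o, g in quads:
        by_graph[g].append((s, p, o))

    stats = {}
    for g, triples in by_graph.items():
        # Map each typed entity to the set of classes it belongs to.
        types = defaultdict(set)
        for s, p, o in triples:
            if p == RDF_TYPE:
                types[s].add(o)

        # For each (subject class, predicate, object class) pattern,
        # collect the distinct subjects and objects taking part in it.
        subjects = defaultdict(set)
        objects = defaultdict(set)
        for s, p, o in triples:
            if p == RDF_TYPE:
                continue
            for s_class in types.get(s, ()):
                for o_class in types.get(o, ()):
                    key = (s_class, p, o_class)
                    subjects[key].add(s)
                    objects[key].add(o)

        stats[g] = {
            "statements": len(triples),
            "entities": len(types),
            "properties": len({p for _, p, _ in triples}),
            "classes": len({c for cs in types.values() for c in cs}),
            "relations": {k: (len(subjects[k]), len(objects[k])) for k in subjects},
        }
    return stats


# Tiny made-up sample: two drugs affecting one organismal entity.
QUADS = [
    ("ex:d1", RDF_TYPE, "bl:Drug", "ex:drugbank"),
    ("ex:d2", RDF_TYPE, "bl:Drug", "ex:drugbank"),
    ("ex:o1", RDF_TYPE, "bl:OrganismalEntity", "ex:drugbank"),
    ("ex:d1", "bl:affects", "ex:o1", "ex:drugbank"),
    ("ex:d2", "bl:affects", "ex:o1", "ex:drugbank"),
    ("ex:g1", RDF_TYPE, "bl:Gene", "ex:hgnc"),
]

STATS = graph_stats(QUADS)
```

On this sample, the drugbank graph yields one relation pattern, `("bl:Drug", "bl:affects", "bl:OrganismalEntity")`, with 2 distinct subjects and 1 distinct object.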

We are also working on making sure the metadata provided are sufficient to make the dataset FAIR, and we would like our workflows (CWL, Argo) to automatically enrich those metadata depending on the operations performed on the data (to keep track of the data transformations).

Is there already a minimal set of metadata that we need to provide in the Translator project for each integrated data source? Maybe we should make sure we agree on how to describe our graphs and datasets?

It could be interesting if every data source provided extended HCLS metadata. For example, knowing which relations link the entities in each graph would allow us to answer questions like "I would like to know in which graph I can find drug-to-gene downregulation associations, and how many of them".
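A question like that becomes a simple lookup once each graph publishes its relation statistics. The sketch below assumes a flat list of per-graph relation summaries; the record layout, the predicate `ex:downregulates`, and all counts are invented for illustration and do not come from any real graph.

```python
from typing import NamedTuple


class RelationStat(NamedTuple):
    graph: str
    subject_class: str
    predicate: str
    object_class: str
    count: int  # number of such associations in the graph


def graphs_with_relation(stats, subject_class, predicate, object_class):
    """Return (graph, count) pairs for a relation pattern, largest count first."""
    matches = [
        (s.graph, s.count)
        for s in stats
        if (s.subject_class, s.predicate, s.object_class)
        == (subject_class, predicate, object_class)
    ]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Made-up statistics, for illustration only.
RELATION_STATS = [
    RelationStat("ex:graph-a", "bl:Drug", "ex:downregulates", "bl:Gene", 120),
    RelationStat("ex:graph-b", "bl:Drug", "ex:downregulates", "bl:Gene", 4500),
    RelationStat("ex:graph-b", "bl:Drug", "bl:affects", "bl:OrganismalEntity", 30),
]

print(graphs_with_relation(RELATION_STATS, "bl:Drug", "ex:downregulates", "bl:Gene"))
```

In a triplestore the same lookup would be a query over the stored statistics rather than over the data itself, which is the point of computing them ahead of time.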

If you have any questions about the SPARQL queries that compute the HCLS descriptive statistics, feel free to contact me. It should be pretty straightforward, but we might need to update a few things.

You can find an example of SPARQL queries to generate HCLS metadata for PharmGKB, to be run before computing the metadata on the PharmGKB graph URI (currently described by dcat:accessURL at the RDF distribution level; this will probably change):

I can do a short presentation next Tuesday to show what we have at the moment, if you want. I am working on a web GUI, to be deployed over a triplestore where graphs are described using HCLS descriptive statistics, to allow easy exploration of that triplestore.

vemonet Aug 14 '19 11:08

What is the status of this? Is this still wanted?

nlharris Aug 26 '21 23:08

Closing for now as work progresses on "supporting datasets" and related properties. We also have our Information Resource (infores) properties.

sierra-moxon Nov 08 '22 00:11