the-fair-cookbook

How to cross reference datasets

Open AlasdairGray opened this issue 3 years ago • 12 comments

Table of Contents

  1. Main FAIRification Objectives
  2. User Stories
  3. Capability & Maturity Table

Main Objectives

The main purpose of this recipe is:

Enable a dataset to provide cross-references to other datasets in order to increase data linkage.

By adding data links to a dataset, the data becomes more interoperable as it provides links to equivalent objects in other datasets.


User Stories

Capability & Maturity Table

| Capability | Initial Maturity Level | Final Maturity Level |
| --- | --- | --- |
| Interoperability | ??? | repeatable |

License:

AlasdairGray avatar Sep 07 '20 09:09 AlasdairGray

Thanks Alasdair! I will bring in @mcourtot and @FuqiX who can provide some user scenarios from the BioSamples database, and I suspect @weiguUL and @daniwelter may be able to provide some use cases from the IMI data catalog...?

tburdett avatar Sep 07 '20 20:09 tburdett

Hi, Alasdair.

In BioSamples we support two types of data cross-references: linking to other samples by adding sample relationships, and linking to external datasets (ENA, EGA, ArrayExpress, etc.) using dataset URLs.

Example here

Users can find related datasets in each sample record and search for samples which have external data in ENA, etc.

We also have a graph search function to support querying related samples. An example query would be, "find all tissue samples that are derived from a donor sample and have data in ENA."
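As a rough illustration of the example query above, here is a minimal Python sketch over a tiny in-memory set of sample records. It does not use the BioSamples API; the accessions, the field names (`type`, `derived_from`, `external_links`) and the helper function are all hypothetical and only show the shape of the query.

```python
# Hypothetical sample records: accessions and field names are invented
# for illustration and do not reflect the real BioSamples data model.
samples = {
    "SAMPLE_DONOR_1": {"type": "donor", "derived_from": None, "external_links": []},
    "SAMPLE_TISSUE_1": {"type": "tissue", "derived_from": "SAMPLE_DONOR_1", "external_links": ["ENA"]},
    "SAMPLE_TISSUE_2": {"type": "tissue", "derived_from": "SAMPLE_DONOR_1", "external_links": ["EGA"]},
    "SAMPLE_TISSUE_3": {"type": "tissue", "derived_from": None, "external_links": ["ENA"]},
}


def tissue_from_donor_with_archive(records, archive):
    """Return tissue samples derived from a donor sample that link to `archive`."""
    matches = []
    for accession, record in records.items():
        # Follow the sample relationship to the parent record, if any.
        parent = records.get(record["derived_from"])
        if (
            record["type"] == "tissue"
            and parent is not None
            and parent["type"] == "donor"
            and archive in record["external_links"]
        ):
            matches.append(accession)
    return sorted(matches)


print(tissue_from_donor_with_archive(samples, "ENA"))
```

The query combines both kinds of cross-reference described above: the sample relationship (`derived_from`) and the external dataset link (`external_links`).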

FuqiX avatar Sep 08 '20 05:09 FuqiX

Hi all,

The IMI data catalog currently doesn't have much in the way of cross-referencing, but we're working on moving to the DATS model (or some flavour thereof). In DATS, most entities have a set of optional fields called identifier, alternate identifier and related identifier. The latter addresses exactly this cross-referencing use case, so I hope we'll be able to provide some examples soon. In the meantime, I think ArrayExpress has some very good examples as well.
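To make the three fields concrete, here is a small Python sketch of a dataset record that carries an identifier, alternate identifiers and related identifiers. The property names are loosely modelled on DATS and should be checked against the DATS schema before use; the identifier values, sources and relation types are made up.

```python
# Illustrative record only: property names are an assumption loosely based
# on DATS, and every identifier value below is invented.
dataset = {
    "identifier": {"identifier": "EXAMPLE:0001", "identifierSource": "example-catalog"},
    "alternateIdentifiers": [
        {"identifier": "ALT-42", "identifierSource": "legacy-catalog"},
    ],
    "relatedIdentifiers": [
        {"identifier": "ARCHIVE:A1", "identifierSource": "example-archive", "relationType": "isDerivedFrom"},
        {"identifier": "SAMPLES:S7", "identifierSource": "example-samples", "relationType": "references"},
    ],
}


def cross_references(record, relation_type=None):
    """List (source, identifier) pairs from the related identifiers,
    optionally keeping only those with the given relation type."""
    related = record.get("relatedIdentifiers", [])
    return [
        (entry["identifierSource"], entry["identifier"])
        for entry in related
        if relation_type is None or entry.get("relationType") == relation_type
    ]


print(cross_references(dataset))
print(cross_references(dataset, relation_type="references"))
```

Keeping the relation type next to each related identifier lets a consumer distinguish, for example, a source dataset from a merely referenced one.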

daniwelter avatar Sep 09 '20 09:09 daniwelter

Inside EBI we've also been thinking about how to standardise "linkset" relationships, so that e.g. links between samples and data archives can be defined and exchanged between services (@mcourtot might want to chip in here). This is also something we've discussed previously in a BioSchemas context. Some guidance on standardising how we capture, describe and exchange links between metadata records and datasets, so that resources can exchange them and users can take advantage of them in queries, would be a really worthwhile goal.

tburdett avatar Sep 09 '20 10:09 tburdett

We should probably also mention the ongoing OBO Foundry work in this area, which is defining the SSSOM guidelines for exchanging cross-references as a csv file.
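As a sketch of what exchanging cross-references in such a file could look like, the Python snippet below reads a small delimited mapping table and keeps only the exact matches. The column names `subject_id`, `predicate_id` and `object_id` follow SSSOM; the identifiers, the `MAPPINGS` sample and the `load_exact_matches` helper are hypothetical. SSSOM files as published by the project are tab-separated, so pass `delimiter="\t"` for those.

```python
import csv
import io

# Invented mapping rows; only the column names are taken from SSSOM.
MAPPINGS = """subject_id,predicate_id,object_id
EX_A:001,skos:exactMatch,EX_B:9001
EX_A:002,skos:closeMatch,EX_B:9002
EX_A:003,skos:exactMatch,EX_B:9003
"""


def load_exact_matches(text, delimiter=","):
    """Parse a delimited mapping table and return {subject_id: object_id}
    for rows whose predicate is skos:exactMatch."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return {
        row["subject_id"]: row["object_id"]
        for row in reader
        if row["predicate_id"] == "skos:exactMatch"
    }


exact = load_exact_matches(MAPPINGS)
print(exact)
```

Filtering on the predicate keeps weaker relations such as `skos:closeMatch` from being treated as identity when two datasets are joined.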

AlasdairGray avatar Sep 09 '20 10:09 AlasdairGray

+1 @AlasdairGray , something @Chris-Evelo mentioned too.

proccaserra avatar Sep 09 '20 10:09 proccaserra

@AlasdairGray @egonw @Chris-Evelo @nicklynch This relates to 3.3.1.5. Vocabulary mapping - semantic in the FAIRcookbook.

I can see already several possible sections or several recipes.

  1. A generic recipe detailing the nature of the problem: entity mapping + ontology mapping. Here, insight from the Pistoia Alliance Mapping project would be nice to have. @nicklynch, what about liaising with Ian Harrow on this issue to create content?

  2. Applied examples

  • [OXO] Following on from the other recipes presenting EBI resources (OLS, Zooma), @tburdett @FuqiX @mcourtot, can you cover that aspect by framing it in the context of an IMI project we need to FAIRify (EBISC / BioSamples, I guess, is what you pointed at already)?

  • Another example from tools used in Industry: Again @nicklynch, can we build on what already exists?

  • Chemical Entities / Metabolites: here @egonw @Chris-Evelo, owing to your involvement in Wikidata and your experience, can I put you down for this one? If so, can you provide me with a timeline?

Thank you all

proccaserra avatar Sep 14 '20 15:09 proccaserra

Just rediscovered this ticket again.

@proccaserra we should keep a separation of concerns between schema/ontology mapping alignments and data instance equivalences (database cross-references). Each has its own exchange formats and services.

AlasdairGray avatar Nov 05 '20 15:11 AlasdairGray

@AlasdairGray sure, but it is for you and @Chris-Evelo to clarify this bit. The issue was created based on a very vague description. You are both in charge of these recipes, so feel free to rearrange. Are you both still good with the deadline (end of November)? Many thanks.

proccaserra avatar Nov 05 '20 15:11 proccaserra

Describing the difference and the different needs should not be so hard. And we can likely do that in time indeed.

What comes next are 3 things:

  1. Database identifier mapping (using BridgeDb/BioMart). Lucas will start working next week, which will make a big difference. But the end of November might be a bit close. Lucas was just added to the project and can thus now access the drive. He will also join the squad and cookbook meetings.
  2. Ontology mapping. I asked Tony about that since that of course links to OxO. He had some ideas on how to get that done. As soon as there is a basic recipe Lucas can actually run some tests and evaluate in more practical pipelines.
  3. Develop some actual pipelines (Jupyter notebooks) that use these base recipes. We will do this after at least one of these steps is finished. I discussed with Nick whether we can decide on some relevant cases (e.g. across two datasets) during the next squad face-to-face.

Chris-Evelo avatar Nov 05 '20 16:11 Chris-Evelo

I have completed a first draft of the data identifier mapping recipe. https://github.com/FAIRplus/the-fair-cookbook/blob/id-map-services/docs/content/recipes/interoperability/identifier-mapping.md

AlasdairGray avatar Nov 18 '20 10:11 AlasdairGray

I've finished the first draft of the OxO recipe. Please find the updates in this issue

FuqiX avatar Nov 18 '20 13:11 FuqiX