Merging equivalent Associations in SciGraph

Open mbrush opened this issue 8 years ago • 8 comments

The DIPper pipeline creates oban:Associations for each G2P link it ingests from each data source, dumping these into output ttl files that get loaded into SciGraph. These Associations represent an 'assertion' as made by a specific source or database, but it is possible that more than one source will assert the same Association. For example, MGI and IMPC might both assert the Association between the same genotype (e.g. 'Rn/Rn [C57BL/6]') and phenotype (e.g. MP:0000372 ! 'randomly distributed white hairs'), leading to 'equivalent' Associations being dumped into SciGraph (Figures [A] and [B] at the bottom of the ticket diagram the RDF representations of these Associations). These represent the same underlying Association or fact, as made in two different assertions (one by MGI, one by IMPC).

For purposes of more efficient queries and data operations, it makes sense to collapse these under one Association, and maintain the provenance of the separate assertions in the evidence lines that support the Association. (One model for doing this is diagrammed in [C] below - although there are alternative models for how this merge might look.)

The question is, assuming we want to perform this merge, how and where would we perform it? Equivalent Associations from different sources don’t 'meet' each other until they have left DIPper and entered SciGraph, where data across all sources is aggregated. Some post-DIPper processing step needs to happen at a point after all data that could possibly contain equivalent Associations have been aggregated.

I will toss out some alternative approaches for discussion (bearing in mind my naivety as to the technical feasibility and efficiency of these options):

  1. Prior to dumping into SciGraph, aggregate all ttl files that could contain equivalent Associations and use something like SPARQL transforms to identify equivalent Associations and create new triples materializing the merged Association.
  2. After loading into SciGraph, use whatever tools would work in this setting to identify and merge equivalent Associations.
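
Whichever option is chosen, the core step is the same: group assertions that share a subject, predicate, and object, and keep each source's assertion as an evidence line under the single merged Association (roughly the shape of model [C]). The sketch below is plain Python and is not DIPper or SciGraph code; the names `Assertion`, `MergedAssociation`, and `merge_assertions` are assumptions made up for illustration, and it presumes identifiers have already been made comparable across sources.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    """One source's statement of an Association (hypothetical structure)."""
    source: str      # e.g. "MGI" or "IMPC"
    subject: str     # genotype identifier
    predicate: str   # relation between subject and object
    obj: str         # phenotype or disease identifier
    evidence: str    # placeholder for whatever evidence the source recorded


@dataclass
class MergedAssociation:
    """A single Association with one evidence line per contributing assertion."""
    subject: str
    predicate: str
    obj: str
    evidence_lines: list = field(default_factory=list)


def merge_assertions(assertions):
    """Collapse assertions that share (subject, predicate, object).

    Provenance is preserved: each original assertion is kept, unchanged,
    as an evidence line of the merged Association.
    """
    merged = {}
    for a in assertions:
        key = (a.subject, a.predicate, a.obj)
        if key not in merged:
            merged[key] = MergedAssociation(*key)
        merged[key].evidence_lines.append(a)
    # dicts preserve insertion order, so output follows first appearance
    return list(merged.values())
```

Because the original assertions are retained rather than discarded, the pre-merge view can still be recovered from the merged data if a query needs it.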

Some issues to consider:

  1. Equivalent Associations are recognized by having the same subject genotype, predicate, and object phenotype/disease. And in some cases where environment or stage information is provided (e.g. ZFIN), these represent additional identity criteria for Associations that must also be matched. Recognizing equivalent phenotypes/conditions is straightforward, even if recorded using different identifiers, given the equivalency mappings we have in MONDO and Upheno. But recognizing equivalent genotypes is more challenging, given that they may have different syntax in their labels and different identifiers from their respective sources. And when environment or stage info is included in the Association, we must also consider how to recognize equivalency here.
  2. While it will be exceedingly rare in practice that different model organism data sources assert the exact same Association between the same G and P, it is quite common in human G2P data to see the same Association asserted by many sources. This is exemplified in ClinVar, which aggregates assertions (SCVs) from many databases/sources, where many represent the same general Association. The example here shows a 'pathogenic for' Association between the variant NM_000059.3(BRCA2):c.5946del and the disease Breast-ovarian cancer, familial that is asserted by seven sources (each of which may base its assertion on different evidence and criteria). Our approach for merging such assertions of the same Association should consider use cases around filtering for data that excludes assertions from specific sources/organizations, or includes only Associations based on evidence of a given type. Where we perform the merge should accommodate such assertion/evidence level queries being performed that may require pre-merged data (or the merged data should be constructed so as to accommodate such assertion/evidence level queries).
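
The two issues above can be sketched together: an identity key that resolves identifiers through an equivalence mapping and includes environment and stage when present, plus a source filter that runs on the merged data. This is a minimal illustration under assumptions, not a proposal for the actual implementation. The function names and the flat `equivalents` dictionary are hypothetical stand-ins for the MONDO/Upheno mappings, and the harder problem of recognizing equivalent genotypes is assumed to be solved by whatever populates that mapping.

```python
def canonical(identifier, equivalents):
    """Resolve an identifier to its canonical form; unmapped ids pass through."""
    return equivalents.get(identifier, identifier)


def identity_key(assertion, equivalents):
    """Build the identity criteria for an Association.

    Environment and stage are part of the key, so two assertions only
    match when these agree (both absent also counts as agreeing).
    """
    return (
        canonical(assertion["subject"], equivalents),
        assertion["predicate"],
        canonical(assertion["object"], equivalents),
        assertion.get("environment"),
        assertion.get("stage"),
    )


def merge(assertions, equivalents):
    """Group assertions by identity key, keeping each one as an evidence line."""
    merged = {}
    for a in assertions:
        merged.setdefault(identity_key(a, equivalents), []).append(a)
    return merged


def exclude_sources(merged, excluded):
    """Drop evidence lines from excluded sources.

    An Association that loses all of its evidence lines is removed, which
    mirrors what a query over pre-merged data would return.
    """
    result = {}
    for key, lines in merged.items():
        kept = [a for a in lines if a["source"] not in excluded]
        if kept:
            result[key] = kept
    return result
```

Filtering by evidence type would follow the same pattern as `exclude_sources`, testing a field on each evidence line instead of its source.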

FIGURES

[A] MGI assertion

mgi association

[B] IMPC Assertion of same Association

impc association

[C] MGI and IMPC assertions merged under same Association

merged mgi-impc association

mbrush, Feb 24 '16 18:02