biolink-model Modelling Gene to Gene Family relationships (are additional Biolink Association and predicates needed?)

Open RichardBruskiewich opened this issue 2 years ago • 4 comments

Gene orthology knowledge curation (e.g. from the Panther database) relates gene instances to gene families. It's relatively easy to infer which concept nodes need to be captured, i.e. biolink:Gene and biolink:GeneFamily (to start).

However, the biolink:GeneFamily concept currently seems a bit disconnected from any biolink:Association class.

Do we need to define a new biolink:GeneToGeneFamilyAssociation, or would we simply connect each biolink:Gene to its biolink:GeneFamily using the biolink:has_attribute slot? I guess it depends on the use cases and on the extent to which biolink:GeneFamily instances are annotated with links and related data.

As first class nodes, perhaps such annotation would be easily available in the knowledge graph. On the other hand, it may simply suffice to tag the biolink:Gene with the gene family identifier (e.g. PANTHER.FAMILY curie) then expect end users to access the related knowledge from outside of the graph (e.g. via a link in the UI?).
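To make the two options concrete, here is a minimal sketch in Python using plain dictionaries. All identifiers, the `gene_family` property name and the `EX:member_of` predicate are hypothetical placeholders (the choice of predicate is still open in this issue); none of them are taken from the Biolink Model or from Panther.

```python
# Option A: the family is only a tag on the gene node (no family node, no edge).
gene_with_tag = {
    "id": "EX:gene1",                   # hypothetical gene identifier
    "category": ["biolink:Gene"],
    "gene_family": "EX:family1",        # hypothetical property holding the family CURIE
}

# Option B: the family is a first class node, linked to the gene by an edge.
nodes = [
    {"id": "EX:gene1", "category": ["biolink:Gene"]},
    {"id": "EX:family1", "category": ["biolink:GeneFamily"]},
]
edges = [
    {
        "subject": "EX:gene1",
        "predicate": "EX:member_of",    # placeholder predicate, not a decision
        "object": "EX:family1",
    },
]


def families_of(gene_id, edge_list):
    """Return the family identifiers a gene is linked to in Option B."""
    return [e["object"] for e in edge_list if e["subject"] == gene_id]
```

In Option A the family can only be looked up as a string on the gene; in Option B it can carry its own annotations and be traversed like any other node.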

That said, reasoning over gene families likely involves transferring (GO term?) molecular function, biological process and cellular component inferences across species boundaries (e.g. from genes in model species to human genes). Graph reasoning engines (e.g. TRAPI wrapped?) might find this task easier if the biolink:Gene to biolink:GeneFamily relationship is modelled with first class knowledge graph nodes and edges (i.e. a biolink:Association). Also, biolink:GeneFamily instances may have hierarchical (i.e. subfamily) relationships to one another, so biolink:GeneFamily to biolink:GeneFamily instances of biolink:Association may also be posited.
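As a rough illustration of why explicit gene to family edges could help here, the sketch below proposes candidate annotations for a gene from the other members of its family. It is deliberately naive and is only an assumption about how such reasoning might look: the function name, the input shapes and all identifiers are hypothetical, and a real pipeline would weigh the evidence far more carefully.

```python
from collections import defaultdict


def transfer_annotations(membership, annotations):
    """Propose (gene, term) pairs via shared family membership.

    membership:  iterable of (gene_id, family_id) pairs, i.e. the edges
    annotations: dict mapping gene_id to a set of term ids already asserted
    """
    # Collect every term asserted on any member of each family.
    family_terms = defaultdict(set)
    for gene, family in membership:
        family_terms[family].update(annotations.get(gene, set()))

    # Offer each member the family terms it does not already have.
    inferred = set()
    for gene, family in membership:
        for term in family_terms[family] - annotations.get(gene, set()):
            inferred.add((gene, term))
    return inferred


# Hypothetical data: one mouse gene and one human gene in the same family.
membership = [("EX:mouse_gene", "EX:family1"), ("EX:human_gene", "EX:family1")]
annotations = {"EX:mouse_gene": {"EX:go_term1"}}
candidates = transfer_annotations(membership, annotations)
```

With the family available only as a tag on each gene, the same grouping is still possible, but it has to be rebuilt outside the graph rather than followed as edges.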

One counterpoint argument is simply that such edges may needlessly(?) clutter up the graph somewhat with as many additional edges as there are genes. That said, some knowledge graphs may have use cases supported by such knowledge representations.

Assuming the latter situation, biolink:GeneFamily instances will be documented as first class concept nodes, and perhaps, we would add a new biolink:GeneToGeneFamilyAssociation class to assert set membership of genes into such families.

The next question to arise is which biolink:predicate should be used in such associations?

The biolink:related_to seems a bit too general.

A biolink:GeneFamily could be construed as a kind of conceptual grouping of genes. This suggests that these associations are anchored on biolink:related_to_at_concept_level, or perhaps that one of its child predicates (biolink:narrow_match or biolink:subclass_of) could be applied, but these predicates don't seem quite appropriate either.

Within the biolink:related_to_at_instance_level predicate space, some terms could apply if their English language definition is loosely assumed, but the strict Biolink Model scoping of the definitions of most (all?) such terms seems to exclude them from consideration. For example, the definition of biolink:part_of says "...holds between parts and wholes (material entities or processes)...", but a gene family is not really a "whole" of a material entity or process.

Perhaps a more fruitful perspective is to imagine that biolink:related_to_at_concept_level is still an appropriate space within which a suitable predicate should be found, and that one major aspect of the "concept level" space is set theoretic in nature. For example, biolink:narrow_match or biolink:subclass_of define subsets of a conceptual space based on specific attributes.

However, in set theory, one also has the parallel concept of set membership. Perhaps what is needed in the Biolink Model are simple predicates for set membership. In fact, RO already has mappings to such terms. They are:

member_of (RO:0002350)
has_member (RO:0002351)

These do have the unusual characteristic of spanning the conceptual and instance spaces, in that a set is conceptual but can obviously aggregate instances. That said, adding them as child predicates under biolink:related_to_at_concept_level could be helpful to the current use case of predicates appropriate to model biolink:Gene to biolink:GeneFamily and biolink:GeneFamily to biolink:GeneFamily relationships (although the latter relationship could still perhaps be modelled as a biolink:subclass_of relationship?)
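A small sketch of how the two kinds of edges might combine: a gene is a member of a subfamily, and the subfamily is related to its parent family by biolink:subclass_of, so the full set of families for a gene follows from walking the hierarchy. The function name, the dictionary shapes and all identifiers below are hypothetical.

```python
def all_families(gene_id, member_of, subclass_of):
    """Return the gene's direct families plus every ancestor family.

    member_of:   dict mapping gene_id to a list of family ids (membership edges)
    subclass_of: dict mapping family id to a list of parent family ids
    """
    seen = set()
    stack = list(member_of.get(gene_id, ()))
    while stack:
        family = stack.pop()
        if family in seen:
            continue  # already visited; also guards against cycles
        seen.add(family)
        stack.extend(subclass_of.get(family, ()))
    return seen


# Hypothetical data: gene1 belongs to a subfamily of family1.
member_of = {"EX:gene1": ["EX:subfamily1"]}
subclass_of = {"EX:subfamily1": ["EX:family1"]}
families = all_families("EX:gene1", member_of, subclass_of)
```

The traversal treats membership as inherited up the subfamily hierarchy, which is an assumption about the intended semantics rather than something the RO terms themselves guarantee.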

What working group (or team) did this request originate from?

The need for this change originates from the Monarch Initiative but is also likely needed for future iterations of the SRI Reference Graph (which is still essentially a derivative of the Monarch knowledge graph).

Jan 25 '22 22:01 RichardBruskiewich

relevant:

#691

Jan 25 '22 22:01 cmungall

@putmantime @RichardBruskiewich - is this a blocker for Monarch?

Jan 25 '22 22:01 sierra-moxon

I've just started working on the Gene Orthology ingest into Monarch (using Panther data), so it is kind of a blocker.

Jan 25 '22 22:01 RichardBruskiewich

Let's also discuss panther.node IDs here as well. These would be related transitively via evolutionary descent, and horizontally via homology relations. This is in line with the GO/Panther interpretation.

As for the relationship between genes, proteins, nodes, and families: member-of is a good name for this relation but RO has quite strict semantics here.

When choosing any term (class or relation) it's always a good idea to check the hierarchy:

http://purl.obolibrary.org/obo/RO_0002350

RO has a strict mereological view of membership: there isn't really a physical structure existing in space that is the collection of all present and ancestral SHH genes, for example.

There is an argument to be made for using subclass_of - this would be consistent with treating PRO family level terms as equivalent to panther families, and also works for relating to subfamilies as well.

Jan 29 '22 00:01 cmungall
