biolink-model
Modelling Gene to Gene Family relationships (are additional Biolink Association and predicates needed?)
Gene orthology knowledge curation (e.g. from the Panther database) relates gene instances to gene families. It's relatively easy to infer which concept nodes need to be captured, i.e. biolink:Gene and biolink:GeneFamily (to start).
However, the biolink:GeneFamily concept currently seems a bit disconnected from any biolink:Association class.
Do we need to define a new biolink:GeneToGeneFamilyAssociation or, alternatively, would we simply connect each biolink:Gene to its biolink:GeneFamily using the biolink:has_attribute slot? I guess it depends on the use cases and on the extent to which biolink:GeneFamily instances are annotated with links and related data.
As first class nodes, perhaps such annotation would be easily available in the knowledge graph. On the other hand, it may simply suffice to tag the biolink:Gene with the gene family identifier (e.g. PANTHER.FAMILY curie) then expect end users to access the related knowledge from outside of the graph (e.g. via a link in the UI?).
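To make the two alternatives concrete, here is a minimal sketch of each as plain Python dictionaries. Everything in it is an assumption for illustration: the identifiers are made-up placeholders, the `gene_family` property name is hypothetical, `biolink:related_to` only stands in for whatever predicate is eventually chosen, and `biolink:GeneToGeneFamilyAssociation` is the class proposed in this issue rather than an existing one.

```python
# All identifiers are made-up placeholders, not real HGNC or PANTHER records.
GENE_ID = "HGNC:0000001"
FAMILY_ID = "PANTHER.FAMILY:PTHR00001"

# Option 1: the family identifier is just a property on the gene node.
# The property name "gene_family" is hypothetical.
tagged_gene = {
    "id": GENE_ID,
    "category": "biolink:Gene",
    "gene_family": FAMILY_ID,
}

# Option 2: the family is a first class node, linked to the gene by an edge.
# The predicate is a stand-in; the association category is only proposed here.
PLACEHOLDER_PREDICATE = "biolink:related_to"
nodes = [
    {"id": GENE_ID, "category": "biolink:Gene"},
    {"id": FAMILY_ID, "category": "biolink:GeneFamily"},
]
edges = [
    {
        "subject": GENE_ID,
        "predicate": PLACEHOLDER_PREDICATE,
        "object": FAMILY_ID,
        "category": "biolink:GeneToGeneFamilyAssociation",
    }
]


def families_via_tag(gene):
    """Return the family identifiers recorded directly on a gene node."""
    family = gene.get("gene_family")
    return [family] if family else []


def families_via_edges(gene_id, nodes, edges):
    """Return the family nodes reachable from a gene over any edge."""
    family_ids = {n["id"] for n in nodes if n["category"] == "biolink:GeneFamily"}
    return [
        e["object"]
        for e in edges
        if e["subject"] == gene_id and e["object"] in family_ids
    ]
```

Both lookups return the same family identifier; the difference is that only Option 2 leaves a node in the graph on which family-level annotations could be attached.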
That said, reasoning over gene families likely involves transferring (GO term?) molecular function, biological process and cellular component inferences across species boundaries (e.g. from genes in model species to human genes). Graph reasoning engines (e.g. TRAPI wrapped?) might find this task easier if the biolink:Gene to biolink:GeneFamily relationship is modelled with first class knowledge graph nodes and edges (i.e. a biolink:Association). Also, biolink:GeneFamily instances may have hierarchical subclass (i.e. subfamily) relationships to one another; thus, biolink:GeneFamily to biolink:GeneFamily instances of biolink:Association may also be posited.
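As a rough illustration of why first class edges might help a reasoner, the following sketch proposes candidate GO annotations for a gene by walking gene to family edges. It is only a toy under stated assumptions: the `biolink:member_of` predicate name is a placeholder (the choice of predicate is an open question in this issue), all identifiers are invented, and real annotation transfer would need evidence and taxon constraints that are ignored here.

```python
from collections import defaultdict

# Placeholder predicate name; not confirmed to exist in the Biolink Model.
MEMBER_OF = "biolink:member_of"

# Made-up (subject, predicate, object) edges and annotations for illustration.
edges = [
    ("MGI:0000001", MEMBER_OF, "PANTHER.FAMILY:PTHR00001"),
    ("HGNC:0000001", MEMBER_OF, "PANTHER.FAMILY:PTHR00001"),
    ("HGNC:0000002", MEMBER_OF, "PANTHER.FAMILY:PTHR00002"),
]
annotations = {"MGI:0000001": {"GO:0000001"}}


def candidate_annotations(target_gene, edges, annotations):
    """Collect annotations of other genes that share a family with target_gene."""
    members = defaultdict(set)      # family -> member genes
    families_of = defaultdict(set)  # gene -> families
    for subject, predicate, obj in edges:
        if predicate == MEMBER_OF:
            members[obj].add(subject)
            families_of[subject].add(obj)

    inferred = set()
    for family in families_of[target_gene]:
        for other in members[family]:
            if other != target_gene:
                inferred |= annotations.get(other, set())
    return inferred
```

With the toy data, `candidate_annotations("HGNC:0000001", edges, annotations)` picks up the annotation from the mouse gene in the same family, while `HGNC:0000002` gets nothing because no annotated gene shares its family.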
One counterpoint argument is simply that such edges may needlessly(?) clutter up the graph somewhat with as many additional edges as there are genes. That said, some knowledge graphs may have use cases supported by such knowledge representations.
Assuming the latter situation, biolink:GeneFamily instances will be documented as first class concept nodes, and perhaps we would add a new biolink:GeneToGeneFamilyAssociation class to assert set membership of genes into such families.
The next question to arise is which biolink:predicate should be used in such associations?
The biolink:related_to predicate seems a bit too general.
A biolink:GeneFamily could be construed as a kind of conceptual grouping of genes. This suggests that these are associations anchored on biolink:related_to_at_concept_level, or perhaps one of its child predicates (biolink:narrow_match or biolink:subclass_of) could be applied, but these predicates don't seem totally appropriate either.
Within the biolink:related_to_at_instance_level predicate space, some terms could apply if their English language definition is loosely assumed, but the strict Biolink Model scoping of the definitions of most (all?) such terms seems to exclude them from consideration. For example, the definition of biolink:part_of says "...holds between parts and wholes (material entities or processes)...", but a gene family is not really a "whole" of a material entity or process.
Perhaps another, more fruitful perspective is to imagine that biolink:related_to_at_concept_level is still an appropriate space within which a suitable predicate should be found, and that one major aspect of the "concept level" space is set theoretic in nature. For example, biolink:narrow_match or biolink:subclass_of define subsets of a conceptual space based on specific attributes.
However, in set theory, one also has the parallel concept of set membership. Perhaps what is needed in the Biolink Model are simple predicates for set membership. In fact, RO already has mappings to such terms. They are:
- member_of (RO:0002350)
- has_member (RO:0002351)
These do have the unusual characteristic of spanning the conceptual and instance spaces, in that a set is conceptual but can obviously aggregate instances. That said, adding them as child predicates under biolink:related_to_at_concept_level could serve the current use case of modelling biolink:Gene to biolink:GeneFamily and biolink:GeneFamily to biolink:GeneFamily relationships (although the latter relationship could still perhaps be modelled as a biolink:subclass_of relationship?)
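Since RO defines member of (RO:0002350) and has member (RO:0002351) as inverses of each other, a graph would only need to store one direction and could derive the other. A minimal sketch, assuming the two predicates were added to Biolink under the hypothetical names `biolink:member_of` and `biolink:has_member`, with invented identifiers:

```python
# Hypothetical Biolink predicate names mirroring RO:0002350 / RO:0002351.
INVERSE = {
    "biolink:member_of": "biolink:has_member",
    "biolink:has_member": "biolink:member_of",
}


def with_inverses(edges):
    """Return edges plus any missing inverse edges, preserving input order."""
    seen = set(edges)
    result = list(edges)
    for subject, predicate, obj in edges:
        inverse = INVERSE.get(predicate)
        if inverse is None:
            continue  # predicate has no declared inverse
        flipped = (obj, inverse, subject)
        if flipped not in seen:
            seen.add(flipped)
            result.append(flipped)
    return result


# Made-up identifiers for illustration only.
membership = [("HGNC:0000001", "biolink:member_of", "PANTHER.FAMILY:PTHR00001")]
expanded = with_inverses(membership)
```

Here `expanded` holds the original edge followed by the derived `biolink:has_member` edge from the family back to the gene.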
What working group (or team) did this request originate from?
The need for this change originates from the Monarch Initiative but is also likely needed for future iterations of the SRI Reference Graph (which is still essentially a derivative of the Monarch knowledge graph).
relevant:
- #691
@putmantime @RichardBruskiewich - is this a blocker for Monarch?
I've just started working on the Gene Orthology ingest into Monarch (using Panther data), so it is kind of a blocker.
Let's also discuss panther.node IDs here as well. These would be related transitively via evolutionary descent, and horizontally via homology relations. This is in line with the GO/Panther interpretation.
As for the relationship between genes, proteins, nodes, and families: member-of is a good name for this relation but RO has quite strict semantics here.
When choosing any term (class or relation) it's always a good idea to check the hierarchy:
http://purl.obolibrary.org/obo/RO_0002350
RO has a strict mereological view of membership: there isn't really a physical structure existing in space that is the collection of all present and ancestral SHH genes, for example.
There is an argument to be made for using subclass_of - this would be consistent with treating PRO family level terms as equivalent to panther families, and also works for relating to subfamilies as well.
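One practical consequence of the subclass_of option is that gene to subfamily and subfamily to family links share a single predicate, so one transitive walk answers "which families does this gene fall under?". A small sketch under that assumption, with invented identifiers (including the subfamily CURIE shape):

```python
SUBCLASS_OF = "biolink:subclass_of"

# Made-up edges: gene -> subfamily -> family, all using the same predicate.
edges = [
    ("HGNC:0000001", SUBCLASS_OF, "PANTHER.FAMILY:PTHR00001:SF1"),
    ("PANTHER.FAMILY:PTHR00001:SF1", SUBCLASS_OF, "PANTHER.FAMILY:PTHR00001"),
]


def ancestors(node, edges):
    """Return every node reachable from `node` via subclass_of edges."""
    parents = {}
    for subject, predicate, obj in edges:
        if predicate == SUBCLASS_OF:
            parents.setdefault(subject, set()).add(obj)

    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:  # also guards against cycles
                found.add(parent)
                stack.append(parent)
    return found
```

Calling `ancestors("HGNC:0000001", edges)` returns both the subfamily and the family. With a separate membership predicate, the same query would need to combine one membership hop with a subclass closure.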