Modelling sets in Biolink

Open RichardBruskiewich opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe.

The best practices for modelling sets of things in Biolink remain somewhat undefined or, at best, heterogeneously defined.

What working group (or team) did this request originate from?

The Chemical Working Group (under the leadership of Vlado Dancik) discussed the situation in their call of February 22nd, 2022. Extant Biolink Model issues that reflect facets of this 'set modelling' issue include the following (@vdancik, please add some more...):

  • Issue #385
  • Issue #944
  • Issue #314

Describe the solution you'd like

A unified Biolink Model and best practices approach may help here on several levels:

  • Biolink Model strategy: do we model sets as special model syntax, mixins, or separate categories (see next bullet point)?
  • Biolink Model (node concept) categories: would a parent category mathematical set with 'major domain' specialist child categories (e.g. GeneFamily, ChemicalFamily, DiseaseFamily, etc.) be helpful for knowledge representation and queries (or not)?
  • Biolink Model predicates: Issue #944 suggests adding has member and member of predicates to the related to at concept level predicate sub-hierarchy

Tag relevant members for discussion

@vdancik @sierra-moxon @cmungall @cbizon

Example of a possible model:

mathematical set:
  mixin: true

# a bit of a stretch of Biolink Model subclassing semantics here...
# alongside the mixin, convention constrains instances of gene family to be sets that contain only genes,
# and requires that every member of the gene set satisfies the truth value of statements using the set as a subject or object?
gene set:
  is_a: gene
  mixins:
    - mathematical set

gene family:
  is_a: gene set
  description: gene set defined by gene homology

chemical set:
  is_a: chemical entity
  mixins:
    - mathematical set
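
To make the subclassing 'stretch' noted in the comments above concrete, here is a minimal Python sketch of the mixin approach. The class names mirror the YAML; the `member_type` attribute, the `add` method, the `members` store and the `EXAMPLE:` identifiers are assumptions introduced purely for illustration and are not part of Biolink or LinkML.

```python
class Gene:
    """Stand-in for the Biolink `gene` category."""

    def __init__(self, identifier: str) -> None:
        self.identifier = identifier


class MathematicalSet:
    """Mixin marking a node as a collection (hypothetical helper)."""

    # Convention only: the class that every member must be an instance of.
    member_type: type = object

    def add(self, item: object) -> None:
        if not isinstance(item, self.member_type):
            raise TypeError(
                f"{type(self).__name__} may only contain {self.member_type.__name__}"
            )
        # Create the member store lazily so the mixin needs no __init__.
        if not hasattr(self, "members"):
            self.members = set()
        self.members.add(item)


class GeneSet(Gene, MathematicalSet):
    """Mirrors `gene set`: is_a gene, with the mathematical set mixin."""

    member_type = Gene


class GeneFamily(GeneSet):
    """Gene set defined by gene homology."""


family = GeneFamily("EXAMPLE:family1")  # hypothetical identifier
family.add(Gene("EXAMPLE:gene1"))       # hypothetical identifier

# The stretch: because `gene set` is_a `gene`, the family itself passes as a gene,
# so another gene set is accepted as a "gene" member.
family_is_gene = isinstance(family, Gene)
family.add(GeneSet("EXAMPLE:set2"))
```

Under this sketch, any query for genes would also return gene sets and gene families, which is the kind of side effect the convention would have to manage.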

An alternate modelling approach might make mathematical set a first-class node category, and simply constrain the semantics of set members by associations, namely:

mathematical set:
  is_a: named thing

gene to gene family association:
  is_a: association
  defining_slots:
    - subject
    - predicate
    - object
  slot_usage:
    subject:
      range: gene
    predicate:
      subproperty_of: member of
    object:
      range: mathematical set
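
As a rough illustration of what the association above would enforce, the following Python sketch checks an edge against the three slot constraints. The `Node` and `Edge` types, the `SUBPROPERTY_OF` table and its `member of gene family` entry are hypothetical names made up for this example; only `gene`, `member of` and `mathematical set` come from the YAML.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    identifier: str
    category: str  # e.g. "gene" or "mathematical set"


@dataclass(frozen=True)
class Edge:
    subject: Node
    predicate: str
    object: Node


# Hypothetical predicate hierarchy (child -> parent), for illustration only.
SUBPROPERTY_OF = {"member of gene family": "member of"}


def is_subproperty(predicate: Optional[str], ancestor: str) -> bool:
    """Return True if `predicate` equals `ancestor` or descends from it."""
    while predicate is not None:
        if predicate == ancestor:
            return True
        predicate = SUBPROPERTY_OF.get(predicate)
    return False


def is_gene_to_gene_family_association(edge: Edge) -> bool:
    """Check subject range, predicate sub-property and object range.

    Categories are compared by exact match; a real validator would also
    accept subclasses of each range.
    """
    return (
        edge.subject.category == "gene"
        and is_subproperty(edge.predicate, "member of")
        and edge.object.category == "mathematical set"
    )


gene = Node("EXAMPLE:gene1", "gene")                # hypothetical identifier
family = Node("EXAMPLE:family1", "mathematical set")  # hypothetical identifier

forward_ok = is_gene_to_gene_family_association(Edge(gene, "member of", family))
reverse_ok = is_gene_to_gene_family_association(Edge(family, "member of", gene))
```

Note that in this approach nothing in the object range says the set contains only genes; that constraint lives entirely in the association.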

A third 'out-of-the-box' idea from Vlado:

Borrow from Java and use [ ] after a Biolink class, e.g. biolink.MolecularEntity[], to indicate that a node represents a set. This way we don't need to change the model, just add a little convention.

Q: how and where would this be used?

Would it be a Biolink Model mode markup? e.g.

Gene - member_of -> Gene[]

that is, use range: gene[] in the gene to gene family association above instead of range: mathematical set

or would it be a Biolink Model mode markup? i.e.

gene family:
  is_a: gene[]
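
Whichever of the two usages is chosen, tooling would need to recognise the suffix. A minimal Python sketch of that step, assuming the convention is a literal trailing `[]` on the category string (the function name `parse_category` is made up for this example):

```python
def parse_category(category: str) -> tuple[str, bool]:
    """Split a category string into (base category, is_set).

    A trailing "[]" marks the node as a set of the base category.
    """
    if category.endswith("[]"):
        return category[: -len("[]")], True
    return category, False


set_category = parse_category("biolink.MolecularEntity[]")
plain_category = parse_category("Gene")
```

Here `set_category` is `("biolink.MolecularEntity", True)` and `plain_category` is `("Gene", False)`. The sketch only covers parsing; it says nothing about how validators or the model itself would treat the set flag.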

One last thought is whether or not this is something to be generically handled at the LinkML level?

Part of the reasoning here is that a class hierarchy seems to be a more specific, constrained notion of a mathematical set. Mathematical sets are agnostic about the criteria for set membership: an instance is simply 'in' or 'out' of a given collection, and is unique within that collection. Classes, by contrast, collect instances based on specific instance attributes, and the class hierarchy divides instances into child subclasses based on general-to-specific member attributes.

RichardBruskiewich avatar Feb 22 '22 22:02 RichardBruskiewich

I don't think biolink should have a mathematical set class.

Much of the relevant discussion is in the issue you linked:

  • #943

To summarize briefly: there is a fundamental decision to be made, and it needs agreement between the "ontology school" and the "database school":

  1. to the ontology school, L-cysteine, cysteine, alpha amino acid, amino acid, etc. are all molecule classes, just at different levels of specificity
  2. to the database school, L-cysteine and possibly cysteine are "entities" and the others are groupings

The same pattern applies to various other things like genes and gene families, and we have our ubiquitous "eukaryotic protein" example.

In neither case do we need a "mathematical set" class. However, if we implement 2, then we may want a class that is something like "grouping class". But the advocates of this scheme need to think this through and work through exactly how this would work. Would the grouping class hierarchy parallel the entity hierarchy? Would we make conflation classes to allow either e.g. molecule or molecule grouping class to be used as domain/range for some relations?
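
The difference between the two schools can be sketched in Python over the same four terms. This is only an illustration: the dictionaries and function names are made up, and treating cysteine as an entity is an assumption (the text above says only "possibly").

```python
# Ontology school: every term is a class; specificity is a subclass chain.
SUBCLASS_OF = {
    "L-cysteine": "cysteine",
    "cysteine": "alpha amino acid",
    "alpha amino acid": "amino acid",
}

# Database school: the same terms, but each is typed as an entity or a grouping.
# Labelling "cysteine" as an entity is an assumption made for this sketch.
KIND = {
    "L-cysteine": "entity",
    "cysteine": "entity",
    "alpha amino acid": "grouping class",
    "amino acid": "grouping class",
}


def ancestors(term: str) -> list[str]:
    """Ontology view: all more general classes, nearest first."""
    result = []
    while term in SUBCLASS_OF:
        term = SUBCLASS_OF[term]
        result.append(term)
    return result


def groupings(term: str) -> list[str]:
    """Database view: only the ancestors that are typed as grouping classes."""
    return [t for t in ancestors(term) if KIND[t] == "grouping class"]


ontology_view = ancestors("L-cysteine")
database_view = groupings("L-cysteine")
```

In the ontology view all three ancestors are molecule classes of the same kind, whereas the database view needs the extra `KIND` typing, which is where questions such as a parallel grouping hierarchy arise.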

In fact I have done exactly this with chemrof:

https://chemkg.github.io/chemrof/

There is a lot of material here, but it is well thought out and the database approach has been implemented rigorously. Happy to walk through this on a call.

cc @mikebada @mbrush @sierra-moxon @cbizon @balhoff

cmungall avatar Feb 22 '22 23:02 cmungall

#1099

sierra-moxon avatar Oct 03 '22 17:10 sierra-moxon

See also #385

nlharris avatar Feb 20 '24 22:02 nlharris

Is there still a need for this now that we can model sets in TRAPI?

cmungall avatar Feb 22 '24 11:02 cmungall