Extend enum model to allow open vs closed, extensional vs intensional

Open wdduncan opened this issue 4 years ago • 7 comments

From LinkML meeting on 02/26

For enums, I proposed a slot named defined_value that is a part of permissible_value. This would support data translation cases in which you can define only some of the values but don't want unlisted values to raise an exception. permissible_value would be used when you want to enforce the schema.

Related to biolink/linkml#37

cc @cmungall @hrshdhgd @sierra-moxon @deepakunni3 @hsolbrig

wdduncan avatar Feb 26 '21 22:02 wdduncan

Can you clarify this, @wdduncan? I'm not sure I understand.

cmungall avatar Oct 04 '21 21:10 cmungall

My reason for suggesting this was to allow for values that were not specified in the enum. Consider an enum like:

GenderEnum:
  permissible_values:
    male:
      meaning: http://xyz.org
    female:
      meaning: http://xyz.org

If you encounter a gender data field with the value "other", do you always want to throw an error? Or, in some cases, do you want to accept the data even though the value isn't specified in the enum? permissible_value is for the former case; defined_value would be for the latter.

wdduncan avatar Oct 12 '21 13:10 wdduncan

Proposal:

Add a slot to EnumDefinition with a name such as is_open

  • If the enum is closed, then values MUST belong to the set of permissible values
  • If the enum is open, then values SHOULD (MAY?) belong to the set of permissible values
  • If the enum is closed, and there are no permissible_values, then this is a schema error (detected when we convert to json schema, but we should have first-class checks)
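The three rules above can be sketched in a few lines of Python. This is only an illustration of the proposed semantics, not LinkML code: the `EnumDefinition` class here is a stand-in, `is_open` is the slot name suggested in this comment, and `Verdict` is a hypothetical result type. It reads "SHOULD" as "accept, but flag".

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    VALID = "valid"
    WARNING = "warning"  # open enum, value is not a permissible value
    ERROR = "error"      # closed enum, value is not a permissible value


@dataclass
class EnumDefinition:
    """Stand-in for an enum definition; not the LinkML metamodel class."""
    name: str
    permissible_values: set[str] = field(default_factory=set)
    is_open: bool = False  # hypothetical slot; closed is the default

    def check_schema(self) -> None:
        # Rule 3: a closed enum with no permissible values is a schema error.
        if not self.is_open and not self.permissible_values:
            raise ValueError(f"closed enum {self.name!r} has no permissible values")

    def validate(self, value: str) -> Verdict:
        if value in self.permissible_values:
            return Verdict.VALID
        # Rule 1 (closed: MUST) versus rule 2 (open: SHOULD).
        return Verdict.WARNING if self.is_open else Verdict.ERROR
```

With `permissible_values={"male", "female"}`, the value "other" is an error when the enum is closed and only a warning when `is_open=True`.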

We should default to closed enums as that is the current semantics (unless codeset is specified - see below)

This satisfies the use case above. A schema designer can include mappings in an enum, e.g. for M and F, which map to IRIs, but data providers MAY provide data with other values that do not map, and this is still valid. If you use a string from the enum, you are committing to that meaning.

Note that for open enums, the mapping to json-schema is simply to use a string rather than json-schema enum.
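A minimal sketch of that mapping, assuming the hypothetical `is_open` flag from the proposal. The function `enum_to_json_schema` is illustrative and is not part of any LinkML generator.

```python
def enum_to_json_schema(permissible_values, is_open=False):
    """Return a JSON Schema fragment (as a dict) for an enum-typed field."""
    if is_open:
        # Open enum: any string is acceptable, so no "enum" keyword is emitted.
        return {"type": "string"}
    if not permissible_values:
        # JSON Schema requires a non-empty "enum" array, so fail early.
        raise ValueError("closed enum needs at least one permissible value")
    # Closed enum: restrict the string to the listed values.
    return {"type": "string", "enum": sorted(permissible_values)}
```

For example, `enum_to_json_schema({"male", "female"})` yields a string type restricted to the two values, while passing `is_open=True` yields a plain string type.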

TBD: In future we will support dynamically obtaining the PVs from an externally defined codeset, e.g. via an API or terminology service. If a codeset is provided, then when generating json-schema, either this should be treated out of band and modeled as a string, OR we can imagine querying the service to obtain the json-schema enum at generation time.

cmungall avatar Oct 15 '21 20:10 cmungall

Alternate proposal: we create two separate enumeration classes in LinkML based on extensional and intensional definitions:

  • ExtensionalEnumeration: needs to explicitly list every permissible value.
  • IntensionalEnumeration: needs to provide some sort of (human or machine) readable definition for what values are permitted, with optional defined value entries to clarify meaning.
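To make the distinction concrete, here is a rough Python sketch of the two kinds of enumeration. The class names follow the proposal; everything else (the `permits` and `meaning` methods, the `rule` callable, the `EX:` prefix) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ExtensionalEnumeration:
    """Defined by listing every permissible value explicitly."""
    permissible_values: frozenset[str]

    def permits(self, value: str) -> bool:
        return value in self.permissible_values


@dataclass
class IntensionalEnumeration:
    """Defined by a rule; defined_values only clarify meaning."""
    description: str                # human-readable definition
    rule: Callable[[str], bool]     # machine-readable definition (hypothetical)
    defined_values: dict[str, str] = field(default_factory=dict)

    def permits(self, value: str) -> bool:
        # Membership is decided by the rule, not by the defined values.
        return self.rule(value)

    def meaning(self, value: str) -> Optional[str]:
        return self.defined_values.get(value)


# Usage: any identifier with the made-up "EX:" prefix is permitted.
any_ex_term = IntensionalEnumeration(
    description="any term with the EX: prefix",
    rule=lambda v: v.startswith("EX:"),
    defined_values={"EX:1": "http://xyz.org"},
)
```

Here `any_ex_term.permits("EX:2")` is true even though "EX:2" has no defined value entry, which is the behavior the intensional case is meant to allow.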

gaurav avatar Oct 15 '21 21:10 gaurav

Need to define a standard way of defining an intensional Enum. This could involve things such as:

  • all descendants via is-a of term X in ontology Y (see #274)
  • all terms in subset S of ontology O from version V
  • all ancestors of terms X in ontology Y
  • Would it be too difficult to specify multiple terms (X1, ..., XN)?
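As a rough illustration of the first and third bullets, and of why multiple starting terms need not be hard, here is a sketch over a toy in-memory is-a graph. The `EX:` terms and the `IS_A` dictionary are made up; a real implementation would query an ontology or terminology service, and subsets and versions are not handled here.

```python
from collections import deque

# Toy is-a graph, child -> list of parents (hypothetical terms).
IS_A = {
    "EX:cell": [],
    "EX:neuron": ["EX:cell"],
    "EX:motor_neuron": ["EX:neuron"],
    "EX:sensory_neuron": ["EX:neuron"],
}


def ancestors(terms, is_a):
    """All terms reachable by following is_a edges from any starting term."""
    seen = set()
    queue = deque(terms)
    while queue:
        term = queue.popleft()
        for parent in is_a.get(term, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def descendants(terms, is_a):
    """All terms that reach any starting term via is_a edges."""
    # Invert the graph (parent -> children) and reuse the same traversal.
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    return ancestors(terms, children)
```

Both functions take a list of starting terms, so specifying (X1, ..., XN) is just a matter of seeding the traversal with more than one term. The starting terms themselves are excluded unless reached from another starting term.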

cmungall avatar Oct 22 '21 20:10 cmungall

@cmungall Just wanted to give you more information based on what you were asking me about TCCM (Terminology Core Common Model).

I dc'ed for about 10 minutes there so I didn't hear the initial discussion, but I think you were asking if TCCM had any "enum class" that would help with this situation. Again, unfortunately I never got onboarded to TCCM by Harold or Dazhi, and there isn't any proper documentation, just some autogenerated documentation that doesn't really contain any comments or descriptions.

That being said, here's a link to the model definition. It's pretty light-weight: https://github.com/HOT-Ecosystem/tccm-model/blob/main/tccm_model/model/schema/tccm_model.yaml

It doesn't really say anything about "enums". It defines other classes/slots, and they have the property of being multivalued, e.g.:

slots:
  code:
    range: string
    required: true
    description: |-
      The official code of this entry
    slot_uri: skos:notation

  designation:
    description: |-
      The preferred label or text in the context of a particular community or language
    notes:
      - Designation should never be used as an identifier.  They are strictly informative
    range: string
    slot_uri: skos:prefLabel
    multivalued: false   # <---- closest thing in the TCCM model related to "enums"
    required: false

I looked through the codebase otherwise, and there's really nothing in the codebase about "enums" either. Just some imports from linkml.runtime that aren't actually being used:

[Screenshot: Screen Shot 2021-10-22 at 5 03 42 PM]

joeflack4 avatar Oct 22 '21 21:10 joeflack4

There has been a lot of progress on this.

Does the issue need to remain open?

Who should be the assignee?

turbomam avatar May 17 '24 20:05 turbomam