Extend enum model to allow open vs closed, extensional vs intensional
From LinkML meeting on 02/26
For enums, I proposed a slot named defined_value that is a part of permissible_value. This would allow for data translation cases in which you can define only some of the values but don't want to generate an exception for the rest. permissible_value would be used when you want to enforce the schema.
Related to biolink/linkml#37
cc @cmungall @hrshdhgd @sierra-moxon @deepakunni3 @hsolbrig
can you clarify this @wdduncan not sure I understand
My reason for suggesting this was to allow for values that were not specified in the enum. Consider an enum like:
GenderEnum:
  permissible_values:
    male:
      meaning: http://xyz.org
    female:
      meaning: http://xyz.org
If you encounter a case in which a gender data field has the value "other", do you always want to throw an error? Or, in some cases, do you want to accept the data even though the value is not specified in the enum? permissible_value is for the former case; defined_value would be for the latter.
Proposal:
Add a slot to EnumDefinition with a name such as is_open
- If the enum is closed, then values MUST belong to the set of permissible values
- If the enum is open, then values SHOULD (MAY?) belong to the set of permissible values
- If the enum is closed, and there are no permissible_values, then this is a schema error (detected when we convert to json schema, but we should have first-class checks)
We should default to closed enums as that is the current semantics (unless codeset is specified - see below)
This satisfies the use case above. A schema designer can include mappings in an enum, e.g. for M and F, and these map to IRIs, but data providers MAY provide data with other values that do not map, and this is still valid. If you use a string from the enum, then you are committing to that meaning.
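To make the open/closed rules above concrete, here is a minimal sketch in Python. It is not the LinkML implementation: EnumSketch, check_schema, and validate_value are hypothetical names invented for illustration, and is_open is only the slot name suggested in this proposal. For open enums it takes the MAY reading, so unlisted values are accepted without a warning.

```python
from dataclasses import dataclass, field


@dataclass
class EnumSketch:
    """Hypothetical stand-in for EnumDefinition (not the LinkML metamodel class)."""
    name: str
    # Maps permissible value text -> meaning IRI (or None if unmapped).
    permissible_values: dict = field(default_factory=dict)
    # Proposed slot; closed is the default, matching current semantics.
    is_open: bool = False


def check_schema(enum: EnumSketch) -> None:
    """Raise if the enum is closed and lists no permissible values."""
    if not enum.is_open and not enum.permissible_values:
        raise ValueError(f"closed enum {enum.name!r} has no permissible values")


def validate_value(enum: EnumSketch, value: str):
    """Return (is_valid, meaning).

    A listed value is always valid and carries its declared meaning.
    An unlisted value is valid only if the enum is open, and has no meaning.
    """
    if value in enum.permissible_values:
        return True, enum.permissible_values[value]
    return enum.is_open, None


pvs = {"male": "http://xyz.org", "female": "http://xyz.org"}
closed = EnumSketch("GenderEnum", pvs)
open_ = EnumSketch("GenderEnum", pvs, is_open=True)

print(validate_value(closed, "other"))  # (False, None)
print(validate_value(open_, "other"))   # (True, None)
print(validate_value(open_, "male"))    # (True, 'http://xyz.org')
```

If the SHOULD reading is preferred, validate_value could return a warning flag for unlisted values instead of silently accepting them.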
Note that for open enums, the mapping to json-schema is simply to use a string rather than json-schema enum.
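A rough sketch of that mapping, assuming a hypothetical helper enum_to_json_schema (this is not the actual LinkML JSON Schema generator): an open enum becomes a plain string, a closed enum becomes a JSON Schema enum over the listed values, and a closed enum with nothing listed is rejected as a schema error.

```python
def enum_to_json_schema(permissible_values, is_open=False):
    """Return a JSON Schema fragment (as a dict) for an enum.

    permissible_values: iterable of permissible value strings.
    is_open: proposed flag; closed (False) is the default.
    """
    if is_open:
        # Any string is acceptable, so the listed values cannot be enforced here.
        return {"type": "string"}
    values = list(permissible_values)
    if not values:
        raise ValueError("closed enum with no permissible values")
    return {"type": "string", "enum": values}


print(enum_to_json_schema(["male", "female"]))
# {'type': 'string', 'enum': ['male', 'female']}
print(enum_to_json_schema(["male", "female"], is_open=True))
# {'type': 'string'}
```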
TBD: In future we will support dynamically obtaining the PVs from an externally defined codeset, e.g. via an API or terminology service. If a codeset is provided, then when generating json-schema, either this should be treated out of band and modeled as a string, OR we can imagine querying the service to obtain the json-schema enum at generation time.
Alternate proposal: we create two separate enumeration classes in LinkML based on extensional and intensional definitions:
- ExtensionalEnumeration: needs to explicitly list every permissible value.
- IntensionalEnumeration: needs to provide some sort of (human or machine) readable definition for what values are permitted, with optional defined_value entries to clarify meaning.
Need to define a standard way of defining an intensional Enum. This could involve things such as:
- all descendants via is-a of term X in ontology Y (see #274)
- all terms in subset S of ontology O from version V
- all ancestors of terms X in ontology Y
- Would it be too difficult to specify multiple terms (X1, ..., XN)?
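As a sketch of the first pattern, here is one way "all descendants via is-a of term X" could be expanded into an explicit value set, written so that it also accepts several starting terms (X1, ..., XN). Everything here is an assumption for illustration: the descendants function, the child-to-parents dictionary, and the EX: identifiers are made up, and a real implementation would query an ontology or terminology service rather than an in-memory dict.

```python
from collections import deque


def descendants(is_a, roots):
    """Return every term reachable from any root by following is-a downward.

    is_a maps each child term to a list of its direct parents.
    A root is included only if it is itself a descendant of another root.
    """
    # Invert child -> parents into parent -> children for downward traversal.
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)

    seen = set()
    queue = deque(roots)
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy hierarchy with made-up identifiers.
toy_is_a = {
    "EX:B": ["EX:A"],
    "EX:C": ["EX:B"],
    "EX:D": ["EX:A"],
    "EX:E": ["EX:Z"],
}

print(sorted(descendants(toy_is_a, ["EX:A"])))          # ['EX:B', 'EX:C', 'EX:D']
print(sorted(descendants(toy_is_a, ["EX:B", "EX:Z"])))  # ['EX:C', 'EX:E']
```

In this toy form, multiple starting terms cost nothing extra because the traversal just seeds the queue with all of them. Whether the starting terms themselves count as permissible values is a design choice the sketch leaves open.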
@cmungall Just wanted to give you more information based on what you were asking me about TCCM (Terminology Core Common Model).
I dc'ed for about 10 minutes there so I didn't hear the initial discussion, but I think you were asking if TCCM had any "enum class" that would help with this situation. Again, unfortunately I never got onboarded to TCCM by Harold or Dazhi, and there isn't any proper documentation, just some autogenerated documentation that doesn't really contain any comments or descriptions.
That being said, here's a link to the model definition. It's pretty light-weight: https://github.com/HOT-Ecosystem/tccm-model/blob/main/tccm_model/model/schema/tccm_model.yaml
It doesn't really say anything about "enums". It defines other classes/slots, and they have the property of being multivalued, e.g.:
slots:
  code:
    range: string
    required: true
    description: |-
      The official code of this entry
    slot_uri: skos:notation
  designation:
    description: |-
      The preferred label or text in the context of a particular community or language
    notes:
      - Designation should never be used as an identifier. They are strictly informative
    range: string
    slot_uri: skos:prefLabel
    multivalued: false  # <---- closest thing in the TCCM model related to "enums"
    required: false
I looked through the codebase otherwise, and there's really nothing in the codebase about "enums" either. Just some imports from linkml.runtime that aren't actually being used:
There has been a lot of progress on this.
Does the issue need to remain open?
Who should be the assignee?