Null value terminology
Previously we have been encouraging the use of the INSDC set of null terms when required:
| term | definition |
|---|---|
| not applicable | information is inappropriate to report, can indicate that the standard itself fails to model or represent the information appropriately |
| missing-not collected | information of an expected format was not given because it has not been collected |
| missing- not provided | information of an expected format was not given, a value may be given at the later stage |
| missing - restricted access | information exists but can not be released openly because of privacy concerns |
These have worked OK when used, so there may not be any need to change things, but I wanted to raise this option for discussion. Could/should we be using the HL7 set of "Null Flavors" instead, as these are more comprehensive than the INSDC set? https://terminology.hl7.org/3.0.0/ValueSet-v3-NullFlavor.html
Do we want the same value set for each field?
Do we always allow missing values? E.g. if a field is required, do we allow missing value enums?
Hello @only1chunts - the null standards were devised with the GSC in the CIG working group and we agreed to adopt and promote their use with the INSDC. If the HL7 values are in wide use/useful to our community, I would suggest we look to see how they map to the INSDC terms, and add them as synonyms where possible.
Cheers, Lynn
@lschriml, I'm good with just using the set of 4 simple terms that we already encourage; I'm just raising the possibility that there are other options out there that are more comprehensive in their coverage. @cmungall, where null values are permissible I think we should be expecting the same set of 4 options as listed above. As for which terms/slots allow null values... that's probably a bigger discussion! Does a valid null value fulfill the mandatory requirement? In some cases maybe it does. Where we specify a CV/enum, we should probably start checking which of the null values are permissible and including those in the list.
What are the consequences of making a term mandatory, specifying a non-string Value syntax like {float}, and then allowing these stringy INSDC null terms?
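One way to see the consequence concretely is a minimal Python sketch, assuming a `{float}` slot whose raw values arrive as strings. The function name `parse_float_slot` and the sample values are hypothetical and only illustrate the question above; this is not how any GSC or INSDC tool parses data.

```python
# The four INSDC null terms from the table above.
INSDC_NULL_TERMS = {"not applicable", "not collected", "not provided", "restricted access"}


def parse_float_slot(raw: str):
    """Parse a {float} slot that also tolerates the stringy null terms (hypothetical helper)."""
    if raw in INSDC_NULL_TERMS:
        return raw  # stays a str, so the slot is no longer purely numeric
    return float(raw)


values = ["12.5", "not collected", "7"]
parsed = [parse_float_slot(v) for v in values]

# Every consumer now has to filter by type before doing arithmetic.
numeric_only = [v for v in parsed if isinstance(v, float)]
mean_value = sum(numeric_only) / len(numeric_only)
```

In this sketch the column ends up holding both `float` and `str`, so a plain `sum(parsed)` raises `TypeError` and each downstream tool needs its own filtering step.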
INSDC are now moving to a more extensive set of missing value terms to provide more detail on the reason the value is missing: https://www.insdc.org/submitting-standards/missing-value-reporting/
This is a bit off. You can't validly add these values to fields that have types like number or boolean. It's more normal to leave missing values missing or add NA, then include a separate property where explanations of the missingness can be included.
Otherwise we're creating the need for users to write tools to understand a bespoke vocabulary (under unclear governance), and that's not very good practice for a standards organisation.
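The "separate property" pattern described above could look something like the following sketch. The class name `Measurement` and the field name `missing_reason` are assumptions made up for illustration; neither comes from MIxS, INSDC, or any existing schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurement:
    """A numeric value plus an optional, separate explanation of why it is absent."""

    value: Optional[float] = None          # stays numeric (or empty) at all times
    missing_reason: Optional[str] = None   # hypothetical property for the explanation

    def __post_init__(self):
        # A present value should not also claim to be missing.
        if self.value is not None and self.missing_reason is not None:
            raise ValueError("a present value must not carry a missing_reason")


depth = Measurement(value=12.5)
temperature = Measurement(missing_reason="not collected")

# Numeric code only ever sees floats; the reasons live in their own column.
present_values = [m.value for m in (depth, temperature) if m.value is not None]
```

With this layout the typed field never receives a string, and the explanation vocabulary can change without touching the numeric data.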
The latest list of missing values is below. The INSDC page was not clear, e.g. “missing:control sample” or “not applicable:control sample”.
AI: TWG: add the terms to the YAML
AI: Chris: add details to the GenSC.org website
not collected|not provided|restricted access|missing: control sample|missing: sample group|missing: synthetic construct|missing: lab stock|missing: third party data|missing: data agreement established pre-2023|missing: endangered species|missing: human-identifiable
I have updated the gensc.org page to include the INSDC missing value terms. Note - I did not include the option "missing: data agreement established pre-2023" as I feel that is very INSDC specific as the agreement it refers to is the one agreed by the INSDC partners.
Agree Chris, that "missing: data agreement established pre-2023" is very INSDC specific.
See https://www.gensc.org/pages/standards/checklists.html#:~:text=Missing%20value%20reporting,remove%20mandatory%20terms!).
The terms to add need to be reviewed, as some don't exactly match INSDC - https://www.insdc.org/submitting-standards/missing-value-reporting/ We need to check with @Woolly-at-EBI about the terms, as the INSDC documentation is inconsistent: "not applicable - control sample" vs "missing: control sample", as shown in the examples below.
- missing: control sample - Information is not applicable as the sample represents a negative control sample collected in a lab.
- missing: sample group - Information is not applicable as the sample represents a group of samples that do not have a single origin. E.g. for co-assembly or transcriptome assembly.
- Not collected - information of an expected format was not given because it has not been collected
- Not provided - information of an expected format was not given, a value may be given at the later stage
- Restricted access - information exists but can not be released openly because of privacy concerns
- missing: synthetic construct - Information does not exist as the sample represents an ab-initio synthetic construct.
- missing: lab stock - Information was not collected as the sample represents a cultured cell line or model organism under long-term lab control.
- missing: third party data - Information does not exist as the metadata was not collected or reported in records predating the 2023 agreement. For use in Third Party data submissions.
- missing: endangered species - Information can not be reported as the target organism is endangered e.g. on the IUCN red-list.
- missing: human-identifiable - Information can not be reported as the metadata would make the sample human-identifiable.
Adjustments: remove capitalization and change ":" to something else, as ":" is not friendly to all software.
The INSDC web page URL is the correct one. https://www.insdc.org/submitting-standards/missing-value-reporting/
These are the exact expected values, if missing values are used in INSDC where mandatory values can not be provided:
- not applicable
- not collected
- not provided
- restricted access
- missing: control sample
- missing: sample group
- missing: synthetic construct
- missing: lab stock
- missing: third party data
- missing: data agreement established pre-2023
- missing: endangered species
- missing: human-identifiable
In ENA this is the relevant regex we are using where a mandatory field allows missing values:
`|(^not applicable$)|(^not collected$)|(^not provided$)|(^restricted access$)|(^missing: control sample$)|(^missing: sample group$)|(^missing: synthetic construct$)|(^missing: lab stock$)|(^missing: third party data$)|(^missing: data agreement established pre-2023$)|(^missing: endangered species$)|(^missing: human-identifiable$)`
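For anyone wanting to test values locally, here is a minimal Python sketch that rebuilds an equivalent pattern from the twelve terms. It is not ENA's code: the names `INSDC_MISSING_VALUES` and `is_insdc_missing_value` are hypothetical, and the leading `|` of the quoted pattern is left out on the assumption that it joins onto a field-specific pattern.

```python
import re

# The twelve exact values listed in this thread.
INSDC_MISSING_VALUES = [
    "not applicable",
    "not collected",
    "not provided",
    "restricted access",
    "missing: control sample",
    "missing: sample group",
    "missing: synthetic construct",
    "missing: lab stock",
    "missing: third party data",
    "missing: data agreement established pre-2023",
    "missing: endangered species",
    "missing: human-identifiable",
]

# Same shape as the quoted regex: one anchored group per term.
MISSING_VALUE_RE = re.compile(
    "|".join(f"(^{re.escape(term)}$)" for term in INSDC_MISSING_VALUES)
)


def is_insdc_missing_value(value: str) -> bool:
    """True only for an exact, case-sensitive match of one of the terms."""
    return MISSING_VALUE_RE.fullmatch(value) is not None
```

Under this sketch, `is_insdc_missing_value("missing: lab stock")` is true, while `"Not collected"`, bare `"missing"`, and bare `"control sample"` are all rejected.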
There is no capitalisation for any of these missing values. I doubt that removing a ":" will be amenable to INSDC in the short term, as this got pushed out as a standard last year, so it would need to be mapped before submission to INSDC repos. ":" is not as difficult for software to handle as things like backslashes, but yes, there may still be some problems.
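If a colon-free spelling were ever adopted, the mapping step before submission could be sketched as below. The colon-free form used here (": " replaced by " - ") is purely hypothetical and has not been proposed by anyone in this thread; `colon_free` and `to_insdc` are likewise made-up names.

```python
INSDC_MISSING_VALUES = (
    "not applicable",
    "not collected",
    "not provided",
    "restricted access",
    "missing: control sample",
    "missing: sample group",
    "missing: synthetic construct",
    "missing: lab stock",
    "missing: third party data",
    "missing: data agreement established pre-2023",
    "missing: endangered species",
    "missing: human-identifiable",
)


def colon_free(term: str) -> str:
    """Hypothetical internal spelling: replace ': ' with ' - '."""
    return term.replace(": ", " - ")


# Lookup from the hypothetical colon-free spelling back to the exact INSDC value.
TO_INSDC = {colon_free(term): term for term in INSDC_MISSING_VALUES}


def to_insdc(value: str) -> str:
    """Normalise case and whitespace, then return the exact INSDC value."""
    key = " ".join(value.split()).lower()
    if key in INSDC_MISSING_VALUES:
        return key
    if key in TO_INSDC:
        return TO_INSDC[key]
    raise ValueError(f"not a recognised missing value: {value!r}")
```

For example, `to_insdc("Missing - Lab Stock")` gives `"missing: lab stock"` and `to_insdc("Not collected")` gives `"not collected"`, so capitalised or colon-free local spellings could still be submitted in the exact form INSDC expects.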
Yes, it is a bit inconsistent with "missing:" being prefixed sometimes. There is this sentence of guidance on the web page:
> When reporting a missing mandatory field, the eight granular ‘reporting level’ terms need to be preceded with the term ‘missing: ’ to declare both the absence of a true value as well as the reason.
goals:
- traceability
- consistency, or explanations for inconsistency
e.g.
- why is bare 'not applicable' acceptable but bare 'missing' isn't?
- why is 'missing: control sample' acceptable, when 'control sample' has the "top level" parent 'not applicable'
@pbuttigieg, @mslarae13 and I are curious about having clear GSC guidance on missing data annotations. If INSDC can't provide crystal clear guidance, we can still provide mappings to their namespace
Also, including textual missing data indicators in numeric fields leads to poor computability.
We should have a pattern for indicating missing data
I can report the relative abundance of the verbatim INSDC missing value indicators, but it would be harder to anticipate all of the ways submitters have modified the INSDC codes
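A report of that kind could be computed along these lines. This is only a sketch: the function name `missing_value_abundance` and the sample column are hypothetical, and it deliberately counts verbatim matches only, so modified spellings such as "Not Collected" fall outside it, which is exactly the limitation noted above.

```python
from collections import Counter

INSDC_MISSING_VALUES = frozenset([
    "not applicable",
    "not collected",
    "not provided",
    "restricted access",
    "missing: control sample",
    "missing: sample group",
    "missing: synthetic construct",
    "missing: lab stock",
    "missing: third party data",
    "missing: data agreement established pre-2023",
    "missing: endangered species",
    "missing: human-identifiable",
])


def missing_value_abundance(values):
    """Fraction of all values that verbatim match each INSDC missing value term."""
    if not values:
        return {}
    counts = Counter(v for v in values if v in INSDC_MISSING_VALUES)
    total = len(values)
    return {term: n / total for term, n in counts.items()}


# Hypothetical column of submitted values.
column = ["not collected", "12.5", "not collected", "Not Collected", "missing: lab stock"]
abundance = missing_value_abundance(column)
```

Here `abundance` reports `"not collected"` for two of the five values and `"missing: lab stock"` for one; the capitalised variant is not counted.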
'not applicable' is allowed alone, with no additional information.
Is 'missing' allowed on its own?
From what Peter provided, would the MissingValueEnum really be
- not applicable
- not collected
- not provided
- restricted access
- missing: control sample
- missing: sample group
- missing: synthetic construct
- missing: lab stock
- missing: third party data
- ~missing: data agreement established pre-2023~
- missing: endangered species
- missing: human-identifiable
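As a rough illustration of what that enum would contain, here is a Python `Enum` sketch with the eleven values left once the struck-through pre-2023 term is dropped. The member names are made up for this example, and the real definition would presumably live in the schema YAML rather than in Python.

```python
from enum import Enum


class MissingValueEnum(Enum):
    """Sketch of the proposed enum; member names are hypothetical."""

    NOT_APPLICABLE = "not applicable"
    NOT_COLLECTED = "not collected"
    NOT_PROVIDED = "not provided"
    RESTRICTED_ACCESS = "restricted access"
    MISSING_CONTROL_SAMPLE = "missing: control sample"
    MISSING_SAMPLE_GROUP = "missing: sample group"
    MISSING_SYNTHETIC_CONSTRUCT = "missing: synthetic construct"
    MISSING_LAB_STOCK = "missing: lab stock"
    MISSING_THIRD_PARTY_DATA = "missing: third party data"
    MISSING_ENDANGERED_SPECIES = "missing: endangered species"
    MISSING_HUMAN_IDENTIFIABLE = "missing: human-identifiable"


def parse_missing_value(text: str) -> MissingValueEnum:
    """Look up a member by its exact value; raises ValueError for anything else."""
    return MissingValueEnum(text)
```

With this sketch, `parse_missing_value("not applicable")` succeeds while bare `"missing"` raises `ValueError`, which matches the current behaviour described earlier in the thread.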
Making "missing" valid means numeric fields are now text, and that's inconsistent.
@Woolly-at-EBI to still check in & confirm.
I propose we start a new issue on how GSC standards are going to handle the reporting of MAR (missing at random), MNAR (missing not at random), MCAR (missing completely at random), and other forms of missing data generically.
We absolutely should not allow strings like "missing" into numeric fields.
We could map to or allow INSDC values (among others) in a broader specification on how to add explanations to empty values, without messing up data types
Here is the rst file (I had to give it a .txt extension), and I rendered it into PDF via HTML (from the rst) so that one can see it more clearly. Had to learn the wonders of pandoc, so a useful side effect.
Reporting Missing Values.pdf missing-values.rst.txt
- These are what I want to make live on the INSDC page once my colleague Maira is back from her holidays, as these minor changes increase the clarity. (She controls acceptance of PRs for the user documentation.)
Colman (the ENA product owner) agreed that it was inconsistent that not all terms need the "missing: " prefix, and will propose that as a change at the next INSDC meeting (May 2025!).
And to answer Montana's question: following the logic of the documentation, all the highest-level terms which include "missing" ought to be accepted as standalone terms. Currently, the regex at ENA does not support this. I am now checking this with the NCBI and DDBJ; if yes, I will tweak the regexes, at least at ENA, to accept them. If not, I will push strongly for an early inclusion.
Converting to discussion until we resolve GSC methods