Add term for denoising method

Open LynnDelgat opened this issue 1 year ago • 6 comments

New term details

Term name - Denoising approach
Structured comment name - denoising_appr
Definition - Tool and parameters used to denoise sequence reads.
Expected value - algorithm name, version and relevant parameters
Value syntax - {algorithm name and version};{parameter name 1:parameter value 1, parameter name 2:parameter value 2,...}
Example - UNOISE3;alpha:2
Preferred unit - NA
Extension(s) - MimarksS, MimarksC
Relationship to other MIXS terms - NA

Additional context

Denoising is a widely used method in (meta)barcoding and an essential step in the bioinformatics processing pipeline. A specific term to document this step is currently missing. More information can also be found in this GitHub issue: https://github.com/gbif/doc-publishing-dna-derived-data/issues/147

LynnDelgat avatar Jan 05 '24 15:01 LynnDelgat

Alternatively, "otu_class_appr" could be made more inclusive (see also https://github.com/GenomicsStandardsConsortium/mixs/issues/603), so that it covers not only clustering methods but also denoising methods (e.g. "Tool and parameters used when clustering and/or denoising reads."). In that case, it would be useful to have a term where the broad category of the method (clustering vs. denoising) can be indicated with a controlled vocabulary, so that users can easily distinguish whether a clustering or denoising method was used. (A disadvantage of also using "otu_class_appr" for denoising methods, however, is that the name would be quite misleading, since it contains "otu".)

LynnDelgat avatar Jul 02 '24 10:07 LynnDelgat

As I think about data re-use, I'm with @LynnDelgat on this one. We should create a new term that captures both methods of grouping reads, without the misleading 'otu' prefix.

The point of this information is to understand how the reads were grouped so that a representative sequence could be analyzed. No doubt there will be new methods in the future, and I think the best way forward is a term that acts as a coarse filter (the controlled vocab mentioned above) and then additional terms that capture the nuanced variation of the methods.

sformel-usgs avatar Jul 09 '24 13:07 sformel-usgs

@LynnDelgat can you please give some more examples of valid and invalid values for this proposed denoising_appr term?

turbomam avatar Jul 09 '24 14:07 turbomam

I am concerned about the ... part of the parameter name & value pattern, and the fact that there are no constraints on the 'algorithm name and version' or the 'parameter name'.

Do we aspire that different submitters using this new term will populate it in the same way, to enable meaningful searches and groupings?

turbomam avatar Jul 09 '24 14:07 turbomam

@turbomam To enable meaningful searches and groupings, it would probably be easier to split each component into separate fields, but I suggested this term by analogy with other existing terms, which all seem to group software, version and parameters in one field. From a personal viewpoint, this field (or otu_class_appr, if we decide to make that one more inclusive) is meant to document provenance/methodology, and it will always be difficult to run meaningful searches on it, since people could write the algorithm name or parameter names however they like (in the absence of controlled vocabularies for them). For a field to filter on, one describing the broad category of the method with a controlled vocabulary would suffice for our intended use. However, other people might of course need more detailed searches.

I'll try to be clearer so that a constraining pattern could be added if needed (though I am not sure I am the best person to determine this):

Revised value syntax - {algorithm name};{versionnumber};({parameter name 1:parameter value 1,parameter name 2:parameter value 2,...,parameter name n:parameter value n}|{"default parameters"})

Examples:

  • UNOISE;3;alpha:2
  • UNOISE;3;alpha:2,minsize:8
  • UNOISE;3;default parameters

So the proposed pattern would be something like: any character any number of times (min. 1), then ";", then any character any number of times (min. 1), then ";", and then either "default parameters" or one or more repetitions of: any character any number of times (min. 1), ":", any character any number of times (min. 1), with "," separating the repetitions. But I don't know whether that is too restrictive, because if a data provider is only willing or able to provide the algorithm name, we would still like to be able to capture that. So these should probably also be allowed:

  • UNOISE;3 (unknown parameters)
  • UNOISE (unknown parameters and unknown version number)
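As a rough illustration of the pattern described above, here is a minimal Python sketch. It is not part of MIxS: the names `DENOISING_APPR` and `parse_denoising_appr` are hypothetical, and forbidding ";", "," and ":" inside parameter names and values is an assumption made only to keep the components separable.

```python
import re

# Assumption: parameter names and values may not contain ';', ',' or ':'.
TOKEN = r"[^;,:]+"
PARAM = rf"{TOKEN}:{TOKEN}"

# Algorithm name is required; version and parameters are optional,
# matching the relaxed examples "UNOISE;3" and "UNOISE".
DENOISING_APPR = re.compile(
    rf"^(?P<algorithm>[^;]+)"
    rf"(?:;(?P<version>[^;]+)"
    rf"(?:;(?P<parameters>default parameters|{PARAM}(?:,{PARAM})*))?"
    rf")?$"
)


def parse_denoising_appr(value):
    """Split a value into its components, or return None if it does not match."""
    match = DENOISING_APPR.match(value)
    if match is None:
        return None
    params = match.group("parameters")
    if params is not None and params != "default parameters":
        # Turn "alpha:2,minsize:8" into {"alpha": "2", "minsize": "8"}.
        params = dict(pair.split(":", 1) for pair in params.split(","))
    return {
        "algorithm": match.group("algorithm"),
        "version": match.group("version"),
        "parameters": params,
    }


if __name__ == "__main__":
    for example in ["UNOISE;3;alpha:2", "UNOISE;3;default parameters", "UNOISE"]:
        print(example, "->", parse_denoising_appr(example))
```

Note that the originally proposed example `UNOISE3;alpha:2` also matches this pattern, with `alpha:2` read as the version number, which shows how little the "any character" components actually constrain.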

LynnDelgat avatar Jul 10 '24 09:07 LynnDelgat

Thanks @LynnDelgat for the additional valid examples. I agree that there is a strong precedent for pseudo-specifications, like you provided, in MIxS, and I appreciate your effort at consistency. And I don't think you are accountable for solving these problems. Ultimately, I'm responding to this proposal for other technical implementers to review.

Having said that, I see this kind of specification to be one of MIxS' greatest weaknesses. These terms are not at all machine actionable, and in my experience they aren't very useful for human review either. I can give some examples from the INSDC Biosample records if you want.

So, if you're interested in discussing this more, my next questions would be

  • Can you describe the end-user's use case for this term or other terms of similar style? Do you ever retrieve biosamples or their related sequence files and make some decisions based on the contents of terms like this? Or are terms like this really write-only, populated out of obligation on the submitter's part, but not really intended for end users to act upon?
  • If they are write-only, what would you think of flagging terms like this as being "not intended for searching, filtering or parsing" within the MIxS model?
  • One development philosophy we're pursuing here, based on the NMDC schema and other LinkML schemas, is that invalid examples are as important as valid examples. Is there anything that you would consider an invalid value for this new term? As a trivial example, if the submitter is providing an E- or p-value cutoff, are 0.001 and 1e-2 both acceptable?
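
To make the last question concrete, here is a minimal Python sketch of pairing valid and invalid examples for a numeric cutoff. The helper `normalize_cutoff` and both example lists are hypothetical and are not taken from MIxS, NMDC or LinkML; whether notations such as "1e-2" should be accepted is exactly the open question.

```python
from decimal import Decimal, InvalidOperation


def normalize_cutoff(value):
    """Return the cutoff as a Decimal, or None if it is not a finite number."""
    try:
        number = Decimal(value)
    except InvalidOperation:
        return None
    return number if number.is_finite() else None


# Assumption: plain and scientific notation are both treated as valid.
VALID_EXAMPLES = ["0.001", "1e-2", "1E-2"]
# Assumption: empty strings, words, decimal commas and NaN are invalid.
INVALID_EXAMPLES = ["", "abc", "0,001", "NaN"]

if __name__ == "__main__":
    for text in VALID_EXAMPLES + INVALID_EXAMPLES:
        print(repr(text), "->", normalize_cutoff(text))
```

Under these assumptions "1e-2" and "0.01" normalize to the same number, so a search on the normalized value would group them even though the submitted strings differ.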

turbomam avatar Jul 10 '24 14:07 turbomam