
use 'detection' terminology for fraction-of-genome-kmers-found

Open ctb opened this issue 1 year ago • 1 comment

At STAMPS 2022, I described the fraction of genome k-mers found (p_match in gather text output, f_match_query in the prefetch and gather CSV output) as the genomic "extent". I learned that a different/additional term for this is "detection".

I found this reference which seems to use the term: https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-019-0690-0

and am searching for more.

As part of the documentation revamp, e.g. https://github.com/sourmash-bio/sourmash/issues/1289, we could switch to using 'detection', which is growing on me as a term...

ctb, Aug 02 '22 12:08

Per Mike Lee:

I first came across it from anvio when that first came out, though it’s not described in the initial paper. The only place I think it’s documented is actually a page I put together for the anvio site 5 years ago covering a few of its terms. For detection, it’s here written in the context of contigs with an example visualization of what you already understand: https://merenlab.org/2017/05/08/anvio-views/#detection

But the generalized definition would just be something like: The proportion of a given reference sequence that is covered/matched/identified at least 1X.

Or more straightforward: The proportion of a given reference sequence that is detected.
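For anyone who wants the definition above in concrete terms, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the function names (`detection_from_coverage`, `kmers`, `detection_from_kmers`) are hypothetical and not part of sourmash or anvi'o, and it uses exact k-mer sets from a single strand, whereas sourmash works on hashed, downsampled sketches.

```python
def detection_from_coverage(depths):
    """Proportion of reference positions covered at least 1X.

    `depths` is a list of per-position read depths (hypothetical input).
    """
    if not depths:
        raise ValueError("reference has no positions")
    covered = sum(1 for d in depths if d >= 1)
    return covered / len(depths)


def kmers(seq, k):
    """Set of all k-mers in `seq` (forward strand only, for simplicity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def detection_from_kmers(genome_kmers, sample_kmers):
    """Fraction of the genome's k-mers that are also found in the sample."""
    if not genome_kmers:
        raise ValueError("genome has no k-mers")
    return len(genome_kmers & sample_kmers) / len(genome_kmers)


# Two of four positions have depth >= 1.
cov_detection = detection_from_coverage([0, 3, 1, 0])  # 0.5

# Toy sequences: the genome has 4 distinct 3-mers, 3 of which are in the sample.
genome = kmers("ACGTAC", 3)
sample = kmers("CGTACC", 3)
kmer_detection = detection_from_kmers(genome, sample)  # 0.75
```

The k-mer version is the analog of the "fraction of genome k-mers found" discussed in this issue: the denominator is always the reference genome, not the sample, so the value does not depend on how deep or how large the metagenome is beyond whether each k-mer was seen at least once.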

The first time [Meren] used it for a genome-level purpose, and defined it, was in this paper: https://peerj.com/articles/4320/

ctb, Aug 26 '22 13:08