sourmash update documentation re k-mer trimming and gather vs search

update documentation re k-mer trimming and gather vs search

Open ctb opened this issue 2 years ago • 2 comments

a lot of our ancillary documentation and workflows (genome-grist and spacegraphcats) suggests doing k-mer trimming of metagenomes, e.g. with trim-low-abund.

but, for gather in particular, this is not particularly important. we should update documentation with a few points -

erroneous k-mers from sequencing errors will typically not match between samples; this is a straightforward statistical argument.
erroneous k-mers from sequencing errors will also generally be low-abundance
Jaccard similarity and containment are both fractions that take into account total number of k-mers - including erroneous k-mers. so both similarity and containment fractions (as produced by search and search --containment and prefetch) will be decreased by the presence of erroneous k-mers. If this is a problem for you, you should trim.
however, sourmash gather uses shared k-mers, and is robust to erroneous k-mers - in particular, p_match to genomes will be robust b/c the denominator uses a genome size, and the numerator is shared k-mers and hence will ignore errors. ref https://github.com/sourmash-bio/sourmash/issues/1289
similarly, ANI estimates based on anchor or max_containment will generally be more robust to sequencing errors
also, if you're worried about erroneous k-mers, sourmash gather with an abundance-weighted query will perform better because low-abundance k-mers will have less impact on the final coverage estimate. ref https://github.com/sourmash-bio/sourmash/issues/1818

@bluegenes this sort of fits into some of the things we've been realizing with respect to jaccard vs containment - containment is much more robust with respect to erroneous k-mers, in various simple but important ways.

ref trimming paragraph in https://github.com/sourmash-bio/sourmash/issues/1135:

Note that this could well be due to sequencing errors: if you don't do k-mer based error trimming (as above), and you have two communities that are very similar and have been deeply sequenced, this is the result I would expect to see. The reason is that erroneous k-mers will always be low abundance, while your true k-mers in a deeply sequenced metagenome will be high abundance

Jul 12 '22 13:07 ctb

per conversation in https://github.com/sourmash-bio/sourmash/issues/2266, I think adapter trimming and quality filtering is also generally unnecessary. Although if QC programs like FastQC tell you you have a major problem, that might be something worth doing (or, resequencing ;).

Sep 08 '22 13:09 ctb

here are some results from sourmash gather running on four different metagenomes after the report-both-weighted-and-unweighted PR was merged, https://github.com/sourmash-bio/sourmash/pull/2301:

(duplicated from https://github.com/dib-lab/genome-grist/issues/197#issuecomment-1263687946)

here are the results!

metagenome	unweighted match	weighted match
SRR606249 (podar)	47.8%	95.9%
SRR1976948 (hu-s1)	21.4%	68.0%
SRR12324253 (zymo mock)	16.2%	96.6%
SRR5650070 (p8808mo11/iHMP)	34.8%	85.6%

a few observations -

the two mock communities, podar and zymo, are (almost) completely matched w/weighting! this is presumably because they're (a) mostly or completely built from organisms in reference databases, and (b) deeply sequence
but boy are the mock communities not matched unweighted... this is (should be) the effect of sequencing errors
the iHMP data set is pretty good - 85.6% matched to ref genomes!
even the hu-s1 (~lowish diversity environmental oil well) data set is ~mostly matched

Note this is with genome-grist v0.9.0, so adapter and quality trimmed only.

Sep 30 '22 15:09 ctb

sourmash sourmash copied to clipboard

update documentation re k-mer trimming and gather vs search

sourmash
sourmash copied to clipboard