sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

update documentation re k-mer trimming and gather vs search

Open ctb opened this issue 2 years ago • 2 comments

a lot of our ancillary documentation and workflows (genome-grist and spacegraphcats) suggests doing k-mer trimming of metagenomes, e.g. with trim-low-abund.

but, for gather in particular, this is not particularly important. we should update documentation with a few points -

  • erroneous k-mers from sequencing errors will typically not match between samples; this is a straightforward statistical argument.
  • erroneous k-mers from sequencing errors will also generally be low-abundance
  • Jaccard similarity and containment are both fractions that take into account total number of k-mers - including erroneous k-mers. so both similarity and containment fractions (as produced by search and search --containment and prefetch) will be decreased by the presence of erroneous k-mers. If this is a problem for you, you should trim.
  • however, sourmash gather uses shared k-mers, and is robust to erroneous k-mers - in particular, p_match to genomes will be robust b/c the denominator uses a genome size, and the numerator is shared k-mers and hence will ignore errors. ref https://github.com/sourmash-bio/sourmash/issues/1289
  • similarly, ANI estimates based on anchor or max_containment will generally be more robust to sequencing errors
  • also, if you're worried about erroneous k-mers, sourmash gather with an abundance-weighted query will perform better because low-abundance k-mers will have less impact on the final coverage estimate. ref https://github.com/sourmash-bio/sourmash/issues/1818

@bluegenes this sort of fits into some of the things we've been realizing with respect to jaccard vs containment - containment is much more robust with respect to erroneous k-mers, in various simple but important ways.

ref trimming paragraph in https://github.com/sourmash-bio/sourmash/issues/1135:

Note that this could well be due to sequencing errors: if you don't do k-mer based error trimming (as above), and you have two communities that are very similar and have been deeply sequenced, this is the result I would expect to see. The reason is that erroneous k-mers will always be low abundance, while your true k-mers in a deeply sequenced metagenome will be high abundance

ctb avatar Jul 12 '22 13:07 ctb

per conversation in https://github.com/sourmash-bio/sourmash/issues/2266, I think adapter trimming and quality filtering is also generally unnecessary. Although if QC programs like FastQC tell you you have a major problem, that might be something worth doing (or, resequencing ;).

ctb avatar Sep 08 '22 13:09 ctb

here are some results from sourmash gather running on four different metagenomes after the report-both-weighted-and-unweighted PR was merged, https://github.com/sourmash-bio/sourmash/pull/2301:

(duplicated from https://github.com/dib-lab/genome-grist/issues/197#issuecomment-1263687946)

here are the results!

metagenome unweighted match weighted match
SRR606249 (podar) 47.8% 95.9%
SRR1976948 (hu-s1) 21.4% 68.0%
SRR12324253 (zymo mock) 16.2% 96.6%
SRR5650070 (p8808mo11/iHMP) 34.8% 85.6%

a few observations -

  • the two mock communities, podar and zymo, are (almost) completely matched w/weighting! this is presumably because they're (a) mostly or completely built from organisms in reference databases, and (b) deeply sequence
  • but boy are the mock communities not matched unweighted... this is (should be) the effect of sequencing errors
  • the iHMP data set is pretty good - 85.6% matched to ref genomes!
  • even the hu-s1 (~lowish diversity environmental oil well) data set is ~mostly matched

Note this is with genome-grist v0.9.0, so adapter and quality trimmed only.

ctb avatar Sep 30 '22 15:09 ctb