sourmash
sourmash copied to clipboard
update documentation re k-mer trimming and gather vs search
a lot of our ancillary documentation and workflows (genome-grist and spacegraphcats) suggests doing k-mer trimming of metagenomes, e.g. with trim-low-abund.
but, for gather in particular, this is not particularly important. we should update documentation with a few points -
- erroneous k-mers from sequencing errors will typically not match between samples; this is a straightforward statistical argument.
- erroneous k-mers from sequencing errors will also generally be low-abundance
- Jaccard similarity and containment are both fractions that take into account total number of k-mers - including erroneous k-mers. so both similarity and containment fractions (as produced by
search
andsearch --containment
andprefetch
) will be decreased by the presence of erroneous k-mers. If this is a problem for you, you should trim. - however,
sourmash gather
uses shared k-mers, and is robust to erroneous k-mers - in particular,p_match
to genomes will be robust b/c the denominator uses a genome size, and the numerator is shared k-mers and hence will ignore errors. ref https://github.com/sourmash-bio/sourmash/issues/1289 - similarly, ANI estimates based on anchor or max_containment will generally be more robust to sequencing errors
- also, if you're worried about erroneous k-mers,
sourmash gather
with an abundance-weighted query will perform better because low-abundance k-mers will have less impact on the final coverage estimate. ref https://github.com/sourmash-bio/sourmash/issues/1818
@bluegenes this sort of fits into some of the things we've been realizing with respect to jaccard vs containment - containment is much more robust with respect to erroneous k-mers, in various simple but important ways.
ref trimming paragraph in https://github.com/sourmash-bio/sourmash/issues/1135:
Note that this could well be due to sequencing errors: if you don't do k-mer based error trimming (as above), and you have two communities that are very similar and have been deeply sequenced, this is the result I would expect to see. The reason is that erroneous k-mers will always be low abundance, while your true k-mers in a deeply sequenced metagenome will be high abundance
per conversation in https://github.com/sourmash-bio/sourmash/issues/2266, I think adapter trimming and quality filtering is also generally unnecessary. Although if QC programs like FastQC tell you you have a major problem, that might be something worth doing (or, resequencing ;).
here are some results from sourmash gather running on four different metagenomes after the report-both-weighted-and-unweighted PR was merged, https://github.com/sourmash-bio/sourmash/pull/2301:
(duplicated from https://github.com/dib-lab/genome-grist/issues/197#issuecomment-1263687946)
here are the results!
metagenome | unweighted match | weighted match |
---|---|---|
SRR606249 (podar) | 47.8% | 95.9% |
SRR1976948 (hu-s1) | 21.4% | 68.0% |
SRR12324253 (zymo mock) | 16.2% | 96.6% |
SRR5650070 (p8808mo11/iHMP) | 34.8% | 85.6% |
a few observations -
- the two mock communities, podar and zymo, are (almost) completely matched w/weighting! this is presumably because they're (a) mostly or completely built from organisms in reference databases, and (b) deeply sequence
- but boy are the mock communities not matched unweighted... this is (should be) the effect of sequencing errors
- the iHMP data set is pretty good - 85.6% matched to ref genomes!
- even the hu-s1 (~lowish diversity environmental oil well) data set is ~mostly matched
Note this is with genome-grist v0.9.0, so adapter and quality trimmed only.