Weirdly high unclassified proportion in defined metagenome

Open peterdoug opened this issue 8 months ago • 3 comments

Hi! I'm working with a defined metagenome with 12 species. After sequencing with nanopore, I ran sourmash gather and tax metagenome against a database of gtdbtk.

In terms of taxonomic assignments, this worked very well, both with respect to precision and accuracy: all 12 species were identified with only a handful of false positives. However, I'm confused about the percentage of "unclassified".

When I sum up the values of 'f_unique_to_query' in the raw gather output, I only get 0.08. Also, when examining the krona output from tax, 92% of the metagenome is unclassified, so these numbers agree. However, when I examine the human output from tax metagenome, it says only 6.6% is unclassified. How is this value calculated and why is it different?
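For anyone wanting to reproduce the comparison, here is a minimal sketch of summing the two fraction columns of a gather CSV with the Python standard library. The column names `f_unique_to_query` and `f_unique_weighted` are the ones discussed in this thread; the two-row table, the genome names, and the helper `sum_fractions` are made up for illustration and are not real results.

```python
import csv
import io


def sum_fractions(csv_text, columns=("f_unique_to_query", "f_unique_weighted")):
    """Sum the requested fraction columns over all rows of a gather CSV."""
    totals = {name: 0.0 for name in columns}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in columns:
            totals[name] += float(row[name])
    return totals


# Hypothetical two-row gather output; the values are invented for the example.
EXAMPLE = """name,f_unique_to_query,f_unique_weighted
genome_A,0.05,0.60
genome_B,0.03,0.33
"""

totals = sum_fractions(EXAMPLE)
for name, total in totals.items():
    # Whatever is not assigned to a match counts as unclassified.
    print(f"{name}: classified={total:.2f} unclassified={1 - total:.2f}")
```

With a real file, the same function can be fed the text of the CSV written by sourmash gather, e.g. `sum_fractions(open(path).read())`, where `path` is wherever that output was saved.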

In general, 6.6% unclassified makes a lot more sense for a defined metagenome where all the member genomes are in my database. I also ran sourmash on my metagenomic assembly: 128/134 contigs were classified, and all were classified as one of the 12 member species in my defined metagenome. So it doesn't seem like there is a large percentage of some "other" organism in this metagenome.

Any ideas what I might be doing wrong here? Thanks!

peterdoug avatar Mar 11 '25 16:03 peterdoug

more in a bit, but: what is the sum of f_unique_weighted? And what does sourmash gather report as the % weighted assigned?

In brief: sequencing errors cause a lot of novel k-mers. If you weight them by multiplicity (or filter out low-abundance k-mers with sourmash sig filter) you will be able to ignore the low abundance ones, and that might (should) improve your output.
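A toy illustration of the effect, not sourmash's actual code: the k-mer names, counts, and the helpers `matched_fractions` and `drop_low_abundance` are all invented for the example. It assumes a query where a few genuine k-mers are seen many times and many error k-mers are seen once each, and compares the matched fraction with and without abundance weighting.

```python
def matched_fractions(query_abunds, reference_hashes):
    """Return (unweighted, weighted) fraction of the query found in the reference."""
    matched = [h for h in query_abunds if h in reference_hashes]
    # Unweighted: every distinct k-mer counts once.
    unweighted = len(matched) / len(query_abunds)
    # Weighted: every distinct k-mer counts as many times as it was observed.
    weighted = sum(query_abunds[h] for h in matched) / sum(query_abunds.values())
    return unweighted, weighted


def drop_low_abundance(query_abunds, min_abund=2):
    """Keep only k-mers observed at least min_abund times."""
    return {h: a for h, a in query_abunds.items() if a >= min_abund}


# Hypothetical query: 8 genuine k-mers seen 12 times each,
# plus 92 error k-mers seen once each.
query = {f"real_{i}": 12 for i in range(8)}
query.update({f"error_{i}": 1 for i in range(92)})
reference = {f"real_{i}" for i in range(8)}

unweighted, weighted = matched_fractions(query, reference)
filtered_unweighted, _ = matched_fractions(drop_low_abundance(query), reference)

print(f"unweighted match:          {unweighted:.2f}")
print(f"abundance-weighted match:  {weighted:.2f}")
print(f"unweighted after filtering: {filtered_unweighted:.2f}")
```

The singleton error k-mers dominate the count of distinct k-mers but contribute little to the total abundance, which is why the unweighted figure looks far worse than the weighted one in this toy case.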

ctb avatar Mar 11 '25 16:03 ctb

You also might have to do a sketch with -p abund, e.g. sourmash sketch dna -p k=31,abund or some such.

ctb avatar Mar 11 '25 16:03 ctb

Ah, the sum of f_unique_weighted is 0.93. I assume this is the column that tax metagenome -F human uses? To test, I copied the values of f_unique_weighted to f_unique_to_query and ran tax metagenome again. The new csv_summary and krona output files have 7% unassigned and match the human-format abundance output. (They also more accurately reflect the actual microbial community abundances. :)) Why isn't f_unique_weighted used to generate metagenomic abundances? And is it a bad idea to do so? FYI, my query was sketched with -p abund. Thanks!
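The column-copy experiment described above could be sketched roughly as follows. This is only an illustration of the manual test, not a recommended workflow; the helper name `copy_weighted_column` and the sample rows are made up, and only the two column names come from this thread.

```python
import csv
import io


def copy_weighted_column(csv_text):
    """Return CSV text with f_unique_to_query overwritten by f_unique_weighted."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Overwrite the unweighted fraction with the abundance-weighted one.
        row["f_unique_to_query"] = row["f_unique_weighted"]
        writer.writerow(row)
    return out.getvalue()


# Hypothetical gather rows; the values are invented for the example.
EXAMPLE = """name,f_unique_to_query,f_unique_weighted
genome_A,0.05,0.60
genome_B,0.03,0.33
"""

patched = copy_weighted_column(EXAMPLE)
print(patched)
```

All other columns pass through unchanged, so the patched text keeps the same header as the input.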

peterdoug avatar Mar 12 '25 09:03 peterdoug

This took a while to track down and resolve fully; see https://github.com/sourmash-bio/sourmash/pull/3711 for too many details!

Once that PR is merged and we release sourmash v4.9.4, you can use --use-abund to get consistent reporting with -F human (which uses abundances) and -F krona (which does not, by default).

Again, our apologies for making this so difficult and inconsistent!

You may also be interested in https://github.com/taxburst/taxburst if you're using krona, BTW :). #advertisement

ctb avatar Jul 31 '25 14:07 ctb

Great, thank you for clarifying this!

peterdoug avatar Aug 06 '25 06:08 peterdoug

sourmash v4.9.4 has been released to a pypi and a conda-forge near you!

ctb avatar Aug 09 '25 15:08 ctb