sourmash
sourmash copied to clipboard
sourmash tax metagenome krona and kreports are vastly different?
I am running sourmash v4.8.14 on a PacBio HiFi long read dataset. It is a test dataset with ~50% representation from two cultured taxa.
I am using this snakemake workflow (https://github.com/PacificBiosciences/pb-metagenomics-tools/tree/master/Taxonomic-Profiling-Sourmash) which essentially is running:
sourmash tax metagenome -g {input.gather} -t {input.lineages} -o {params.out_base} \
--output-dir {params.outd} --output-format krona csv_summary kreport \
--rank species 2> {log}
The resulting krona and kreports are vastly different with the krona report suggesting ~53% unclassified and the kreport only suggesting ~4% unclassified (which I tend to believe more). However, I could just not quite understand how the krona report is generated. So just looking to shed some light on why these two reports are so different.
If I put the results into krona for visualization, the two files would look like this:
kreport
krona
hi @RhettRautsaw sorry for delay - it is almost certainly due to the krona report not using abundances; see https://github.com/sourmash-bio/sourmash/issues/3577 for context. I will look into it!
cc @bluegenes
Hi @ctb. No problem! Thanks for taking the time to look into this and providing some insight.
This is being resolved in https://github.com/sourmash-bio/sourmash/pull/3711, which was a bit of a beast to make sense of. But I think I understand it all now and have both documented and tested it 🎉 !
In brief, krona format was not using abundances, while kreport was. As of v4.9.4, appropriate warnings will be printed, and --use-abund will drive consistent reporting across all formats.
You may also be interested in https://github.com/taxburst/taxburst if you're using krona, BTW :). #advertisement
sourmash v4.9.4 has been released to a pypi and a conda-forge near you!
Thanks @ctb!