
`sourmash tax metagenome` behaves incorrectly when there are missing lineage ranks

Open ctb opened this issue 6 months ago • 0 comments

First off, this is definitely an odd situation where undefined behavior is, well, undefined.

In brief, when assigning taxonomies using lineages with missing ranks, such as this from the podar test data set:

CP001941,439481,Archaea,Euryarchaeota,,,,Aciduliprofundum,Aciduliprofundum boonei,Aciduliprofundum boonei T469

I get `tax metagenome` output like this:

100.00  3603000 0       D               Archaea
100.00  3603000 0       P               Euryarchaeota
59.03   2127000 0       C               Archaeoglobi
40.97   1475999 1475999 U               unclassified
59.03   2127000 0       O               Archaeoglobales
59.03   2127000 0       F               Archaeoglobaceae
59.03   2127000 0       G               Archaeoglobus
40.97   1476000 0       G               Aciduliprofundum
59.03   2127000 2127000 S               Archaeoglobus fulgidus
40.97   1476000 1476000 S               Aciduliprofundum boonei

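where the `unclassified` row (40.97%) is actually the species-level assignment of Aciduliprofundum boonei, which is then also reported at genus and species rank, so the same content is counted twice.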

The simplest first fix might be to add checks that, at each rank, the unclassified + rank percentages add up to around 100... because at the very least we should be throwing an error here! 😆
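A rough sketch of what such a check could look like, working on the report text rather than on sourmash internals. The function name `check_rank_totals`, the tolerance, and the assumption that a single `U` row applies to every rank are all mine for illustration, not existing sourmash code:

```python
from collections import defaultdict

# Report rows copied from the output above: percent, cumulative count,
# direct count, rank code, name (the empty taxid column collapses on split).
REPORT = """\
100.00  3603000 0       D               Archaea
100.00  3603000 0       P               Euryarchaeota
59.03   2127000 0       C               Archaeoglobi
40.97   1475999 1475999 U               unclassified
59.03   2127000 0       O               Archaeoglobales
59.03   2127000 0       F               Archaeoglobaceae
59.03   2127000 0       G               Archaeoglobus
40.97   1476000 0       G               Aciduliprofundum
59.03   2127000 2127000 S               Archaeoglobus fulgidus
40.97   1476000 1476000 S               Aciduliprofundum boonei
"""


def check_rank_totals(report_text, tolerance=0.1):
    """Hypothetical helper: return {rank_code: total_percent} for every rank
    where (sum of rank percentages + unclassified) is not within `tolerance`
    of 100. Assumes one 'U' row that counts toward every rank."""
    per_rank = defaultdict(float)
    unclassified = 0.0
    for line in report_text.splitlines():
        if not line.strip():
            continue
        # maxsplit=4 keeps multi-word names such as "Archaeoglobus fulgidus" whole
        percent, _cumulative, _direct, rank, _name = line.split(None, 4)
        if rank == "U":
            unclassified += float(percent)
        else:
            per_rank[rank] += float(percent)

    bad = {}
    for rank, total in per_rank.items():
        total += unclassified
        if abs(total - 100.0) > tolerance:
            bad[rank] = round(total, 2)
    return bad


if __name__ == "__main__":
    problems = check_rank_totals(REPORT)
    if problems:
        print(f"rank percentages do not add up to ~100: {problems}")
```

On the report above this flags D, P, G and S at 140.97% each, while C, O and F come out at 100%. In real code this would presumably raise an error rather than print.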

ctb, Dec 24 '23 16:12