
`gather_at_rank` handling ties in taxonomic assignment

Open • taylorreiter opened this issue 4 years ago • 0 comments

#171 updated charcoal to sourmash>=4.1.0, including switching from `sourmash search` to `sourmash prefetch`. As a result, the taxonomy output for one contig in the test file `LoombaR_2017__SID1050_bax__bin.11.fa.gz` changed. As recorded in that issue:

jq . < tests/test-data/loomba/LoombaR_2017__SID1050_bax__bin.11.fa.gz.contigs-tax.json > out.old
jq . < tests/test-data/loomba/LoombaR_2017__SID1050_bax__bin.11.fa.gz.contigs-tax.json > out.new

diff out.old out.new
2629c2629
<             "f__Acutalibacteraceae"
---
>             "f__Oscillospiraceae"
2633c2633
<             "g__Anaeromassilibacillus"
---
>             "g__Flavonifractor"

@ctb surmised:

This is likely because gather doesn't report ties, per dib-lab/sourmash#1366 and dib-lab/sourmash#278. It is slightly surprising in this case that the tie here is above the family level (!!) but these things happen.

It's probably a good idea for gather_at_rank to detect and handle/report such ties, and probably pull the taxonomic assignment up to the level above the tie.
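A minimal sketch of the tie handling suggested above, for illustration only. It is not charcoal's actual `gather_at_rank` code: the function name `resolve_ties`, the `(lineage, score)` input shape, and the `tol` parameter are all assumptions made for this example. The idea is to collect every match that shares the best score and report only the part of the lineage those matches agree on, which pulls the assignment up to the level above the tie.

```python
def resolve_ties(matches, tol=0):
    """Return (consensus_lineage, tied_lineages) for a list of matches.

    `matches` is a hypothetical input shape: a list of (lineage, score)
    pairs, where `lineage` is a tuple of rank names ordered from the
    highest rank to the lowest. `tol` is the score difference still
    treated as a tie (0 means exact ties only).
    """
    if not matches:
        return (), []

    best = max(score for _, score in matches)
    # Every lineage whose score is within `tol` of the best one is tied.
    tied = [lineage for lineage, score in matches if best - score <= tol]

    # Walk down the ranks and stop at the first rank where the tied
    # lineages disagree; what remains is the level above the tie.
    consensus = []
    for names_at_rank in zip(*tied):
        if len(set(names_at_rank)) != 1:
            break
        consensus.append(names_at_rank[0])

    return tuple(consensus), tied


# The family and genus names come from the diff in this issue; the order
# name is a placeholder, since the diff does not show the higher ranks.
old = ("o__PlaceholderOrder", "f__Acutalibacteraceae", "g__Anaeromassilibacillus")
new = ("o__PlaceholderOrder", "f__Oscillospiraceae", "g__Flavonifractor")

consensus, tied = resolve_ties([(old, 100), (new, 100)])
if len(tied) > 1:
    print(f"tie between {len(tied)} lineages; assigning {consensus}")
```

Returning the tied lineages alongside the consensus lets the caller report the tie rather than silently picking one match.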

@bluegenes

taylorreiter, May 21 '21 14:05