
sintax classifier and multiple identical best hits

Open diegomic opened this issue 5 years ago • 4 comments

Dear @torognes,

Using the sintax classifier, I noticed that when there are multiple identical best hits, the algorithm outputs only the first hit, irrespective of the hits after it. This may result in a wrong classification if several species have the same sequence in the reference db. In these cases it would probably be better to report the lowest common ancestor of the ambiguous hits. A similar issue was already reported in issue #210 by @andzandz11.

Thank you very much, cheers,
Diego
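
To make the suggestion concrete, here is a minimal sketch of a lowest-common-ancestor step over tied hits. It is not vsearch code: the function name `lowest_common_ancestor` and the example lineages are assumptions for illustration, and the lineages are assumed to be comma-separated rank strings like the ones in sintax-style `tax=` annotations.

```python
from itertools import takewhile


def lowest_common_ancestor(lineages):
    """Return the longest rank prefix shared by all lineages.

    Each lineage is a comma-separated string such as
    "d:Bacteria,p:Firmicutes,g:Bacillus,s:Bacillus_cereus".
    (Hypothetical helper, not part of vsearch.)
    """
    split = [lineage.split(",") for lineage in lineages]
    if not split:
        return ""
    # Walk the ranks in parallel and stop at the first disagreement.
    common = takewhile(lambda ranks: len(set(ranks)) == 1, zip(*split))
    return ",".join(ranks[0] for ranks in common)


# Two hypothetical references with identical sequences but different species.
tied_hits = [
    "d:Bacteria,p:Firmicutes,g:Bacillus,s:Bacillus_cereus",
    "d:Bacteria,p:Firmicutes,g:Bacillus,s:Bacillus_anthracis",
]
print(lowest_common_ancestor(tied_hits))
# d:Bacteria,p:Firmicutes,g:Bacillus
```

With this approach the query would be assigned to the genus rather than to whichever species happens to come first in the database.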

diegomic, Jul 26 '18 12:07

This is fascinating. The sintax algorithm was designed to mitigate over-classification, so I had to go back to the preprint to take a look at why this could be happening.

SINTAX algorithm For a query sequence Q and reference database R...

Turns out that the subsampling is used on each query sequence, but the reference database is not subsampled or shuffled. So sintax is unable to choose between two identical reads in the reference database.

This makes sense to me; if your database includes identical references (in the area sequenced), no tax assigner will be able to tell them apart, because they are identical!

I guess the goal would be to detect and report these multiple best hits (like with a blast output #210), or report a lower confidence for this prediction.
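
As a rough illustration of the difference between "first hit wins" and detecting multiple best hits, here is a small sketch. The scores are made-up shared k-mer counts and `best_hits` is a hypothetical helper; this does not reproduce how vsearch actually scores or selects references.

```python
def best_hits(scores):
    """Return the indices of all references that share the top score.

    (Hypothetical helper for illustration only.)
    """
    if not scores:
        return []
    top = max(scores)
    return [i for i, s in enumerate(scores) if s == top]


# Made-up shared k-mer counts for one query against four references.
# References 1 and 3 are identical in the sequenced region, so they tie.
scores = [12, 30, 7, 30]

# "First hit wins": the tie is silently resolved by database order.
first_only = scores.index(max(scores))

# Tie-aware: every reference with the top score is kept for reporting.
tied = best_hits(scores)

print(first_only, tied)
# 1 [1, 3]
```

Once the tie is visible, it can be reported as is, collapsed to a shared ancestor, or used to lower the reported confidence.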

Colin

colinbrislawn, Jul 26 '18 15:07

I will consider trying to improve the sintax algorithm at a later time.

torognes, Aug 16 '18 08:08

Just a note that I am also seeing something that is likely due to this issue. I recently did a (rough) comparison of Illumina V4 and PacBio full-length 16S using three classifiers. SINTAX gave almost equivalent results for both, while dada2 and QIIME2 showed significant differences based on the length of the target, which I expected. In particular, the SINTAX species-level assignment rate was very high (>60%) for the ~250 nt V4 region.

cjfields, Aug 31 '20 20:08

I have made several improvements to the sintax command in vsearch 2.28.1, just released. Please see issue #535 or the release notes for details.

torognes, Apr 26 '24 13:04