Mismatches in taxonomic ranks with Sintax

Open ashleyp1 opened this issue 1 year ago • 6 comments

I encountered some confusing results while testing sintax on my data. I'm running vsearch v2.28.1 on near-full-length 16S amplicons against a custom database. For some of my samples (mostly ones without high confidence values), I get mixed taxonomies that seem to jump around, as shown below.

0faf4970-8f6a-4a6c-9d55-26f7c80d50fc d:Bacteria(1.00),p:Firmicutes(1.00),c:Bacilli(1.00),o:Bacillales(0.83),g:Exiguobacterium(0.48),s:Exiguobacterium_acetylicum(0.24)
5f9d0909-fe7d-409d-9da8-26c2749bb0cc d:Bacteria(1.00),p:Firmicutes(1.00),c:Bacilli(1.00),o:Bacillales(1.00),g:Exiguobacterium(1.00),s:Exiguobacterium_acetylicum(0.74)
37270a98-6e0c-4130-ae8d-8c47399abcdd d:Bacteria(1.00),p:Firmicutes(1.00),c:Bacilli(1.00),o:Bacillales(0.99),f:Listeriaceae(0.60),g:Listeria(0.60),s:Exiguobacterium_acetylicum(0.25)
0aa23c22-ff54-4b20-8663-ef25a6338227 d:Bacteria(1.00),p:Proteobacteria(0.59),c:Gammaproteobacteria(0.58),o:Enterobacterales(0.57),f:Enterobacteriaceae(0.52),g:Exiguobacterium(0.36),s:Salmonella_enterica(0.29)

The first two show the lineage that I would expect for Exiguobacterium, but how did the next two go from Listeria to Exiguobacterium and from Exiguobacterium to Salmonella?

I thought it was an error in my database at first, but I checked and confirmed that the lineages are all correct and formatted properly. At this point, I assume this is most likely a gap in my understanding of how sintax works. I know the bootstrap values for those two are low enough that I probably won't use them, but I'd still like to understand how this is happening.

Thanks!

ashleyp1 avatar Sep 13 '24 21:09 ashleyp1

Hi, thank you for reporting this issue!

This does not look right.

Although taxonomic ranks with low confidence (e.g. values below 0.8) should not be trusted, the classification should still not jump between different clades in the tree as you go down to the species level.
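A rough way to spot such jumps in a sintax output file is to check that every pair of adjacent ranks in a prediction also occurs as adjacent ranks in at least one reference lineage. The following is only a sketch and not part of vsearch: `flag_lineage_jumps` is a hypothetical helper name, and it assumes that the reference headers carry a `tax=` annotation terminated by a semicolon and that the tab-separated output has the query label in column 1 and the prediction with confidence values in column 2.

```shell
# Hypothetical helper (not a vsearch command).
# Usage: flag_lineage_jumps REFERENCE_FASTA SINTAX_TSV
flag_lineage_jumps() {
    awk -F '\t' '
        # First file: remember every adjacent (parent, child) pair
        # found in the tax= annotation of the reference headers.
        NR == FNR {
            if ($0 ~ /^>/ && match($0, /tax=[^;]*/)) {
                n = split(substr($0, RSTART + 4, RLENGTH - 4), taxa, ",")
                for (i = 2; i <= n; i++) seen[taxa[i-1], taxa[i]] = 1
            }
            next
        }
        # Second file: strip the confidence values from column 2 and
        # report the first adjacent pair that no reference contains.
        {
            pred = $2
            gsub(/\([0-9.]+\)/, "", pred)
            n = split(pred, taxa, ",")
            for (i = 2; i <= n; i++) {
                if (!((taxa[i-1], taxa[i]) in seen)) {
                    print $1 "\t" taxa[i-1] " -> " taxa[i]
                    break
                }
            }
        }
    ' "$1" "$2"
}
```

Called as `flag_lineage_jumps sintax_db.fasta 1-68_sintax.tsv`, it prints one line per query whose prediction contains a parent and child that never appear together in the database, which is the kind of jump reported here.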

I'll look deeper into the issue as soon as possible.

Could you please send me the exact command you ran?

Would it be possible to send me (a subset of) the queries and the database used? Or is it confidential?

torognes avatar Sep 16 '24 10:09 torognes

Here is the command I used. I sent you an invite to a dropbox folder with my database and the sample I first found the issue in. Thanks for looking into this!

vsearch --sintax \
    1-filt-trimmed-HL068_FW.fastq.gz \
    --db sintax_db.fasta \
    --tabbedout 1-68_sintax.tsv \
    --sintax_cutoff 0.7 --strand both -notrunclabels

ashleyp1 avatar Sep 16 '24 18:09 ashleyp1

Thank you, I'll look into it. Got the data.

torognes avatar Sep 17 '24 07:09 torognes

There was a logic bug in the selection of the best lineages. It should now be fixed in commit aa94d1c. I think the bug only shows up when the confidence is below 0.5, so it shouldn't matter much in most cases, although it was confusing.

I will make a new release soon with this fix.

Sorry for the bug and thank you very much for reporting this issue!

torognes avatar Sep 19 '24 16:09 torognes

BTW, I recommend using the --sintax_random option to avoid length bias in the taxonomic classification.
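For instance, the command posted earlier in this thread could be rerun with that option added. This is only a sketch: `classify_with_random_ties` is a hypothetical wrapper name, its three arguments are placeholders for your own reads, database and output file, and the remaining settings are copied from the earlier command. As far as I know, `--sintax_random` is a plain flag that takes no value.

```shell
# Hypothetical wrapper around the sintax command used earlier in this thread.
# Usage: classify_with_random_ties READS DATABASE OUTPUT_TSV
classify_with_random_ties() {
    vsearch --sintax "$1" \
        --db "$2" \
        --tabbedout "$3" \
        --sintax_cutoff 0.7 --strand both --notrunclabels \
        --sintax_random
}
```

With the files from this thread, the call would be `classify_with_random_ties 1-filt-trimmed-HL068_FW.fastq.gz sintax_db.fasta 1-68_sintax.tsv`.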

torognes avatar Sep 19 '24 16:09 torognes

The fixes are available now in release 2.29.0:

https://github.com/torognes/vsearch/releases/tag/v2.29.0

torognes avatar Sep 26 '24 11:09 torognes