Why are there multiple taxids in the output of diamond blastp?

Open ZongzhiWu opened this issue 3 years ago • 1 comments

There are multiple taxids in my output of diamond blastp. For example, here is the blastp result of two proteins. The last column represents taxids.

contig1_28 SCI95225.1 100 136 0 0 1 136 1 136 6.06e-94 277 136 137 99.3 SCI95225.1 Uncharacterised protein [uncultured Clostridium sp.] 38018;59620;2170413

contig2_12 SCI92976.1 100 34 0 0 1 34 1 34 3.93e-13 65.9 34 35 97.1 SCI92976.1 Uncharacterised protein [uncultured Clostridium sp.] 38018;59620;2170413

The two proteins are aligned to SCI95225.1 in NCBI NR database. The organism in NCBI of SCI95225.1 is uncultured Clostridium sp. (https://www.ncbi.nlm.nih.gov/protein/SCI95225.1). However, the taxid of uncultured Clostridium sp. is 59620, but not 38018 or 2170413 in 38018;59620;2170413.

How can I limit the number of 'taxids' in the output of diamond blastp to one? It's important for me, because I want to get the accurate taxonomic annotation of proteins predicted by prodigal.

ZongzhiWu, Dec 03 '22 13:12

The NR database merges identical proteins into one entry, which means there are proteins identical to SCI95225.1 for which the other listed taxids are correct. In general, it's not possible to reduce a sequence to one taxid without losing information.

bbuchfink, Dec 25 '22 10:12
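Since the taxid field cannot be collapsed to one value without losing information, one option is to keep every taxid and expand each hit into one row per taxid for downstream processing. Below is a minimal Python sketch of that idea, not part of diamond itself. It assumes the output is tab-separated and that the semicolon-separated taxids sit in the last column, as in the rows quoted in the question. The function names `parse_taxids` and `explode_hits` are hypothetical, and the example row is shortened for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_taxids(field: str) -> List[int]:
    """Split a semicolon-separated taxid field such as '38018;59620;2170413'."""
    return [int(t) for t in field.strip().split(";") if t.strip()]


def explode_hits(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Yield one (query, subject, taxid) tuple per taxid of every hit.

    Assumption: columns are tab-separated, the query is column 1,
    the subject is column 2, and the taxids are in the last column.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        query, subject, taxid_field = cols[0], cols[1], cols[-1]
        for taxid in parse_taxids(taxid_field):
            yield query, subject, taxid


# Shortened example row (most alignment columns omitted for brevity).
example = ["contig1_28\tSCI95225.1\t100\t136\t38018;59620;2170413"]

for query, subject, taxid in explode_hits(example):
    print(query, subject, taxid)
```

Each hit then carries all taxids of the merged NR entry. Any further reduction, for example to a common ancestor, would need the NCBI taxonomy tree and is not shown here.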