Why are there multiple taxids in the output of diamond blastp?
There are multiple taxids in my diamond blastp output. For example, here are the blastp results for two proteins. The last column contains the taxids.

contig1_28 SCI95225.1 100 136 0 0 1 136 1 136 6.06e-94 277 136 137 99.3 SCI95225.1 Uncharacterised protein [uncultured Clostridium sp.] 38018;59620;2170413
contig2_12 SCI92976.1 100 34 0 0 1 34 1 34 3.93e-13 65.9 34 35 97.1 SCI92976.1 Uncharacterised protein [uncultured Clostridium sp.] 38018;59620;2170413

The first protein is aligned to SCI95225.1 in the NCBI NR database. NCBI lists the organism of SCI95225.1 as uncultured Clostridium sp. (https://www.ncbi.nlm.nih.gov/protein/SCI95225.1). However, the taxid of uncultured Clostridium sp. is 59620, not 38018 or 2170413, yet all three appear in 38018;59620;2170413.

How can I limit the number of taxids in the diamond blastp output to one? This matters to me because I want accurate taxonomic annotation of proteins predicted by prodigal.
The NR database merges identical proteins into a single entry, so the extra taxids belong to other sequences that are identical to SCI95225.1, and they are correct for those sequences. In general, it is not possible to reduce an entry to a single taxid without losing information.
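If the taxids still need to be handled one by one downstream, a post-processing step can split the field. The following is only a minimal sketch, not a DIAMOND feature: it assumes the output is tab-separated with the semicolon-separated taxids in the last column, and the function names `parse_taxids` and `pick_taxid` are made up for illustration. Keeping only one taxid discards the others, which is the information loss described above.

```python
def parse_taxids(line):
    """Return the taxids found in the last tab-separated column of one hit line."""
    last_field = line.rstrip("\n").split("\t")[-1]
    return [int(t) for t in last_field.split(";") if t.strip()]


def pick_taxid(taxids, expected):
    """Return `expected` if the hit carries it, otherwise None.

    Returning None avoids silently guessing a taxid that the hit does not list.
    """
    return expected if expected in taxids else None


# Shortened, hypothetical hit line: only the query, subject, and taxid columns.
hit = "contig1_28\tSCI95225.1\t38018;59620;2170413"

taxids = parse_taxids(hit)
chosen = pick_taxid(taxids, 59620)
print(taxids, chosen)
```

Here 59620 is the taxid of uncultured Clostridium sp. from the question; whether it is the right choice for a given protein depends on the analysis.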