diamond
diamond copied to clipboard
Problem with --max-target-seqs/-k
Hello,
I believe there is a problem with the --max-target-seqs/-k option. If I understand how it works, this option should allow us to find the "best-hit" since the results are sorted before displaying the best (https://github.com/bbuchfink/diamond/issues/232). According to my tests, the option seems to display the first result instead, without doing the sorting. Here is an example where I have a read in query and two proteins in the database.
@M06545:94:000000000-K8835:1:2113:7581:17578 2:N:0:1 AGAAAAGAAAGGCAAGTAAAGAGTATGGTTTATATAATCAATGTAAGAAACTAAATGATGATGAATTATTTCGCTTACTTGATGATCACAATTCCTTGAAAAGGATTTCATCTGCCAGAGTAT + CCCCCFGGGFGGGGGGGGGGGGGGGGGGEGGGGGGFGGGFEGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFAEFGGGGGGGGGGGGGGGDGEFEGGGGGGG,CECFFDFGG
K12-WT_00231 Protein YibA MSNTYQKRKASKEYGLYNKCKKLNDDELFRLLDDRNSLKRISSARVLQLRGGQDAVRLAI EFCTDKNYIRRDIGAFILGQI
K12-WT_00113 Protein YibA MSNTYQKRKASKEYGLYNQCKKLNDDELFRLLDDHNSLKRISSARVLQLRGGQDAVRLAI EFCSDKNYIRRDIGAFILGQIKICKKCEDNVFNILNNMALNDKSACVRATAIESTAQRCK KNPIYSPKIVEQSQITAFDKSTNVRRATAFAISVINDKATIPLLINLLKDPNGDVRNWAA FAININKYDNSDIRDCFVEMLQDKNEEVRIEAIIGLSYRKDKRVLSVLCDELKKNTVYDD IIEAAGELGDKTLLPVLDTMLYKFDDNEIITSAIDKLKRS
Here is the result of Diamond, without the -k 1 option:
K12-WT_00231 Protein YibA Length=81
Score = 77.8 bits (190), Expect = 1.82e-22 Identities = 38/40 (95%), Positives = 39/40 (97%), Gaps = 0/40 (0%) Frame = -3
Query 121 KRKASKEYGLYNQCKKLNDDELFRLLDDHNSLKRISSARV 2 KRKASKEYGLYN+CKKLNDDELFRLLDD NSLKRISSARV Sbjct 7 KRKASKEYGLYNKCKKLNDDELFRLLDDRNSLKRISSARV 46
K12-WT_00113 Protein YibA Length=280
Score = 82.4 bits (202), Expect = 1.91e-22 Identities = 40/40 (100%), Positives = 40/40 (100%), Gaps = 0/40 (0%) Frame = -3
Query 121 KRKASKEYGLYNQCKKLNDDELFRLLDDHNSLKRISSARV 2 KRKASKEYGLYNQCKKLNDDELFRLLDDHNSLKRISSARV Sbjct 7 KRKASKEYGLYNQCKKLNDDELFRLLDDHNSLKRISSARV 46
Normally, with -k 1, the only hit displayed should be with the K12-WT_00113 sequence, since its score is higher than with the K12-WT_00231 sequence. However, with the -k 1 option, Diamond only shows me the result with the K12-WT_00231 sequence.
Here is the full command:
diamond blastx --query CONCAT.fastq.gz --db database.dmnd -f 0 -k 1 -e 1e-10 -c 1 -p 2 --out test.diamond
Thank you !
Antony
Hits are not sorted by score but by evalue, as you can see the first hit has a lower evalue despite its lower score. That happens sometimes since the target length is also taken into account for the e-value computation.
Thanks, I understand. Is it possible to change the way diamond sorts hits? This behavior induces several false positives in my pipeline.
If you use the --top
option hits are sorted by score instead.
If I am not wrong, the "--top" option takes a percentage. Is it possible with this option to have only the best hit?
You can use --top 0
, although if there are several best hits with equal score it will report all of them.
With -k 1
is there any other sorting after e-values. For example, if the two top hits have a equal e-values is there anything to break the tie?
Thanks!
With
-k 1
is there any other sorting after e-values. For example, if the two top hits have a equal e-values is there anything to break the tie?
It is sorted by bit score then and if equal, by the order of the targets in the database.