diamond icon indicating copy to clipboard operation
diamond copied to clipboard

Problem with --max-target-seqs/-k

Open ATVincent opened this issue 1 year ago • 7 comments

Hello,

I believe there is a problem with the --max-target-seqs/-k option. If I understand how it works, this option should allow us to find the "best-hit" since the results are sorted before displaying the best (https://github.com/bbuchfink/diamond/issues/232). According to my tests, the option seems to display the first result instead, without doing the sorting. Here is an example where I have a read in query and two proteins in the database.

@M06545:94:000000000-K8835:1:2113:7581:17578 2:N:0:1 AGAAAAGAAAGGCAAGTAAAGAGTATGGTTTATATAATCAATGTAAGAAACTAAATGATGATGAATTATTTCGCTTACTTGATGATCACAATTCCTTGAAAAGGATTTCATCTGCCAGAGTAT + CCCCCFGGGFGGGGGGGGGGGGGGGGGGEGGGGGGFGGGFEGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFAEFGGGGGGGGGGGGGGGDGEFEGGGGGGG,CECFFDFGG

K12-WT_00231 Protein YibA MSNTYQKRKASKEYGLYNKCKKLNDDELFRLLDDRNSLKRISSARVLQLRGGQDAVRLAI EFCTDKNYIRRDIGAFILGQI

K12-WT_00113 Protein YibA MSNTYQKRKASKEYGLYNQCKKLNDDELFRLLDDHNSLKRISSARVLQLRGGQDAVRLAI EFCSDKNYIRRDIGAFILGQIKICKKCEDNVFNILNNMALNDKSACVRATAIESTAQRCK KNPIYSPKIVEQSQITAFDKSTNVRRATAFAISVINDKATIPLLINLLKDPNGDVRNWAA FAININKYDNSDIRDCFVEMLQDKNEEVRIEAIIGLSYRKDKRVLSVLCDELKKNTVYDD IIEAAGELGDKTLLPVLDTMLYKFDDNEIITSAIDKLKRS

Here is the result of Diamond, without the -k 1 option:

K12-WT_00231 Protein YibA Length=81

Score = 77.8 bits (190), Expect = 1.82e-22 Identities = 38/40 (95%), Positives = 39/40 (97%), Gaps = 0/40 (0%) Frame = -3

Query 121 KRKASKEYGLYNQCKKLNDDELFRLLDDHNSLKRISSARV 2 KRKASKEYGLYN+CKKLNDDELFRLLDD NSLKRISSARV Sbjct 7 KRKASKEYGLYNKCKKLNDDELFRLLDDRNSLKRISSARV 46

K12-WT_00113 Protein YibA Length=280

Score = 82.4 bits (202), Expect = 1.91e-22 Identities = 40/40 (100%), Positives = 40/40 (100%), Gaps = 0/40 (0%) Frame = -3

Query 121 KRKASKEYGLYNQCKKLNDDELFRLLDDHNSLKRISSARV 2 KRKASKEYGLYNQCKKLNDDELFRLLDDHNSLKRISSARV Sbjct 7 KRKASKEYGLYNQCKKLNDDELFRLLDDHNSLKRISSARV 46

Normally, with -k 1, the only hit displayed should be with the K12-WT_00113 sequence, since its score is higher than with the K12-WT_00231 sequence. However, with the -k 1 option, Diamond only shows me the result with the K12-WT_00231 sequence.

Here is the full command:

diamond blastx --query CONCAT.fastq.gz --db database.dmnd -f 0 -k 1 -e 1e-10 -c 1 -p 2 --out test.diamond

Thank you !

Antony

ATVincent avatar Mar 15 '23 18:03 ATVincent

Hits are not sorted by score but by evalue, as you can see the first hit has a lower evalue despite its lower score. That happens sometimes since the target length is also taken into account for the e-value computation.

bbuchfink avatar Mar 16 '23 13:03 bbuchfink

Thanks, I understand. Is it possible to change the way diamond sorts hits? This behavior induces several false positives in my pipeline.

ATVincent avatar Mar 16 '23 14:03 ATVincent

If you use the --top option hits are sorted by score instead.

bbuchfink avatar Mar 16 '23 17:03 bbuchfink

If I am not wrong, the "--top" option takes a percentage. Is it possible with this option to have only the best hit?

ATVincent avatar Mar 16 '23 22:03 ATVincent

You can use --top 0, although if there are several best hits with equal score it will report all of them.

bbuchfink avatar Mar 17 '23 08:03 bbuchfink

With -k 1 is there any other sorting after e-values. For example, if the two top hits have a equal e-values is there anything to break the tie?

Thanks!

rohansachdeva avatar Jun 15 '23 22:06 rohansachdeva

With -k 1 is there any other sorting after e-values. For example, if the two top hits have a equal e-values is there anything to break the tie?

It is sorted by bit score then and if equal, by the order of the targets in the database.

bbuchfink avatar Jun 17 '23 13:06 bbuchfink