
Inconsistent numbers of hits across split databases

Open ghost opened this issue 2 years ago • 5 comments

Dear Diamond Developers,

I have been using diamond blastp to search a database against itself (i.e. to build a similarity matrix from the pairwise hits). For technical reasons, I would like to perform this calculation in parts, which requires me to split the database into three sub-databases, perform the searches, and combine the results. When I compare the combined result to doing everything in one step, I see significant differences between the split and full calculations (~20% in terms of numbers of hits). Could you suggest parameters that would minimize this difference? Here are the particulars of my calculation:

  1. Inputs are AA sequences, usually of length 20-30, e.g.:

seq1 DSVNNIPSGTAVLGAGTASKLT
seq2 DRVSQSIYSNGDAVNIGNDMR
seq3 NSASQSVYSSGVVGSGGYQKVT
seq4 TSINNIRSNEREATGNFGNEKLT
seq5 YSGSPEHISRALRALFGNVLH

  2. I am interested in getting all hits with sequence identity >= 80% and coverage >= 90%
  3. The number of inputs ranges from 150,000 to 100,000,000
  4. My current command looks like

diamond blastp -q query_fasta -d target_seqdb -o result_file --evalue 100.0 --id 80 --query-cover 90 --sensitive --outfmt 6 qseqid sseqid
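The whole-versus-split difference described above can be quantified with a short script. What follows is a minimal Python sketch, assuming each run wrote tab-separated `qseqid sseqid` pairs as in the command above. The function names `parse_hits` and `compare_hits` are hypothetical helpers for illustration and are not part of DIAMOND.

```python
def parse_hits(lines):
    """Collect (qseqid, sseqid) pairs from tab-separated output lines."""
    hits = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:  # skip blank or malformed lines
            hits.add((fields[0], fields[1]))
    return hits


def compare_hits(full_hits, split_hit_sets):
    """Compare the one-step result with the union of the split results."""
    combined = set().union(*split_hit_sets)
    only_full = full_hits - combined    # hits lost by splitting
    only_split = combined - full_hits   # hits gained by splitting
    differing = len(only_full) + len(only_split)
    return {
        "shared": len(full_hits & combined),
        "only_full": only_full,
        "only_split": only_split,
        # fraction of differing pairs relative to the one-step result
        "relative_difference": differing / len(full_hits) if full_hits else 0.0,
    }
```

Each set can be built from a result file, for example `parse_hits(open("full.tsv"))`, where the file name is a placeholder. Listing `only_full` and `only_split` separately shows whether splitting mostly loses hits or mostly adds them, which helps narrow down the cause.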

Grateful for your feedback,
Daron

ghost, Oct 27 '23 07:10

This is probably caused by the ranking heuristic; you can try using --no-ranking.

bbuchfink, Oct 27 '23 08:10

Thanks for your lightning fast response! I quickly checked the effect of --no-ranking but there was no effect on the difference (whole vs split). If you have any other ideas, I will definitely try them all.

daron-m-standley, Oct 27 '23 09:10

Database size also affects the evalue, so try to fix the size using --dbsize.
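As a rough illustration of why database size matters: in standard BLAST-style statistics, the expected number of chance hits for a bit score S' is E = m * n * 2^(-S'), where m is the query length and n is the database size in letters. The Python sketch below uses that textbook formula only. It is an assumption that this approximates DIAMOND's behaviour closely enough for illustration, since the real computation may apply length corrections, and `expected_evalue` is a hypothetical helper name.

```python
import math


def expected_evalue(bit_score, query_len, db_letters):
    """Textbook BLAST-style e-value: E = m * n * 2^(-S')."""
    return query_len * db_letters * math.pow(2.0, -bit_score)


# The same alignment scored against a full database and a 30% sub-database.
# The sizes and the bit score are made-up values for illustration.
full_db_letters = 1_000_000
part_db_letters = 300_000
e_full = expected_evalue(30.0, 22, full_db_letters)
e_part = expected_evalue(30.0, 22, part_db_letters)

# The e-value scales linearly with database size, so the ratio equals
# the size ratio of the two databases.
ratio = e_part / e_full
```

Under this model an identical alignment gets a smaller e-value in a sub-database than in the full database, so a hit near the --evalue cutoff can pass in one run and fail in the other. Passing the same --dbsize value to every run removes that particular source of difference.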

bbuchfink, Oct 27 '23 09:10

Hmm... --dbsize 10000 also had no effect. I am uploading my test case in case you have time to try it. I usually do a 30/70% split when I run it in parts, as illustrated in the attachments (all.fa.zip, Screenshot 2023-10-27 at 18 49 33). I will play more with the --dbsize parameter tomorrow in case I picked a bad value.

daron-m-standley, Oct 27 '23 09:10

Just an additional clue: I could reproduce the all-vs-all result using the program LAST if I cranked up the parameter m, which represents the "maximum initial matches per query position", to 10000. I don't know if there is a similar parameter in diamond...

daron-m-standley, Oct 28 '23 01:10