Inconsistent numbers of hits across split databases
Dear Diamond Developers,
I have been using diamond blastp to search a database against itself (i.e. to build a similarity matrix from the pairwise hits). For technical reasons, I would like to perform this calculation in parts, which requires me to split the database into three sub-databases, perform the searches, and combine the results. When I compare the combined result to a single full run, I see significant differences between the split and full calculations (~20% in terms of the number of hits). Could you suggest parameters that would minimize this difference? Here are the particulars of my calculation:
- Inputs are AA sequences, usually of length 20-30 e.g.:
seq1 DSVNNIPSGTAVLGAGTASKLT
seq2 DRVSQSIYSNGDAVNIGNDMR
seq3 NSASQSVYSSGVVGSGGYQKVT
seq4 TSINNIRSNEREATGNFGNEKLT
seq5 YSGSPEHISRALRALFGNVLH
- I am interested in getting all hits with sequence identity >= 80% and coverage >= 90%
- The number of inputs ranges from 150,000 to 100,000,000
- My current command looks like
diamond blastp -q query_fasta -d target_seqdb -o result_file --evalue 100.0 --id 80 --query-cover 90 --sensitive --outfmt 6 qseqid sseqid
Grateful for your feedback,
Daron
This is probably caused by the ranking heuristic; you can try using --no-ranking.
Thanks for your lightning fast response! I quickly checked the effect of --no-ranking but there was no effect on the difference (whole vs split). If you have any other ideas, I will definitely try them all.
Database size also affects the e-value, so try fixing the size with --dbsize.
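As a rough illustration of why database size matters here: under BLAST-style statistics, the e-value of a given alignment score scales approximately linearly with the size of the search space, so the same alignment can pass an e-value cutoff against a small sub-database and sit right at (or beyond) the cutoff against the full one. The sketch below assumes that simple linear scaling and ignores effective-length corrections; the function name and all numbers are hypothetical and are not part of DIAMOND.

```python
def rescale_evalue(evalue, db_letters_from, db_letters_to):
    """Rescale an e-value from one database size to another.

    Assumes the e-value is directly proportional to the number of
    letters in the database (a simplification, not DIAMOND's exact model).
    """
    return evalue * db_letters_to / db_letters_from


# Hypothetical numbers: a hit reported with e-value 30 against a
# sub-database of 300,000 letters, rescaled to a 1,000,000-letter database.
e_sub = 30.0
e_full = rescale_evalue(e_sub, 300_000, 1_000_000)
print(e_full)
```

If this scaling holds, passing the same --dbsize value to every run (split and full) should put all e-values on a common scale.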
Hmm... --dbsize 10000 also had no effect. I am uploading my test case in case you have time to try it. I usually do a 30/70% split when I run it in parts, as illustrated in the attached file.
all.fa.zip
I will play more with the dbsize parameter tomorrow in case I picked a bad value
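For anyone who wants to quantify the whole-vs-split difference, here is a minimal Python sketch of the comparison. It assumes each run was written as tab-separated two-column output (qseqid, sseqid), as produced by --outfmt 6 qseqid sseqid. The function names are hypothetical and the inline records are toy data for illustration, not real DIAMOND output.

```python
import csv
import io


def read_hits(handle):
    """Read tab-separated (qseqid, sseqid) rows into a set of pairs."""
    hits = set()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            hits.add((row[0], row[1]))
    return hits


def compare_hits(full, split_parts):
    """Compare the full-run hit set with the union of the split-run hit sets."""
    combined = set().union(*split_parts)
    return {
        "full": len(full),
        "combined": len(combined),
        "only_full": full - combined,    # hits lost by splitting
        "only_split": combined - full,   # hits gained by splitting
    }


# Toy data standing in for the result files of the full and split runs.
full = read_hits(io.StringIO("seq1\tseq1\nseq1\tseq4\nseq4\tseq1\n"))
part_a = read_hits(io.StringIO("seq1\tseq1\n"))
part_b = read_hits(io.StringIO("seq1\tseq4\n"))

report = compare_hits(full, [part_a, part_b])
print(report["full"], report["combined"], sorted(report["only_full"]))
```

With real files, replace the io.StringIO objects with open(result_file, newline="") handles. Looking at which pairs land in only_full versus only_split should show whether the split runs are losing hits, gaining them, or both.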
Just an additional clue: I could reproduce the all-vs-all result using the program LAST if I cranked up the parameter m, which represents the "maximum initial matches per query position", to 10000. I don't know if there is a similar parameter in diamond...