Big difference for different versions

Open 473021677 opened this issue 3 years ago • 10 comments

Hi, I am using diamond to run an all-vs-all blastp analysis on a protein file of about 1 GB, and I have run into a problem. With diamond v0.9.8.109, the resulting daa and m8 files are 97.85 GB and 37.82 GB, respectively. However, with diamond v2.0.6, the resulting daa and m8 files are 97.85 GB and 37.82 GB, respectively. That is a big difference between v0.9.8.109 and v2.0.6, and I don't know what is wrong. Could you help me? I would really appreciate it. Thanks very much.

Best regards

473021677 avatar Apr 19 '21 00:04 473021677

You listed the same file sizes for both runs, I assume that is an error? You can try to reproduce this problem on a smaller sequence set so I can take a look (please also include command lines). Also, much has changed about the algorithm between these 2 versions, so I would not expect them to produce identical results.
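One way to build a smaller test set is to keep only the first few records of the FASTA file. This is a minimal sketch, not part of diamond: the helper name `fasta_head` is hypothetical, and taking the first records is not a random sample, so it may not reproduce an effect that only shows up on the full data.

```shell
# Hypothetical helper: print the first n records of a FASTA file
# (each header line starting with ">" plus the sequence lines that follow it).
# Arguments: <n> <fasta file>
fasta_head() {
    awk -v n="$1" '/^>/ { if (++seen > n) exit } { print }' "$2"
}
```

For example, `fasta_head 10000 Combined_archaea.fasta > subset.fasta` would write the first 10000 records to `subset.fasta` (an assumed output name), which can then be passed to makedb and blastp in place of the full file.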

bbuchfink avatar Apr 19 '21 07:04 bbuchfink

Sorry. With diamond v0.9.8.109, the resulting daa and m8 files are 97.85 GB and 37.82 GB, respectively. However, with diamond v2.0.6, the resulting daa and m8 files are 143.63 GB and 5.40 GB, respectively. That is a big difference between v0.9.8.109 and v2.0.6. The command for creating the database is "diamond makedb --in Combined_archaea.fasta -d Combined_archaea" and the command for blastp is "diamond blastp -d Combined_archaea -q Combined_archaea.fasta -a Combined_archaea_diamond -p 20 -e 1e-10 --id 25 -k 250". The diamond view command is "diamond view -a Combined_archaea -o Combined_archaea.m8". When I use a smaller sequence dataset (12.78M), this problem does not appear and the resulting file sizes are similar. Thanks.

Best regards,

473021677 avatar Apr 19 '21 08:04 473021677

Try to run diamond view also with -k 250 when using v2.0.6, that may explain the difference in m8 file size.
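A sketch of that suggestion, using only the `view` options quoted in this thread (`-a`, `-o`, `-k`). The wrapper name `daa_to_m8` and the `DIAMOND` variable are hypothetical, added so the binary can be swapped between versions.

```shell
# Binary name is an assumption; set DIAMOND to point at a specific build.
DIAMOND="${DIAMOND:-diamond}"

# Hypothetical wrapper: convert a DAA archive to tabular output,
# passing the same -k limit that was used for the blastp run.
# Arguments: <daa file> <output m8 file> [max target sequences, default 250]
daa_to_m8() {
    "$DIAMOND" view -a "$1" -o "$2" -k "${3:-250}"
}
```

It would be called as `daa_to_m8 <daa file> Combined_archaea.m8`, with the DAA file being whatever the blastp run wrote.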

bbuchfink avatar Apr 19 '21 08:04 bbuchfink

I ran diamond v2.0.6 view with -k 250 and the resulting file sizes are similar to those from diamond v0.9.8.109. I will always run diamond v2.0.6 view with -k 250 from now on. I would still like to know why the resulting file sizes were similar for the smaller dataset (12.78M). Thanks very much.

Best regards

473021677 avatar Apr 19 '21 08:04 473021677

I also observe a significant speed loss between 0.9 and 2.0. Example runs of diamond 0.9.14 and 2.0.9:

Query: 518 AA sequences
DB: NCBI NR (.dmnd files of 128G, generated by the corresponding version)

Running on a 20-core HT server with 128G RAM.

0.9.14:
diamond-0.9.14 blastp --db NR-0.9.14 -q ./query.fasta --out ./report-0.9.txt -p 40 1>./log.0.9.txt 2>&1
Time 12m, 12617 HSPs, 513 query sequences aligned

2.0.9:
diamond-2.0.9 blastp --db NR-2.0.9 -q ./query.fasta --out ./report-2.0.txt -p 40 -b5 -c1 1>log.2.0.txt 2>&1
Time 25m, 12711 HSPs, 516 query sequences aligned

starling13 avatar May 06 '21 12:05 starling13

I would guess this is due to the runtime repeat masking, so try running with --masking 0. Diamond is not very efficient for such small query files, but improvements in this regard are upcoming.
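As a sketch, this is the 2.0.9 command from the report above with `--masking 0` appended. The function name `blastp_no_masking` and the `DIAMOND` variable are hypothetical; the database and file names are the ones quoted in that report.

```shell
# Binary name is an assumption; set DIAMOND=diamond-2.0.9 to match the report.
DIAMOND="${DIAMOND:-diamond}"

# Hypothetical wrapper: re-run the 2.0.9 search with runtime repeat masking disabled.
blastp_no_masking() {
    "$DIAMOND" blastp --db NR-2.0.9 -q ./query.fasta --out ./report-2.0.txt \
        -p 40 -b5 -c1 --masking 0
}
```

Timing the call before and after adding the option (for instance with the shell's `time`) would show how much of the gap masking accounts for.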

bbuchfink avatar May 06 '21 13:05 bbuchfink

Thank you for the reply. With --masking 0, the time decreases from 25 to 20 minutes for version 2.0.9 and stays unchanged (about 10-12m) for 0.9.14.

starling13 avatar May 11 '21 11:05 starling13

I'm not sure what else could be causing this difference. Optimizations for small query files are available but still in beta stage, as described here: https://github.com/bbuchfink/diamond/issues/419#issuecomment-831154792 It will probably be a couple of weeks until I release this officially.

bbuchfink avatar May 13 '21 13:05 bbuchfink

v2.0.11 now contains some optimizations for small query files. You can also get the old behaviour back using the option --algo ctg, which may or may not improve performance depending on the file size. Note that this option should only be used for small query files.
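A minimal sketch of a small-query run with that option, assuming v2.0.11 or later. The function name `blastp_small_query` and the `DIAMOND` variable are hypothetical; the other options are the ones already used earlier in this thread.

```shell
# Binary name is an assumption; set DIAMOND to point at a v2.0.11+ build.
DIAMOND="${DIAMOND:-diamond}"

# Hypothetical wrapper: search a small query file with the old behaviour (--algo ctg).
# Arguments: <database> <query fasta> <output file> [threads, default 40]
blastp_small_query() {
    "$DIAMOND" blastp --db "$1" -q "$2" --out "$3" -p "${4:-40}" --algo ctg
}
```

Since the option may or may not help depending on file size, it is worth timing the same query with and without `--algo ctg` before adopting it.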

bbuchfink avatar Jul 05 '21 09:07 bbuchfink