diamond execution time increases as the number of CPU threads increases

Hello Team,

I'm using diamond v2.1.11.165 to search a small query of 7,500 sequences against the full UniProt database (250M sequences), indexed first with BLAST makeblastdb and then with the diamond prepdb option.

I ran the same diamond blastp command (shown below) with different numbers of CPUs/threads. I expected that more CPUs/threads would make the run finish faster, but I see the opposite.

Option 1 - 8 CPUs (max 44 threads): time taken 2 hrs
diamond blastp --query query.faa --db uniprot.fasta --out diamond_out_vs_8cpu_7Kseqs.tsv --log --very-sensitive

Option 2 - 10 CPUs (max 120 threads): time taken 2 hrs 20 mins
diamond blastp --query query.faa --db uniprot.fasta --out diamond_out_vs_10cpu_7Kseqs.tsv --log --very-sensitive --threads 120

Option 3 - 12 CPUs (max 120 threads): time taken 2 hrs 45 mins
diamond blastp --query query.faa --db uniprot.fasta --out diamond_out_vs_12cpu_7Kseqs.tsv --log --very-sensitive --threads 120

Let me know if I'm missing anything.

I'm trying to achieve fast annotation with either --very-sensitive or --ultra-sensitive.

In contrast to the results above, I also saw the execution time decrease as the number of CPUs increased, from 2 CPUs (16 threads, 4 hrs) and 4 CPUs (16 threads, 2 hrs) to 8 CPUs (44 threads, 1 hr), for a query of 200 sequences against the 250M database.

Another general question: for the last few days I have been trying to get the fastest possible search out of diamond, trying multiple options/arguments. Because of the contradictory results above, I'm not able to move forward. Is there a specific argument I'm missing that would speed things up further? My ideal scenario is to search 500 query sequences against the 250M database in 30 mins without compromising on sensitivity (--very-sensitive/--ultra-sensitive).

Thanks for your time

Regards, Vijay N

Apr 11 '25 18:04 narsapuramvijaykumar

@bbuchfink Any inputs would be highly appreciated. Thanks in advance.

May 06 '25 19:05 narsapuramvijaykumar

I don't know why it would take longer with more threads. I would have to investigate further, ideally using your data. I'm also not sure about your hardware setup. Does your machine have 12 physical CPUs that each run 10 threads?

To optimize performance, you can try -c1 and a higher block size (option -b), and turn off repeat masking (--masking 0). The best way to speed this up is to combine more queries into one chunk. If that's not an option, you can also have multiple diamond processes search different parts of the database for the same query file, instead of using the maximum number of threads for one process. That can be done by manually splitting up the database or by using the --multiprocessing feature.
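As a rough sketch of the first suggestion, the original command with -c1, a larger -b, and --masking 0 might look like the following. The block size of 4 and the output file name are illustrative assumptions, not values recommended in this thread; a larger block size needs more memory. The script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: BLOCK_SIZE and OUT are assumed values for illustration.
QUERY=query.faa
DB=uniprot.fasta
OUT=diamond_out_tuned.tsv
BLOCK_SIZE=4   # assumed; larger -b values use more memory

# Build the command as a string so it can be inspected before running.
CMD="diamond blastp --query $QUERY --db $DB --out $OUT --very-sensitive -c1 -b $BLOCK_SIZE --masking 0"

# Print by default; set RUN=1 in the environment to actually execute it.
if [ "${RUN:-0}" = "1" ]; then
    $CMD
else
    echo "$CMD"
fi
```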

May 07 '25 15:05 bbuchfink

@narsapuramvijaykumar I think you are oversubscribing your CPUs. What is your runtime when you match your --threads to the number of vCPUs you actually have?

You can also try adjusting --index-chunks 4 from the default of 4 to something lower like 1 or 2.
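One way to try this, as a minimal sketch: read the number of logical CPUs the system reports and pass that value to --threads instead of a hard-coded 120. The output file name is an assumption for illustration, and inside a container or job scheduler the reported count may differ from what is actually allocated.

```shell
#!/bin/sh
# Logical CPUs visible to this process; nproc is GNU coreutils, getconf is the fallback.
NCPU=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Same search as in the original post, with --threads matched to the CPU count.
# The output file name is hypothetical.
CMD="diamond blastp --query query.faa --db uniprot.fasta --out diamond_out_matched.tsv --log --very-sensitive --threads $NCPU"

# Print the command for inspection; run it with: eval "$CMD"
echo "$CMD"
```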

May 10 '25 01:05 heshpdx

I don't know why it would take longer with more threads. I would have to investigate further, ideally using your data. Also not sure about your hardware setup. Does your machine have 12 physical CPUs that each run 10 threads?

I'm using CentOS 7, and yes, each CPU has 12 threads.

To optimize performance, you can try -c1 and a higher block size (option -b), turn off repeat masking (--masking 0). The best way to speed this up is combine more queries into one chunk. But if that's not an option you can also try to have multiple diamond processes search different parts of the database for the same query file, instead of using the maximum number of threads for one process. That can be done by manually splitting up the database or using the --multiprocessing feature.

I tried some of the suggested options. Since I'm using the --very-sensitive option, they barely improve the performance.

Thanks for your inputs.

May 17 '25 17:05 narsapuramvijaykumar

@narsapuramvijaykumar I think you are oversubscribing your CPUs. What is your runtime when you match your --threads to the number of vCPUs you actually have?

You can also try adjusting --index-chunks 4 from the default of 4 to something lower like 1 or 2.

My --index-chunks is 1 by default because I'm using the --very-sensitive argument. Thanks for your inputs.

May 17 '25 17:05 narsapuramvijaykumar