Diamond parameters to run on HPC

Hi, I am using our local HPC facilities to run large-scale annotation of metatranscriptomic NGS samples against the NCBI NR database. Each sample is roughly 1 GB (~15 million reads) and is mapped against NR (~350 GB of protein sequences).

The current running time exceeds 48 h, and I would like to reduce it by tuning the number of threads, the memory, and the -c and -b parameters. Current allocation: SLURM_mem = 32GB, SLURM_cpus = 32.

diamond blastx -d NR.dmnd -q input.fastq.gz -p 32 -c1 -k 1

Can you recommend optimal values for this input data? This would help me a lot to avoid wasting time on parameter benchmarks.

Best, Michael

Feb 13 '25 14:02 mweberr

-c1 is good. You can try a higher block size like -b6 if you can assign more memory to a task. 32 threads per task seems reasonable but could be increased. Two options worth a try are --iterate and -g (like -g 100). I can't tell you how much that would help, though; you should benchmark that yourself.
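As a rough sketch of how these suggestions could be put into a SLURM job script: the resource requests, the output file name (sample.m8) and the tabular output format (-f 6) below are illustrative assumptions, not values recommended in this thread. The script only prints the command unless RUN=1 is set, so each candidate option can be checked and benchmarked in a separate run.

```shell
#!/bin/bash
# Illustrative resource requests; adjust to your cluster.
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --time=48:00:00

# Use the CPU count granted by SLURM, falling back to 32 outside a job.
threads="${SLURM_CPUS_PER_TASK:-32}"

# Benchmark one extra option per run; alternative: extra="-g 100"
extra="--iterate"

# -c1 and -b6 follow the reply above; -o/-f values are assumptions.
cmd="diamond blastx -d NR.dmnd -q input.fastq.gz -p $threads -c1 -b6 -k 1 $extra -o sample.m8 -f 6"

# Print the command; set RUN=1 in the environment to actually execute it.
echo "$cmd"
if [ "${RUN:-0}" = "1" ]; then
  $cmd
fi
```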

Feb 13 '25 14:02 bbuchfink

Thanks for the rapid feedback! I will try to increase SLURM_mem to 128 GB. With more memory, is even -b8 reasonable?

The -g parameter is very interesting, as it affects the local alignment. Should it be fitted to the length of the input sequences?

Feb 13 '25 15:02 mweberr

-b8 should be slightly faster than -b6, but the gains are probably pretty marginal. The argument of -g is the number of targets that will be extended for each query, so no, it does not really depend on the sequence length.
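To judge which block size fits a given memory request, a back-of-the-envelope estimate may help. The sketch below assumes the rule of thumb from the DIAMOND documentation that memory use is roughly six times the block size in GB; the real footprint also depends on -c and the data, so treat the numbers as a starting point only. The function name est_mem_gb is made up for this example.

```shell
# Assumption: memory use is roughly 6 x block size (in GB).
est_mem_gb() {
  echo $(( $1 * 6 ))
}

# Compare the default block size (2) with the values discussed above.
for b in 2 6 8; do
  echo "-b$b: roughly $(est_mem_gb "$b") GB"
done
```

Under this assumption, -b6 would not fit into a 32 GB allocation, while both -b6 and -b8 would fit comfortably into 128 GB.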

Feb 13 '25 15:02 bbuchfink

Another hint: the best way to speed this up would be to cluster the database first. Diamond now has a feature for this.

Feb 14 '25 12:02 bbuchfink

Thanks for the new feature! Two questions: Is the --approx-id parameter an approximation similar to the CD-HIT -c parameter (identity threshold)?

Has anyone already benchmarked clustering of the NR database? For this to work, it would be important that clusters contain sequences of the same genus in most cases.

Feb 14 '25 13:02 mweberr

I just realised that fastq.gz input is not supported by the current version 2.11. For performance reasons, is it better to use FASTA input?

Feb 14 '25 14:02 mweberr

Is the --approx-id parameter an approximation similar to the CD-HIT -c parameter (identity threshold)?

Yes.

Has anyone already benchmarked clustering of the NR database?

These are the numbers of clusters obtained for 826052433 input sequences with diamond linclust; the runtime on a 128-core node is between 6 and 20 h.

id%	clusters
90	380670766
80	244939256
70	166007477
60	116931790
50	86216486
40	66605982
30	54766718
20	50723066
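To see what these counts mean for database size, the table above can be turned into the share of sequences kept as cluster representatives and the corresponding fold reduction. This is a small sketch using only the numbers quoted in this post; the function name reduction is made up for the example, and a smaller database does not translate one-to-one into shorter search times.

```shell
total=826052433   # number of input sequences reported above

# Print id%, percentage of sequences kept as representatives, and fold reduction.
reduction() {
  awk -F'\t' -v total="$total" \
    '{ printf "%s\t%.1f%%\t%.1fx\n", $1, 100 * $2 / total, total / $2 }'
}

# id% and cluster counts copied from the table above.
printf '%s\t%s\n' 90 380670766 80 244939256 70 166007477 60 116931790 \
                  50 86216486 40 66605982 30 54766718 20 50723066 | reduction
```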

You can also check out this paper https://journals.asm.org/doi/10.1128/msystems.01408-21

For this to work, it would be important that clusters contain sequences of the same genus in most cases.

You would most likely have to cluster at a conservative 80-90% cutoff and check whether that is the case; I haven't looked into it.
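One way such a check could look, as a sketch only: it assumes the clustering output is a two-column TSV (representative, member) and that you have prepared a second TSV mapping each accession to a genus, for example from the NCBI taxonomy. The file names clusters.tsv and genus.tsv and the function name genus_purity are hypothetical; members missing from the genus table are ignored.

```shell
# $1: clusters TSV (representative<TAB>member)
# $2: hypothetical lookup table (accession<TAB>genus)
genus_purity() {
  awk -F'\t' '
    NR == FNR { genus[$1] = $2; next }        # first file: load genus lookup
    ($2 in genus) {                           # second file: cluster members
      key = $1 SUBSEP genus[$2]
      if (!(key in seen)) { seen[key] = 1; ngenus[$1]++ }
    }
    END {
      # Count clusters whose annotated members all share one genus.
      for (rep in ngenus) { total++; if (ngenus[rep] == 1) pure++ }
      printf "%d of %d clusters contain a single genus\n", pure, total
    }' "$2" "$1"
}
```

Usage would be along the lines of `genus_purity clusters.tsv genus.tsv`, repeated for each identity cutoff you want to compare.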

I just realised that fastq.gz input is not supported by the current version 2.11. For performance reasons, is it better to use FASTA input?

That should actually be supported. Can you tell me how to reproduce the error? You can convert to FASTA first; it should not matter for performance.
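If you go the conversion route, a minimal sketch with standard tools could look like this. It assumes plain four-line FASTQ records (no wrapped sequence lines); the function name fastq_gz_to_fasta is made up for the example.

```shell
# Decompress a gzipped FASTQ file and write FASTA to stdout.
# Assumes each record is exactly four lines: header, sequence, '+', qualities.
fastq_gz_to_fasta() {
  gzip -dc "$1" | awk '
    NR % 4 == 1 { print ">" substr($0, 2) }   # header: replace leading @ with >
    NR % 4 == 2 { print }                     # sequence line
  '
}
```

For example, `fastq_gz_to_fasta input.fastq.gz > input.fasta`, after which input.fasta can be passed to diamond with -q.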

Feb 17 '25 10:02 bbuchfink