Diamond parameters to run on HPC

Open mweberr opened this issue 10 months ago • 7 comments

Hi, I am using our local HPC facilities to perform large-scale annotation runs of metatranscriptomic NGS samples against the NCBI NR database. Each sample is roughly 1 GB (~15 million reads) and is mapped against NR (~350 GB of protein sequences).

The current run time exceeds 48 h, and I would like to reduce it by tuning the thread count, memory, and the -c and -b parameters. My current allocation is SLURM_mem = 32GB, SLURM_cpus = 32.

diamond blastx -d NR.dmnd -q input.fastq.gz -p 32 -c1 -k 1

Can you recommend optimal values for this input data? This would help me a lot to avoid wasting time on parameter benchmarks.

Best, Michael

mweberr, Feb 13 '25 14:02

-c1 is good. You can try a higher block size like -b6 if you can assign more memory to the task. 32 threads per task seems reasonable but could be increased. Two options worth a try are --iterate and -g (like -g 100). I can't tell you how much that would help, though; you should benchmark that yourself.

bbuchfink, Feb 13 '25 14:02
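
The suggestions above could be benchmarked with a small SLURM script along these lines. This is only a sketch: the output file names (`hits_*.tsv`), the `DRY_RUN` switch and the 128G memory request are assumptions for illustration, while the diamond options (-p, -b, -c, -k, --iterate, -g) are the ones discussed in this thread. By default it only prints the commands.

```shell
#!/bin/sh
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G

# Use the CPU count granted by SLURM, falling back to 32 outside a job.
THREADS="${SLURM_CPUS_PER_TASK:-32}"

# Base command from the original post, with the larger block size (-b6).
BASE="diamond blastx -d NR.dmnd -q input.fastq.gz -p $THREADS -b6 -c1 -k 1"

# Variants to compare: plain, --iterate, and -g 100 (each run separately).
COMMANDS=""
i=0
for extra in "" "--iterate" "-g 100"; do
  i=$((i + 1))
  # hits_$i.tsv is a placeholder output name.
  cmd="$BASE${extra:+ $extra} -o hits_$i.tsv"
  COMMANDS="$COMMANDS$cmd
"
  echo "$cmd"
  # Set DRY_RUN=0 to actually run each variant and report wall-clock time.
  if [ "${DRY_RUN:-1}" = "0" ]; then
    start=$(date +%s)
    $cmd
    echo "variant $i took $(( $(date +%s) - start )) s"
  fi
done
```

Running the variants on one representative sample first keeps the benchmark cheap before committing to a full batch.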

Thanks for the rapid feedback! I will try to increase SLURM_mem to 128 GB. With more memory, is even -b8 reasonable?

The parameter -g is very interesting as it affects the local alignment. Should it be adjusted to the length of the input sequences?

mweberr, Feb 13 '25 15:02

-b8 should be slightly faster than -b6, but the gains are probably pretty marginal. The value of -g is the number of targets that will be extended for each query, so no, it does not really depend on the sequence length.

bbuchfink, Feb 13 '25 15:02
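
For sizing the SLURM memory request, a rough estimate may help. As far as I know, the DIAMOND documentation describes -b as the block size in billions of sequence letters and suggests expecting roughly six times that number in GB of memory. The helper below (`est_mem_gb` is a hypothetical name) simply applies that rule of thumb; real usage can differ, so treat the numbers as a starting point only.

```shell
# Rule-of-thumb estimate: about 6 GB of memory per unit of -b.
# This factor is an assumption based on the DIAMOND documentation,
# not a guarantee for any particular database or query set.
est_mem_gb() {
  echo $(( 6 * $1 ))
}

# Print the estimate for the default (-b2) and the values discussed here.
for b in 2 6 8; do
  echo "-b$b -> ~$(est_mem_gb "$b") GB"
done
```

By this estimate, both -b6 and -b8 should fit comfortably into a 128 GB allocation.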

Another hint: the best way to speed this up would be to first cluster the database. Diamond now has a feature to do this.

bbuchfink, Feb 14 '25 12:02
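
A minimal sketch of what clustering the database before searching might look like. The file names (`nr.faa`, `clusters.tsv`), the 90% threshold and the 64G memory limit are assumptions chosen for illustration. `count_clusters` is a hypothetical helper that relies on the two-column output (representative, member) described in the DIAMOND documentation. The clustering command itself is only printed here.

```shell
# Clustering command (printed only; file names and values are placeholders).
CLUSTER_CMD="diamond linclust -d nr.faa -o clusters.tsv --approx-id 90 -M 64G"
echo "$CLUSTER_CMD"

# Count clusters in a two-column TSV: column 1 = representative,
# column 2 = member. Each distinct representative is one cluster.
count_clusters() {
  cut -f1 "$1" | sort -u | wc -l | tr -d ' '
}

# Once clusters.tsv exists, the representative IDs could be listed with:
#   cut -f1 clusters.tsv | sort -u > representatives.txt
```

The representative sequences would then still need to be extracted into a FASTA file and indexed with `diamond makedb` before searching against them.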

Thanks for the new feature! Two questions: Is the --approx-id parameter an approximation similar to the CD-HIT -c parameter (identity threshold)?

Has someone already benchmarked the clustering of the NR database? For this to work, it would be important that clusters contain sequences of the same genus in most cases.

mweberr, Feb 14 '25 13:02

I just realised that fastq.gz files are not supported by the current version 2.11. For performance reasons, is it better to use FASTA input?

mweberr, Feb 14 '25 14:02

> Is the --approx-id parameter an approximation similar to the CD-HIT -c parameter (identity threshold)?

Yes.

> Has someone already benchmarked the clustering of the NR database?

These are the numbers of clusters for 826052433 input sequences when running diamond linclust. The runtime on a 128-core node is between 6 and 20 h.

id% clusters
90 380670766
80 244939256
70 166007477
60 116931790
50 86216486
40 66605982
30 54766718
20 50723066
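
To put the table into perspective, the following sketch computes what share of the 826052433 input sequences remains as cluster representatives at each threshold, and the corresponding reduction factor. `reduction_table` is a hypothetical helper, and the reduction in database size will not necessarily translate one-to-one into search speed.

```shell
# Total number of input sequences, as stated above.
TOTAL=826052433

# Read "id% clusters" pairs on stdin and print, tab-separated:
# id%, clusters, share of the input (%), reduction factor.
reduction_table() {
  awk -v total="$TOTAL" '{
    printf "%s\t%s\t%.1f%%\t%.1fx\n", $1, $2, 100 * $2 / total, total / $2
  }'
}

reduction_table <<'EOF'
90 380670766
80 244939256
70 166007477
60 116931790
50 86216486
40 66605982
30 54766718
20 50723066
EOF
```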

You can also check out this paper https://journals.asm.org/doi/10.1128/msystems.01408-21

> For this to work, it would be important that clusters contain sequences of the same genus in most cases.

You would most likely have to cluster at a conservative 80%-90% cutoff and check whether that's the case; I haven't looked into it.

> I just realised that fastq.gz files are not supported by the current version 2.11. For performance reasons, is it better to use FASTA input?

That should be supported actually. Can you tell me how to reproduce the error? You can also convert to FASTA first; it should not matter for performance.

bbuchfink, Feb 17 '25 10:02
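
If converting to FASTA is needed as a workaround, one possible way is a short awk filter. This sketch assumes strictly four-line FASTQ records (no wrapped sequence lines); `fastq_gz_to_fasta` is a hypothetical helper name and the file names in the usage comment are placeholders.

```shell
# Convert a gzipped FASTQ file to FASTA on stdout.
# Assumes each record is exactly 4 lines: @header, sequence, +, qualities.
fastq_gz_to_fasta() {
  gzip -dc "$1" | awk '
    NR % 4 == 1 { print ">" substr($0, 2) }  # header: replace leading @ with >
    NR % 4 == 2 { print }                    # sequence line, unchanged
  '
}

# Example usage:
#   fastq_gz_to_fasta input.fastq.gz > input.fasta
#   diamond blastx -d NR.dmnd -q input.fasta -p 32 -c1 -k 1
```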