Diamond parameters to run on HPC
Hi, I am using our local HPC facilities to run large-scale annotation of metatranscriptomic NGS samples against the NCBI NR database. Each sample is roughly 1 GB (~15 million reads) and is aligned against NR (~350 GB of protein sequences).
The current running time exceeds 48 h, and I would like to reduce it by tuning the threads, memory, -c and -b parameters. Current allocation: SLURM_mem=32GB, SLURM_cpus=32.
diamond blastx -d NR.dmnd -q input.fastq.gz -p 32 -c1 -k 1
Can you recommend optimal values for this input data? This would help me a lot and save me time on parameter benchmarks.
Best, Michael
-c1 is good. You can try a higher block size such as -b6 if you can assign more memory to the task. 32 threads per task seems reasonable but could be increased. Two other options worth a try are --iterate and -g (e.g. -g 100). I can't tell you how much they would help, though; you should benchmark that yourself.
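As a rough illustration of how the suggested options fit together, here is a minimal Python sketch that assembles the command line. The helper name `build_blastx_command` and its defaults are hypothetical; only the flags themselves (-d, -q, -p, -c, -b, -k, -g, --iterate) come from this thread, and the values are starting points to benchmark, not recommendations.

```python
import shlex


def build_blastx_command(database, query, threads=32, block_size=6,
                         index_chunks=1, max_target_seqs=1,
                         global_ranking=100, iterate=True):
    """Assemble a diamond blastx argument list from the options discussed above.

    Hypothetical helper for illustration; it only builds the command
    and does not run diamond.
    """
    cmd = [
        "diamond", "blastx",
        "-d", database,
        "-q", query,
        "-p", str(threads),          # threads per task
        "-c", str(index_chunks),     # -c1 as in the original command
        "-b", str(block_size),       # larger block size needs more memory
        "-k", str(max_target_seqs),  # report one target per query
    ]
    if global_ranking is not None:
        cmd += ["-g", str(global_ranking)]  # number of targets extended per query
    if iterate:
        cmd.append("--iterate")
    return cmd


# Print the command so it can be pasted into a SLURM job script.
print(shlex.join(build_blastx_command("NR.dmnd", "input.fastq.gz")))
```

Passing `global_ranking=None` or `iterate=False` drops the respective option, which makes it easy to generate the variants for a small benchmark.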
Thanks for the rapid feedback! I will try to increase SLURM_mem to 128 GB. With that much memory, would even -b8 be reasonable?
The -g parameter is very interesting, as it affects the local alignment. Should it be fitted to the length of the input sequences?
-b8 should be slightly faster than -b6, but the gains are probably pretty marginal. The value of -g is the number of targets that will be extended for each query, so no, it does not really depend on the sequence length.
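To check whether a given -b value is likely to fit into a SLURM memory allocation, a back-of-the-envelope estimate can help. The sketch below assumes the rule of thumb from the DIAMOND documentation that memory use is roughly six times the block size in GB; this factor is an approximation, -c1 is documented to raise memory use further, and the 80% headroom is an arbitrary assumption, so treat the numbers as a lower bound to verify on a real run.

```python
# Assumed rule of thumb: memory use in GB is roughly 6 x the -b block size.
# This is approximate and does not account for the extra memory used by -c1.
GB_PER_BLOCK_UNIT = 6.0


def estimated_memory_gb(block_size):
    """Rough memory estimate in GB for a given -b value."""
    return GB_PER_BLOCK_UNIT * block_size


def largest_block_size(mem_gb, headroom=0.8):
    """Largest whole-number -b whose estimate fits within headroom * mem_gb."""
    return int(mem_gb * headroom / GB_PER_BLOCK_UNIT)


for block_size in (2, 6, 8):
    print(f"-b{block_size}: about {estimated_memory_gb(block_size):.0f} GB")

for mem_gb in (32, 128):
    print(f"{mem_gb} GB allocation: -b up to about {largest_block_size(mem_gb)}")
```

Under these assumptions, -b6 would already exceed a 32 GB allocation, while both -b6 and -b8 fit comfortably into 128 GB.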
Another hint: the best way to speed this up would be to cluster the database first. Diamond now has a feature for this.
Thanks for the new feature! Two questions: Is the --approx-id parameter an approximation similar to the CD-HIT -c parameter (identity threshold)?
Has someone already benchmarked the clustering of the NR database? To make it work, it would be important that clusters contain sequences of the same genus in most cases.
I just realised that fastq.gz input is not supported by the current version 2.11. For performance reasons, is it better to use FASTA input?
> Is the --approx-id parameter an approximation similar to the CD-HIT -c parameter (identity threshold)?
Yes.
> Has someone already benchmarked the clustering of the NR database?

These are the cluster counts for 826052433 input sequences clustered with diamond linclust; the runtime on a 128-core node is between 6 and 20 h.
| id% | clusters |
|---|---|
| 90 | 380670766 |
| 80 | 244939256 |
| 70 | 166007477 |
| 60 | 116931790 |
| 50 | 86216486 |
| 40 | 66605982 |
| 30 | 54766718 |
| 20 | 50723066 |
You can also check out this paper https://journals.asm.org/doi/10.1128/msystems.01408-21
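To put the cluster counts in the table above into perspective, the short sketch below computes what fraction of the 826052433 input sequences remains as cluster representatives at each identity threshold. The numbers are copied from the table; the assumption that search time shrinks along with the database size is only a rough expectation and would need benchmarking.

```python
# Cluster counts from the table above (identity threshold in % -> clusters).
TOTAL_INPUT_SEQUENCES = 826052433
CLUSTERS_BY_IDENTITY = {
    90: 380670766,
    80: 244939256,
    70: 166007477,
    60: 116931790,
    50: 86216486,
    40: 66605982,
    30: 54766718,
    20: 50723066,
}


def remaining_fraction(identity):
    """Fraction of input sequences kept as cluster representatives."""
    return CLUSTERS_BY_IDENTITY[identity] / TOTAL_INPUT_SEQUENCES


for identity in sorted(CLUSTERS_BY_IDENTITY, reverse=True):
    fraction = remaining_fraction(identity)
    print(f"{identity}% identity: {fraction:.1%} of input, "
          f"{1 / fraction:.1f}x fewer sequences")
```

At the conservative 90% and 80% cutoffs, the clustered database keeps a little under half and a little under a third of the input sequences, respectively.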
> To make it work, it would be important that clusters contain sequences of the same genus in most cases.

You would most likely have to cluster at a conservative 80%-90% cutoff and check whether that's the case; I haven't looked into it.
> I just realised that fastq.gz input is not supported by the current version 2.11. For performance reasons, is it better to use FASTA input?

That should actually be supported; can you tell me how to reproduce the error? You can convert to FASTA first; it should not matter for performance.
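If converting to FASTA is the easier workaround, a minimal sketch using only the Python standard library could look like the following. It assumes the common four-line FASTQ layout (no wrapped sequences, no trailing blank lines); the function name `fastq_gz_to_fasta` and the file names are hypothetical.

```python
import gzip
from itertools import islice


def fastq_gz_to_fasta(fastq_gz_path, fasta_path):
    """Convert a gzipped four-line FASTQ file to FASTA.

    Returns the number of records written. Quality lines are dropped.
    """
    count = 0
    with gzip.open(fastq_gz_path, "rt") as fastq, open(fasta_path, "w") as fasta:
        while True:
            record = list(islice(fastq, 4))  # header, sequence, '+', quality
            if not record:
                break
            if len(record) < 4 or not record[0].startswith("@"):
                raise ValueError(f"malformed FASTQ record near read {count + 1}")
            header = record[0].rstrip("\n")
            sequence = record[1].rstrip("\n")
            fasta.write(">" + header[1:] + "\n" + sequence + "\n")
            count += 1
    return count


# Example call (file names are placeholders):
# fastq_gz_to_fasta("input.fastq.gz", "input.fasta")
```

The resulting FASTA file can then be passed to diamond with -q in place of the fastq.gz input.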