DIAMOND blastx extremely slow during "Computing alignments..." stage
Hi DIAMOND developers,

I'm encountering severe performance issues when running diamond blastx, especially during the "Computing alignments..." phase. The runtime is unreasonably long, and I'd appreciate any advice on optimization or troubleshooting.

Environment & Command:
• System: Linux server
• DIAMOND version: v2.1.13.167
• Database: uniref90.dmnd (~184 million sequences)
• Query file: TC.p_ctg.fa
• Threads: 16
Command used:
diamond blastx \
  --db uniref90.dmnd \
  --query TC.p_ctg.fa \
  --out p_ctg_vs_uniref90.daa \
  --threads 16 \
  --evalue 1e-5 \
  --max-target-seqs 1
Console output:
diamond v2.1.13.167 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
#CPU threads: 16
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory:
#Target sequences to report alignments for: 1
Opening the database... [0.061s]
Database: uniref90.dmnd (type: Diamond database, sequences: 184146434, letters: 64621732275)
Block size = 2000000000
Opening the input file... [0.029s]
Opening the output file... [0s]
Loading query sequences... [27.271s]
Masking queries... [34.56s]
Algorithm: Double-indexed
Building query histograms... [4.559s]
Seeking in database... [0s]
Loading reference sequences... [10.973s]
Masking reference... [7.358s]
Initializing dictionary... [0.001s]
Initializing temporary storage... [0s]
Building reference histograms... [7.224s]
Allocating buffers... [0.016s]
Processing query block 1, reference block 1/33, shape 1/2, index chunk 1/4.
Building reference seed array... [4.387s]
Building query seed array... [2.52s]
Computing hash join... [1.407s]
Masking low complexity seeds... [0.43s]
Searching alignments... [2.908s]
Deallocating memory... [0s]
......
Processing query block 1, reference block 1/33, shape 2/2, index chunk 4/4.
Building reference seed array... [3.836s]
Building query seed array... [2.059s]
Computing hash join... [1.143s]
Masking low complexity seeds... [0.375s]
Searching alignments... [2.494s]
Deallocating memory... [0s]
Deallocating buffers... [0.447s]
Clearing query masking... [0.344s]
Opening temporary output file... [0s]
Computing alignments... [117232s]
Deallocating reference... [0.118s]
**Problem Description:** The program runs normally during the initial stages (loading sequences, masking, indexing) but slows down drastically at "Computing alignments...":
• Reference block 1/33 took ~117,232 seconds (~32.5 hours)
• Reference block 2/33 took ~180,257 seconds (~50 hours)

At this rate, completing all 33 blocks could take weeks, which seems abnormal.
Relevant Log Snippet:
Deallocating buffers... [0.447s]
Clearing query masking... [0.344s]
Opening temporary output file... [0s]
Computing alignments... [117232s] # Extremely slow here
Deallocating reference... [0.118s]
Additional Information:
seqkit stats TC.p_ctg.fa
file         format  type  num_seqs        sum_len  min_len      avg_len      max_len
TC.p_ctg.fa  FASTA   DNA        216  1,418,995,089   14,752  6,569,421.7  146,169,120

Genome assembly command:
hifiasm -o TC_primary -t 32 -l2 --primary TC__M.fastq.gz 2> TC_p.log
My Questions:
Could you please advise me on how to resolve this issue?
• How can I speed up the "Computing alignments..." phase so that the job can complete within a reasonable timeframe?
• Are there any parameter adjustments, preprocessing steps, or system configurations I should consider to avoid such extreme slowdowns?
• Is there a way to diagnose whether this is due to query complexity, memory handling, or something else?

Any guidance that could help me successfully complete this DIAMOND run would be greatly appreciated. Thank you very much for your work on DIAMOND!
Diamond is not well adapted to very long queries of several megabases.
You can try frameshift alignment mode, e.g. -F 15 --range-culling --top 10. This is worth a try, but it will not necessarily solve the problem.
You can split the queries into overlapping pieces of e.g. 10k-100k bases in length.
You can first do gene calling or ORF extraction and then align the resulting sequences.
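
To illustrate the ORF-extraction suggestion, here is a minimal Python sketch of a naive six-frame, stop-to-stop ORF scan. It is not part of DIAMOND, and the function names and the `min_len` threshold are hypothetical choices made for illustration. It also ignores introns, so for a eukaryotic assembly it only approximates what a dedicated gene predictor would produce.

```python
# Naive six-frame ORF scan (illustrative sketch; names are hypothetical).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N is left unchanged)."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_to_stop_orfs(seq, min_len=300):
    """Yield (strand, frame, start, end, orf_seq) for stop-free stretches.

    start/end are 0-based, end-exclusive offsets on the scanned strand;
    for strand '-' they refer to the reverse-complemented sequence.
    min_len is the minimum ORF length in nucleotides (an assumed default).
    """
    strands = (("+", seq.upper()), ("-", reverse_complement(seq).upper()))
    for strand, s in strands:
        for frame in range(3):
            orf_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if i - orf_start >= min_len:
                        yield strand, frame, orf_start, i, s[orf_start:i]
                    orf_start = i + 3
            # Trailing stretch that runs to the sequence end (whole codons only).
            end = frame + ((len(s) - frame) // 3) * 3
            if end - orf_start >= min_len:
                yield strand, frame, orf_start, end, s[orf_start:end]
```

The resulting nucleotide ORFs could be written to FASTA and used as blastx queries, which keeps every query far below the multi-megabase lengths that cause the slowdown.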
Thank you so much for your help! I split the query sequences into overlapping segments of approximately 100,000 base pairs in length, and now the analysis speed has been significantly improved.
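
For readers who want to reproduce the splitting step, the following is a minimal Python sketch, not the exact procedure used above. The 100,000 bp window follows the thread, while the 10,000 bp overlap, the function names, and the `name:start-end` header convention are assumptions made for illustration.

```python
def overlapping_windows(seq_len, window=100_000, overlap=10_000):
    """Yield (start, end) 0-based half-open windows covering [0, seq_len)."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    if seq_len <= 0:
        return
    step = window - overlap
    start = 0
    while True:
        end = min(start + window, seq_len)
        yield start, end
        if end >= seq_len:
            break
        start += step


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_fasta(in_handle, out_handle, window=100_000, overlap=10_000):
    """Write each sequence as overlapping chunks named 'name:start-end'."""
    for name, seq in read_fasta(in_handle):
        for start, end in overlapping_windows(len(seq), window, overlap):
            # Header coordinates are 1-based and inclusive.
            out_handle.write(f">{name}:{start + 1}-{end}\n{seq[start:end]}\n")
```

With this header convention, a hit's query coordinates on a chunk map back to the original contig by adding the chunk's start minus 1. Hits that fall entirely inside an overlap region may be reported twice, once per chunk, and would need deduplication.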