Diamond Clustering randomly crashing.
I'm using Diamond clustering as part of an alignment strategy. This means I can run clustering over millions of files across a dataset. I run 32 instances of diamond in parallel. The CPU has 64c/128p so I'm not overthreading.
Randomly, diamond is abruptly terminating mid-run. The log files show it simply stops. It's never the same file. I have to trigger it by running the script in a bash loop to catch it. Error rate is estimated at 1/100000 iterations of diamond.
diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen Documentation, support and updates available at http://www.diamondsearch.org Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
#CPU threads: 1 Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1) Opening the input file... [0s] Input database: /run/shm/uce-12426_gow28_ui/tmp_in.fa (2 sequences, 65 letters) Temporary directory: /run/shm/uce-12426_gow28_ui #Target sequences to report alignments for: unlimited Database: /run/shm/uce-12426_gow28_ui/tmp_in.fa (type: FASTA file, sequences: 2, letters: 65) Block size = 3200000000 Opening the input file... [0s] Opening the output file... [0s] Seeking in database... [0s] Loading query sequences... [0s] Length sorting queries... [0s] Algorithm: Double-indexed Building query histograms... [0s] Seeking in database... [0s] Seeking in database... [0s] Initializing temporary storage... [0s] Building reference histograms... [0s] Allocating buffers... [0s] Processing query block 1, reference block 1/1, shape 1/1. Building reference seed array... [0s] Building query seed array... [0s] Computing hash join... [0s] Masking low complexity seeds... [0s] Searching alignments... [0s] Deallocating memory... [0s] Deallocating buffers... [0s] Clearing query masking... [0s] Computing alignments... Loading trace points... [0s] Sorting trace points... [0s] Computing alignments...
Possibly, it's some sort of underflow error. It might be crashing when there are too few sequences in a file to cluster. I will try upping the requirement for clustering to occur.
Edit: I implemented a lower limit of 10 sequences to pass through clustering and that appears to have stopped the crashes. I will test how low I can go. (Limbo time!)
Edit2: Here is the command I am running.
diamond cluster -d {tmp_in.name} -o {tmp_result.name} --approx-id 85 --member-cover 65 --threads 1 --quiet
I'll try to reproduce it but it may not be that easy.