Segmentation fault during distributed multiprocessing run on HPC cluster (DIAMOND v2.1.13)
Hi DIAMOND developers,
I am running DIAMOND using the distributed multiprocessing mode on a large HPC cluster.
The initialization step (--mp-init) finishes without errors, but the actual distributed run (--multiprocessing) frequently ends with Segmentation fault (core dumped) across many nodes.
I am not sure if this is caused by my job scripts, block-size configuration, or something related to file I/O on our cluster.
I would greatly appreciate your help.
📌 System Environment
- HPC cluster with Slurm
- CPU nodes: 192 cores each
- Parallel filesystem: SSDFS (shared)
- Local node storage: /tmp
- DIAMOND version: v2.1.13
- Query FASTA size: extremely large (3.3B sequences, ~400 GB)
- Database size: ~450 GB
📌 What I observe
During the distributed run, the DIAMOND processes correctly generate work packages and begin computing. Then, after completing a set of reference blocks, many nodes crash with:
```
Segmentation fault (core dumped)
Error: stoull
```
This happens across multiple compute nodes, for multiple blocks.
📌 What I already tried
✔ Reduced --block-size from default to 0.5, 0.2
✔ Switched tmpdir from parallel FS to node-local /tmp
✔ Used only one task per node (as recommended by the docs)
✔ Used DIAMOND v2.1.13 (precompiled)
✔ Verified PTMP is shared and accessible from all nodes
Segfaults continue to occur.
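For reference, the shared-directory check was a plain shell sanity test along these lines (paths match the scripts below; not a DIAMOND command):

```bash
# Sanity check: confirm $PTMP is visible and writable from every
# allocated node (one task per node)
export PTMP=/ssdfs/datahome/tjhgy/diamond_parallel/ptmp
srun --nodes=10 --ntasks-per-node=1 \
    bash -c 'touch "$PTMP/check_$(hostname)" && ls "$PTMP"'
```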
📌 My mp-init script (works fine)
```bash
#!/bin/bash
#SBATCH --job-name=diamond_mp_init
#SBATCH --partition=fata
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=192
#SBATCH --exclusive
#SBATCH --output=diamond_mp_init_%j.log
#SBATCH --error=diamond_mp_init_%j.log
export PATH="/share/home/tjhgy/apps/diamond-v2.1.13:$PATH"
diamond --version
export PTMP=/ssdfs/datahome/tjhgy/diamond_parallel/ptmp
export TMP=/tmp
mkdir -p "$PTMP" "$TMP"
DB="/share/home/tjhgy/yunwei/diamond/makedb_test/test.dmnd"
QUERY="/share/home/tjhgy/yunwei/diamond/makedb_test/test.faa"
echo "===== Diamond mp-init starting ====="
date
diamond blastp \
-d "$DB" \
-q "$QUERY" \
-f 6 qseqid sseqid corrected_bitscore \
--approx-id 30 \
--query-cover 90 \
-k 1000 \
-c 1 \
--fast \
--multiprocessing \
--mp-init \
--parallel-tmpdir "$PTMP" \
--tmpdir "$TMP" \
--block-size 0.2 \
-o mp_init.out
echo "===== mp-init Completed ====="
date
```
📌 My distributed run script (Segfault occurs)
```bash
#!/bin/bash
#SBATCH --job-name=diamond_mp_run
#SBATCH --partition=fata
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=192
#SBATCH --exclusive
#SBATCH --time=7-00:00:00
#SBATCH --output=diamond_mp_run_%j.log
#SBATCH --error=diamond_mp_run_%j.log
export PATH="/share/home/tjhgy/apps/diamond-v2.1.13:$PATH"
diamond --version
DB="/share/home/tjhgy/yunwei/diamond/makedb_test/test.dmnd"
QUERY="/share/home/tjhgy/yunwei/diamond/makedb_test/test.faa"
export PTMP=/ssdfs/datahome/tjhgy/diamond_parallel/ptmp
export TMP=/tmp
OUTPREFIX="round2_out"
echo "===== Diamond Round 2 Distributed Run ====="
echo "DB: $DB"
echo "PTMP: $PTMP"
echo "TMP: $TMP"
date
srun diamond blastp \
-d "$DB" \
-q "$QUERY" \
-f 6 qseqid sseqid corrected_bitscore \
--approx-id 30 \
--query-cover 90 \
-k 1000 \
-c 1 \
--fast \
--multiprocessing \
--parallel-tmpdir "$PTMP" \
--tmpdir "$TMP" \
--block-size 0.2 \
-o "${OUTPREFIX}_${SLURM_PROCID}.out" # note: expanded by the batch shell before srun runs, so this is not the per-task rank
echo "===== Distributed run completed ====="
date
```
📌 Sample error log
```
Computing alignments... [3.9s]
srun: error: nodeXX: task 3: Segmentation fault (core dumped)
Error: stoull
srun: error: nodeXY: task 7: Segmentation fault (core dumped)
```
📌 My Questions
Could you please advise me on how to resolve this issue and successfully complete the distributed multiprocessing run without segmentation faults?
Any guidance would be greatly appreciated. Thank you very much for your work on DIAMOND!
I tried to reproduce it on a small db with the same commands and it worked. Can you tell me if the error occurs for a small example, e.g. self-alignment of swissprot?
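For concreteness, a minimal version of such a test might look like this (file names and the tmp path are placeholders; the flags mirror the failing run above):

```bash
# Hypothetical small-scale reproduction: SwissProt self-alignment
# with the same multiprocessing flags as the failing run
diamond makedb --in swissprot.fasta -d swissprot

export PTMP=/path/to/shared/ptmp   # placeholder

# initialization step
diamond blastp -d swissprot.dmnd -q swissprot.fasta \
    -f 6 qseqid sseqid corrected_bitscore \
    --multiprocessing --mp-init \
    --parallel-tmpdir "$PTMP" -o mp_init.out

# distributed step, run on each worker node
diamond blastp -d swissprot.dmnd -q swissprot.fasta \
    -f 6 qseqid sseqid corrected_bitscore \
    --multiprocessing --parallel-tmpdir "$PTMP" -o swissprot_self.out
```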
Thank you very much, @bbuchfink, for your previous response.
To follow up on the question initially raised by my colleague @Hughuang12, we tested distributed multiprocessing with a modified SLURM script. On a small database (~5 million proteins), the following script runs efficiently (runtime ≈ 27 hours):
```bash
#!/bin/bash
#SBATCH --job-name=diamond_round2_mp_compute
#SBATCH --partition=intel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=96
#SBATCH --exclusive
#SBATCH --time=7-00:00:00
#SBATCH --output=%x_%j.log
#SBATCH --error=%x_%j.log
############################################
DB="makedb_test/test.dmnd"
QUERY="makedb_test/test.faa"
export PTMP=diamond_parallel/ptmp_round1
export TMP=/tmp
OUTPREFIX=full_round2_out
module purge
module load gcc/11.4.0 intel/oneapi/21.4
export SLURM_HINT=multithread
echo "===== Diamond Round 2 Distributed Multiprocessing ====="
echo "DB: $DB"
echo "PTMP: $PTMP"
echo "TMP: $TMP"
date
echo "======================================================="
srun diamond blastp \
-d "$DB" \
-q "$QUERY" \
-f 6 qseqid sseqid corrected_bitscore \
--approx-id 30 \
--query-cover 90 \
-k 1000 \
-c 1 \
--fast \
--multiprocessing \
--parallel-tmpdir "$PTMP" \
--tmpdir "$TMP" \
--block-size 2 \
-o "$OUTPREFIX"
date
echo "===== Round 2 multiprocessing DONE ====="
However, when we run the same workflow on a much larger database (~3 billion proteins), the runtime increases dramatically. In the log we sometimes see very long alignment phases, for example:
`Computing alignments... [7297.49s]`
We would really appreciate your advice on a few technical points:
1. Parallel configuration
Each node has 192 hardware threads. For large databases, which configuration would you generally recommend?
- 1 process × 192 threads
- 4 processes × 48 threads
or some other layout?
2. Parallelization behavior
With --block-size 2 (given in billions of letters), the large database is split into ~150 reference blocks. In the log we only see sequential messages such as:
```
Processing query block 4, reference block 32/150
Processing query block 4, reference block 33/150
Processing query block 4, reference block 34/150
...
```
From the log output, it looks as if each query–reference block pair is processed one after another, and we do not see any explicit indication of multiple blocks being processed in parallel (either across queries or references).
Could you clarify how multiprocessing is applied internally in DIAMOND?
- Is parallelism primarily over query blocks, reference blocks, or both?
- Are multiple reference blocks for the same query block processed concurrently, or is the parallelism mainly within a single block (e.g. over threads)?
- Is the apparent sequential logging expected even when computations are running in parallel?
A better understanding of how block-level parallelization is implemented (and how it is reflected in the logs) would help us choose an appropriate job partitioning strategy on our cluster.
3. Parameter tuning
For multi-billion-sequence databases, do you think performance is strongly affected by parameters such as:
- --block-size
- --parallel-tmpdir / --tmpdir and filesystem layout
- SLURM settings like --ntasks-per-node and --cpus-per-task?
Any suggestions for optimizing DIAMOND on very large protein databases would be extremely helpful for us.
Thank you again for your time and for maintaining this great tool.
> However, when we run the same workflow on a much larger database (~3 billion proteins), the runtime increases dramatically. In the log we sometimes see very long alignment phases, for example:
> `Computing alignments... [7297.49s]`
>
> We would really appreciate your advice on a few technical points:
> 1. Parallel configuration
> Each node has 192 hardware threads. For large databases, which configuration would you generally recommend?
> - 4 processes × 48 threads
> - or some other layout?

4 × 48 should be a bit faster, I assume.
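In Slurm terms, that layout would be something like the following (a sketch assembled from the scripts above, with the same variables; untested):

```bash
#!/bin/bash
# Sketch: 4 DIAMOND workers per 192-core node, 48 threads each
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=48

srun diamond blastp \
    -d "$DB" -q "$QUERY" \
    -f 6 qseqid sseqid corrected_bitscore \
    --approx-id 30 --query-cover 90 -k 1000 -c 1 --fast \
    --multiprocessing \
    --parallel-tmpdir "$PTMP" --tmpdir "$TMP" \
    --block-size 2 \
    --threads 48 \
    -o "$OUTPREFIX"
```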
> 2. Parallelization behavior
> With --block-size 2, the large database is split into ~150 reference blocks. In the log we only see sequential messages such as:
> Processing query block 4, reference block 32/150
> Processing query block 4, reference block 33/150
> Processing query block 4, reference block 34/150
> ...
> From the log output, it looks as if each query–reference block pair is processed one after another, and we do not see any explicit indication of multiple blocks being processed in parallel (either across queries or references).
If all your worker processes write to the same log, this output is expected. They should be working on different block combinations in parallel; if it is really only one block after the other, then something is going wrong and you are only seeing output from one worker.
> Could you clarify how multiprocessing is applied internally in DIAMOND?
> - Is parallelism primarily over query blocks, reference blocks, or both?
Both; each worker will fetch the next reference block for the current query block and, when that is done, start the next query block.
> - Are multiple reference blocks for the same query block processed concurrently, or is the parallelism mainly within a single block (e.g. over threads)?
Yes, the worker processes handle different block combinations in parallel.
> - Is the apparent sequential logging expected even when computations are running in parallel?
You should see the logs of many blocks interleaved when all workers write to the same log.
> A better understanding of how block-level parallelization is implemented (and how it is reflected in the logs) would help us choose an appropriate job partitioning strategy on our cluster.
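One way to make the per-worker progress visible is to give each task its own log file (a sketch; %t is srun's task-rank placeholder, variables as in the batch scripts above):

```bash
# One log file per worker task instead of a single interleaved log
srun --output=diamond_worker_%t.log diamond blastp \
    -d "$DB" -q "$QUERY" \
    -f 6 qseqid sseqid corrected_bitscore \
    --multiprocessing --parallel-tmpdir "$PTMP" \
    -o "$OUTPREFIX"
```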
> 3. Parameter tuning
> For multi-billion-sequence databases, do you think performance is strongly affected by parameters such as:
> - --block-size
A higher block size can help, but probably not by much.
> - --parallel-tmpdir / --tmpdir and filesystem layout
> - SLURM settings like --ntasks-per-node and --cpus-per-task?
No, probably not.
> Any suggestions for optimizing DIAMOND on very large protein databases would be extremely helpful for us.
For extension-heavy computations like this you could try the experimental --anchored-swipe option (also use the latest release for this). Native compilation from source can also help a bit.
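Applied to the run above, that would look roughly like this (a sketch, same variables as before):

```bash
# Experimental anchored-SWIPE extension; use a recent DIAMOND release
srun diamond blastp \
    -d "$DB" -q "$QUERY" \
    -f 6 qseqid sseqid corrected_bitscore \
    --approx-id 30 --query-cover 90 -k 1000 -c 1 --fast \
    --multiprocessing --parallel-tmpdir "$PTMP" --tmpdir "$TMP" \
    --block-size 2 --anchored-swipe \
    -o "$OUTPREFIX"
```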
> Thank you again for your time and for maintaining this great tool.
Thanks for getting back to us earlier, @bbuchfink.
We would like to report two types of errors we encountered when running DIAMOND with multiprocessing.
1. Multiprocessing with 10 nodes fails to merge final results of a query block
When running with 10 nodes, all worker processes start correctly and write to the shared log (as expected, see log1). However, near the end of processing query block 1, the run fails with errors such as stoull and segmentation faults. An excerpt of log1 is shown below:
```
Processing query block 1, reference block 300/300, shape 1/1.
Building reference seed array... [0.404s]
Building query seed array... [0.604s]
Computing hash join... [0.58s]
Masking low complexity seeds... [0.146s]
Searching alignments... [0.182s]
Deallocating memory... [0s]
Deallocating buffers... [0.033s]
Clearing query masking... [0.442s]
Opening temporary output file... [0.001s]
Computing alignments... [17.753s]
Deallocating reference... [0.011s]
Computing alignments... [21.279s]
Deallocating reference... [0.014s]
Error: stoull
srun: error: cpui185: task 3: Exited with exit code 1
Masking reference... [0.57s]
Initializing dictionary... [0.021s]
Initializing temporary storage... [0.024s]
Building reference histograms... [0.719s]
Allocating buffers... [0s]
Processing query block 1, reference block 300/300, shape 1/1.
Building reference seed array... [0.39s]
Building query seed array... [0.384s]
Computing hash join... [0.543s]
Masking low complexity seeds... [0.172s]
Searching alignments... [0.187s]
Deallocating memory... [0s]
Deallocating buffers... [0.043s]
Clearing query masking... [0.492s]
Opening temporary output file... [0.002s]
Computing alignments... [16.077s]
Deallocating reference... [0.007s]
Error: stoull
srun: error: cpui181: task 0: Exited with exit code 1
Computing alignments... [18.423s]
Deallocating reference... [0.008s]
srun: error: cpui183: task 1: Segmentation fault (core dumped)
Thu Nov 27 13:32:30 CST 2025
```
This suggests that the workers finish their tasks but fail when merging results.
2. Multiprocessing with 2 nodes automatically falls back to single-node execution
When using 2 nodes, the run starts with two worker processes, but later query blocks switch to single-worker mode automatically (log2). This appears to happen without any explicit error message.
If DIAMOND multiprocessing cannot be used reliably in our environment, would splitting the query FASTA file manually into multiple chunks and running DIAMOND on each chunk in parallel produce results identical to running the entire query file without splitting?
We would like to confirm whether manual parallelization is a safe alternative.
I will try to fix the errors. Yes, you can just split the query file and get the same results. You can also manually split the database file and merge the results; this will not produce identical results unless the block sizes exactly match those the aligner produces, but the results will not be any less correct either. When doing this, you should set the database size in letters using --dbsize, because it is used for computing e-values.
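A sketch of the manual query split (using seqkit, which already appears later in this thread; the chunk count and file names are arbitrary):

```bash
# Split the query into 10 chunks and align each chunk independently;
# tabular outputs can simply be concatenated afterwards
seqkit split2 "$QUERY" -p 10 -O query_chunks/

for CHUNK in query_chunks/*.faa; do
    diamond blastp -d "$DB" -q "$CHUNK" \
        -f 6 qseqid sseqid corrected_bitscore \
        --approx-id 30 --query-cover 90 -k 1000 --fast \
        -o "${CHUNK%.faa}.out" &
done
wait

# If the database is split as well, pass the full database size in
# letters (e.g. --dbsize 300000000000) so e-values stay comparable
cat query_chunks/*.out > full_results.tsv
```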
Dear Dr. Buchfink (@bbuchfink),
We have completed an all-vs-all alignment for a large dataset of approximately 3 billion proteins, and we are now attempting to determine the representative proteins using the greedy vertex cover algorithm. The command we used is as follows:
```bash
seqkit faidx round_1_representive.faa
DB="round_1_representive.faa.fai"
EDGES="./out.tsv"
OUT="./clusters_round_2.tsv"
diamond greedy-vertex-cover \
    --edges "$EDGES" \
    -d "$DB" \
    --edge-format triplet \
    --threads 192 \
    -o "$OUT"
```
However, we encountered the following error, which seems to suggest that the input protein database ("round_1_representive.faa.fai") and the edge file ("out.tsv", approximately 3.3 TB) are too large:
```
diamond version 2.1.13
===== Diamond greedy-vertex-cover starting =====
Thu Dec 4 09:27:48 CST 2025
diamond v2.1.13.167 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support, and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
#CPU threads: 192
Coverage cutoff: 80%
Reading mapping file... [1726.85s]
#OIds: 3368888035
Counting input lines... [1353.48s]
#Lines: 61307535384
Allocating memory... [0s]
Reading input lines... [1120.68s]
Making flat array... [146.881s]
/var/spool/slurmd/job804556/slurm_script: line 40: 3077823 Segmentation fault (core dumped) diamond greedy-vertex-cover --edges $EDGES -d $DB --edge-format triplet --threads 192 -o $OUT
```
Could you suggest any methods to handle such large files, or recommend alternative approaches for large-scale network clustering, such as HipMCL (https://bitbucket.org/azadcse/hipmcl/wiki/Home)?
Thank you for your assistance.
@linweiliarchaea It should be possible to fix this by adding --connected-component-depth 0 to the greedy-vertex-cover call.
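Applied to the call above:

```bash
# Same call as above, with the suggested flag added
diamond greedy-vertex-cover \
    --edges "$EDGES" \
    -d "$DB" \
    --edge-format triplet \
    --threads 192 \
    --connected-component-depth 0 \
    -o "$OUT"
```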