Segmentation fault during distributed multiprocessing run on HPC cluster (DIAMOND v2.1.13)
Hi DIAMOND developers,
I am running DIAMOND using the distributed multiprocessing mode on a large HPC cluster.
The initialization step (--mp-init) finishes without errors, but the actual distributed run (--multiprocessing) frequently ends with Segmentation fault (core dumped) across many nodes.
I am not sure if this is caused by my job scripts, block-size configuration, or something related to file I/O on our cluster.
I would greatly appreciate your help.
📌 System Environment
- HPC cluster with Slurm
- CPU nodes: 192 cores each
- Parallel filesystem: SSDFS (shared)
- Local node storage: /tmp
- DIAMOND version: v2.1.13
- Query FASTA size: extremely large (3.3B sequences, ~400 GB)
- Database size: ~450 GB
📌 What I observe
During the distributed run, the DIAMOND processes correctly generate work packages and begin computing. Then, after completing a set of reference blocks, many nodes crash with:
```
Segmentation fault (core dumped)
Error: stoull
```
This happens across multiple compute nodes, for multiple blocks.
📌 What I already tried
✔ Reduced --block-size from default to 0.5, 0.2
✔ Switched tmpdir from parallel FS to node-local /tmp
✔ Used only one task per node (as recommended by the docs)
✔ Used DIAMOND v2.1.13 (precompiled)
✔ Verified PTMP is shared and accessible from all nodes
Segfaults continue to occur.
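For reference, the shared-directory check was a plain shell sanity test along these lines (paths match the scripts below; not a DIAMOND command):

```bash
# Sanity check: confirm $PTMP is visible and writable from every
# allocated node (one task per node)
export PTMP=/ssdfs/datahome/tjhgy/diamond_parallel/ptmp
srun --nodes=10 --ntasks-per-node=1 \
    bash -c 'touch "$PTMP/check_$(hostname)" && ls "$PTMP"'
```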
📌 My mp-init script (works fine)
```bash
#!/bin/bash
#SBATCH --job-name=diamond_mp_init
#SBATCH --partition=fata
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=192
#SBATCH --exclusive
#SBATCH --output=diamond_mp_init_%j.log
#SBATCH --error=diamond_mp_init_%j.log
export PATH="/share/home/tjhgy/apps/diamond-v2.1.13:$PATH"
diamond --version
export PTMP=/ssdfs/datahome/tjhgy/diamond_parallel/ptmp
export TMP=/tmp
mkdir -p "$PTMP" "$TMP"
DB="/share/home/tjhgy/yunwei/diamond/makedb_test/test.dmnd"
QUERY="/share/home/tjhgy/yunwei/diamond/makedb_test/test.faa"
echo "===== Diamond mp-init starting ====="
date
diamond blastp \
-d "$DB" \
-q "$QUERY" \
-f 6 qseqid sseqid corrected_bitscore \
--approx-id 30 \
--query-cover 90 \
-k 1000 \
-c 1 \
--fast \
--multiprocessing \
--mp-init \
--parallel-tmpdir "$PTMP" \
--tmpdir "$TMP" \
--block-size 0.2 \
-o mp_init.out
echo "===== mp-init Completed ====="
date
```
📌 My distributed run script (Segfault occurs)
```bash
#!/bin/bash
#SBATCH --job-name=diamond_mp_run
#SBATCH --partition=fata
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=192
#SBATCH --exclusive
#SBATCH --time=7-00:00:00
#SBATCH --output=diamond_mp_run_%j.log
#SBATCH --error=diamond_mp_run_%j.log
export PATH="/share/home/tjhgy/apps/diamond-v2.1.13:$PATH"
diamond --version
DB="/share/home/tjhgy/yunwei/diamond/makedb_test/test.dmnd"
QUERY="/share/home/tjhgy/yunwei/diamond/makedb_test/test.faa"
export PTMP=/ssdfs/datahome/tjhgy/diamond_parallel/ptmp
export TMP=/tmp
OUTPREFIX="round2_out"
echo "===== Diamond Round 2 Distributed Run ====="
echo "DB: $DB"
echo "PTMP: $PTMP"
echo "TMP: $TMP"
date
srun diamond blastp \
-d "$DB" \
-q "$QUERY" \
-f 6 qseqid sseqid corrected_bitscore \
--approx-id 30 \
--query-cover 90 \
-k 1000 \
-c 1 \
--fast \
--multiprocessing \
--parallel-tmpdir "$PTMP" \
--tmpdir "$TMP" \
--block-size 0.2 \
-o "${OUTPREFIX}_${SLURM_PROCID}.out" # note: expanded by the batch shell before srun runs, so this is not the per-task rank
echo "===== Distributed run completed ====="
date
```
📌 Sample error log
```
Computing alignments... [3.9s]
srun: error: nodeXX: task 3: Segmentation fault (core dumped)
Error: stoull
srun: error: nodeXY: task 7: Segmentation fault (core dumped)
```
📌 My Questions
Could you please advise me on how to resolve this issue and successfully complete the distributed multiprocessing run without segmentation faults?
Any guidance would be greatly appreciated. Thank you very much for your work on DIAMOND!
I tried to reproduce it on a small db with the same commands and it worked. Can you tell me if the error occurs for a small example, e.g. self-alignment of swissprot?
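For concreteness, a minimal version of such a test might look like this (file names and the tmp path are placeholders; the flags mirror the failing run above):

```bash
# Hypothetical small-scale reproduction: SwissProt self-alignment
# with the same multiprocessing flags as the failing run
diamond makedb --in swissprot.fasta -d swissprot

export PTMP=/path/to/shared/ptmp   # placeholder

# initialization step
diamond blastp -d swissprot.dmnd -q swissprot.fasta \
    -f 6 qseqid sseqid corrected_bitscore \
    --multiprocessing --mp-init \
    --parallel-tmpdir "$PTMP" -o mp_init.out

# distributed step, run on each worker node
diamond blastp -d swissprot.dmnd -q swissprot.fasta \
    -f 6 qseqid sseqid corrected_bitscore \
    --multiprocessing --parallel-tmpdir "$PTMP" -o swissprot_self.out
```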
Thank you very much, @bbuchfink, for your previous response.
To follow up on the question initially raised by my colleague @Hughuang12, we tested distributed multiprocessing with a modified SLURM script. On a small database (~5 million proteins), the following script runs efficiently (runtime ≈ 27 hours):
```bash
#!/bin/bash
#SBATCH --job-name=diamond_round2_mp_compute
#SBATCH --partition=intel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=96
#SBATCH --exclusive
#SBATCH --time=7-00:00:00
#SBATCH --output=%x_%j.log
#SBATCH --error=%x_%j.log
############################################
DB="makedb_test/test.dmnd"
QUERY="makedb_test/test.faa"
export PTMP=diamond_parallel/ptmp_round1
export TMP=/tmp
OUTPREFIX=full_round2_out
module purge
module load gcc/11.4.0 intel/oneapi/21.4
export SLURM_HINT=multithread
echo "===== Diamond Round 2 Distributed Multiprocessing ====="
echo "DB: $DB"
echo "PTMP: $PTMP"
echo "TMP: $TMP"
date
echo "======================================================="
srun diamond blastp \
-d "$DB" \
-q "$QUERY" \
-f 6 qseqid sseqid corrected_bitscore \
--approx-id 30 \
--query-cover 90 \
-k 1000 \
-c 1 \
--fast \
--multiprocessing \
--parallel-tmpdir "$PTMP" \
--tmpdir "$TMP" \
--block-size 2 \
-o "$OUTPREFIX"
date
echo "===== Round 2 multiprocessing DONE ====="
However, when we run the same workflow on a much larger database (~3 billion proteins), the runtime increases dramatically. In the log we sometimes see very long alignment phases, for example:
`Computing alignments... [7297.49s]`
We would really appreciate your advice on a few technical points:
1. Parallel configuration
Each node has 192 hardware threads. For large databases, which configuration would you generally recommend?
- 1 process × 192 threads
- 4 processes × 48 threads
or some other layout?
2. Parallelization behavior
With --block-size 2 (given in billions of letters), the large database is split into ~150 reference blocks. In the log we only see sequential messages such as:
```
Processing query block 4, reference block 32/150
Processing query block 4, reference block 33/150
Processing query block 4, reference block 34/150
...
```
From the log output, it looks as if each query–reference block pair is processed one after another, and we do not see any explicit indication of multiple blocks being processed in parallel (either across queries or references).
Could you clarify how multiprocessing is applied internally in DIAMOND?
- Is parallelism primarily over query blocks, reference blocks, or both?
- Are multiple reference blocks for the same query block processed concurrently, or is the parallelism mainly within a single block (e.g. over threads)?
- Is the apparent sequential logging expected even when computations are running in parallel?
A better understanding of how block-level parallelization is implemented (and how it is reflected in the logs) would help us choose an appropriate job partitioning strategy on our cluster.
3. Parameter tuning
For multi-billion-sequence databases, do you think performance is strongly affected by parameters such as:
- --block-size
- --parallel-tmpdir / --tmpdir and filesystem layout
- SLURM settings like --ntasks-per-node and --cpus-per-task?
Any suggestions for optimizing DIAMOND on very large protein databases would be extremely helpful for us.
Thank you again for your time and for maintaining this great tool.
> However, when we run the same workflow on a much larger database (~3 billion proteins), the runtime increases dramatically. In the log we sometimes see very long alignment phases, for example:
> `Computing alignments... [7297.49s]`
>
> We would really appreciate your advice on a few technical points:
> 1. Parallel configuration
> Each node has 192 hardware threads. For large databases, which configuration would you generally recommend?
> - 4 processes × 48 threads
> - or some other layout?

4 × 48 should be a bit faster, I assume.
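In Slurm terms, that layout would be something like the following (a sketch assembled from the scripts above, with the same variables; untested):

```bash
#!/bin/bash
# Sketch: 4 DIAMOND workers per 192-core node, 48 threads each
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=48

srun diamond blastp \
    -d "$DB" -q "$QUERY" \
    -f 6 qseqid sseqid corrected_bitscore \
    --approx-id 30 --query-cover 90 -k 1000 -c 1 --fast \
    --multiprocessing \
    --parallel-tmpdir "$PTMP" --tmpdir "$TMP" \
    --block-size 2 \
    --threads 48 \
    -o "$OUTPREFIX"
```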
> 2. Parallelization behavior
> With --block-size 2, the large database is split into ~150 reference blocks. In the log we only see sequential messages such as:
> Processing query block 4, reference block 32/150
> Processing query block 4, reference block 33/150
> Processing query block 4, reference block 34/150
> ...
> From the log output, it looks as if each query–reference block pair is processed one after another, and we do not see any explicit indication of multiple blocks being processed in parallel (either across queries or references).
If all your worker processes write to the same log, this output is expected. They should be working on different block combinations in parallel; if it is really only one block after the other, then something is going wrong and you are only seeing output from one worker.
> Could you clarify how multiprocessing is applied internally in DIAMOND?
> - Is parallelism primarily over query blocks, reference blocks, or both?
Both; each worker will fetch the next reference block for the current query block and, when that is done, start the next query block.
> - Are multiple reference blocks for the same query block processed concurrently, or is the parallelism mainly within a single block (e.g. over threads)?
Yes, the worker processes handle different block combinations in parallel.
> - Is the apparent sequential logging expected even when computations are running in parallel?
You should see the logs of many blocks interleaved when all workers write to the same log.
> A better understanding of how block-level parallelization is implemented (and how it is reflected in the logs) would help us choose an appropriate job partitioning strategy on our cluster.
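One way to make the per-worker progress visible is to give each task its own log file (a sketch; %t is srun's task-rank placeholder, variables as in the batch scripts above):

```bash
# One log file per worker task instead of a single interleaved log
srun --output=diamond_worker_%t.log diamond blastp \
    -d "$DB" -q "$QUERY" \
    -f 6 qseqid sseqid corrected_bitscore \
    --multiprocessing --parallel-tmpdir "$PTMP" \
    -o "$OUTPREFIX"
```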
> 3. Parameter tuning
> For multi-billion-sequence databases, do you think performance is strongly affected by parameters such as:
> - --block-size
A higher block size can help, but probably not by much.
> - --parallel-tmpdir / --tmpdir and filesystem layout
> - SLURM settings like --ntasks-per-node and --cpus-per-task?
No, probably not.
> Any suggestions for optimizing DIAMOND on very large protein databases would be extremely helpful for us.
For extension-heavy computations like this you could try the experimental --anchored-swipe option (also use the latest release for this). Native compilation from source can also help a bit.
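Applied to the run above, that would look roughly like this (a sketch, same variables as before):

```bash
# Experimental anchored-SWIPE extension; use a recent DIAMOND release
srun diamond blastp \
    -d "$DB" -q "$QUERY" \
    -f 6 qseqid sseqid corrected_bitscore \
    --approx-id 30 --query-cover 90 -k 1000 -c 1 --fast \
    --multiprocessing --parallel-tmpdir "$PTMP" --tmpdir "$TMP" \
    --block-size 2 --anchored-swipe \
    -o "$OUTPREFIX"
```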
> Thank you again for your time and for maintaining this great tool.
Thanks for getting back to us earlier, @bbuchfink.
We would like to report two types of errors we encountered when running DIAMOND with multiprocessing.
1. Multiprocessing with 10 nodes fails to merge final results of a query block
When running with 10 nodes, all worker processes start correctly and write to the shared log (as expected, see log1). However, near the end of processing query block 1, the run fails with errors such as stoull and segmentation faults. An excerpt of log1 is shown below:
```
Processing query block 1, reference block 300/300, shape 1/1.
Building reference seed array... [0.404s]
Building query seed array... [0.604s]
Computing hash join... [0.58s]
Masking low complexity seeds... [0.146s]
Searching alignments... [0.182s]
Deallocating memory... [0s]
Deallocating buffers... [0.033s]
Clearing query masking... [0.442s]
Opening temporary output file... [0.001s]
Computing alignments... [17.753s]
Deallocating reference... [0.011s]
Computing alignments... [21.279s]
Deallocating reference... [0.014s]
Error: stoull
srun: error: cpui185: task 3: Exited with exit code 1
Masking reference... [0.57s]
Initializing dictionary... [0.021s]
Initializing temporary storage... [0.024s]
Building reference histograms... [0.719s]
Allocating buffers... [0s]
Processing query block 1, reference block 300/300, shape 1/1.
Building reference seed array... [0.39s]
Building query seed array... [0.384s]
Computing hash join... [0.543s]
Masking low complexity seeds... [0.172s]
Searching alignments... [0.187s]
Deallocating memory... [0s]
Deallocating buffers... [0.043s]
Clearing query masking... [0.492s]
Opening temporary output file... [0.002s]
Computing alignments... [16.077s]
Deallocating reference... [0.007s]
Error: stoull
srun: error: cpui181: task 0: Exited with exit code 1
Computing alignments... [18.423s]
Deallocating reference... [0.008s]
srun: error: cpui183: task 1: Segmentation fault (core dumped)
Thu Nov 27 13:32:30 CST 2025
```
This suggests that the workers finish their tasks but fail when merging results.
2. Multiprocessing with 2 nodes automatically falls back to single-node execution
When using 2 nodes, the run starts with two worker processes, but later query blocks switch to single-worker mode automatically (log2). This appears to happen without any explicit error message.
If DIAMOND multiprocessing cannot be used reliably in our environment, would splitting the query FASTA file manually into multiple chunks and running DIAMOND on each chunk in parallel produce results identical to running the entire query file without splitting?
We would like to confirm whether manual parallelization is a safe alternative.
I will try to fix the errors. Yes, you can just split the query file and get the same results. You can also manually split the database file and merge the results; this will not produce identical results unless the block sizes exactly match those the aligner produces, but the results will not be any less correct either. When doing this, you should set the database size in letters using --dbsize, because it is used for computing e-values.
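A sketch of the manual query split (using seqkit, which already appears later in this thread; the chunk count and file names are arbitrary):

```bash
# Split the query into 10 chunks and align each chunk independently;
# tabular outputs can simply be concatenated afterwards
seqkit split2 "$QUERY" -p 10 -O query_chunks/

for CHUNK in query_chunks/*.faa; do
    diamond blastp -d "$DB" -q "$CHUNK" \
        -f 6 qseqid sseqid corrected_bitscore \
        --approx-id 30 --query-cover 90 -k 1000 --fast \
        -o "${CHUNK%.faa}.out" &
done
wait

# If the database is split as well, pass the full database size in
# letters (e.g. --dbsize 300000000000) so e-values stay comparable
cat query_chunks/*.out > full_results.tsv
```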
Dear Dr. Buchfink (@bbuchfink),
We have completed an all-vs-all alignment for a large dataset of approximately 3 billion proteins, and we are now attempting to determine the representative proteins using the greedy vertex cover algorithm. The command we used is as follows:
```bash
seqkit faidx round_1_representive.faa
DB="round_1_representive.faa.fai"
EDGES="./out.tsv"
OUT="./clusters_round_2.tsv"
diamond greedy-vertex-cover \
    --edges "$EDGES" \
    -d "$DB" \
    --edge-format triplet \
    --threads 192 \
    -o "$OUT"
```
However, we encountered the following error, which seems to suggest that the input protein database ("round_1_representive.faa.fai") and the edge file ("out.tsv", approximately 3.3 TB) are too large:
```
diamond version 2.1.13
===== Diamond greedy-vertex-cover starting =====
Thu Dec 4 09:27:48 CST 2025
diamond v2.1.13.167 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support, and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
#CPU threads: 192
Coverage cutoff: 80%
Reading mapping file... [1726.85s]
#OIds: 3368888035
Counting input lines... [1353.48s]
#Lines: 61307535384
Allocating memory... [0s]
Reading input lines... [1120.68s]
Making flat array... [146.881s]
/var/spool/slurmd/job804556/slurm_script: line 40: 3077823 Segmentation fault (core dumped) diamond greedy-vertex-cover --edges $EDGES -d $DB --edge-format triplet --threads 192 -o $OUT
```
Could you suggest any methods to handle such large files, or recommend alternative approaches for large-scale network clustering, such as HipMCL (https://bitbucket.org/azadcse/hipmcl/wiki/Home)?
Thank you for your assistance.
@linweiliarchaea It should be possible to fix this by adding --connected-component-depth 0 to the greedy-vertex-cover call.
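Applied to the call above:

```bash
# Same call as above, with the suggested flag added
diamond greedy-vertex-cover \
    --edges "$EDGES" \
    -d "$DB" \
    --edge-format triplet \
    --threads 192 \
    --connected-component-depth 0 \
    -o "$OUT"
```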