Prefilter process is Killed during nucleotide search (--search-type 3)
Dear MMseqs2 team,
I am unable to successfully run nucleotide-vs-nucleotide searches for taxonomic annotation.
Environment:
MMseqs2 Version: 18.8cc5c
OS: Linux (HPC environment)
Installation method: Conda
Bug Description
When performing a nucleotide-vs-nucleotide search (--search-type 3) with a set of assembled contigs against the ref_prok_rep_genomes database, the prefilter subprocess is terminated with a Killed signal.
I have observed this exact behavior with two different approaches, both following official documentation:
- Using the mmseqs easy-taxonomy workflow.
- Using an explicit, modular workflow (mmseqs createdb -> mmseqs search -> mmseqs lca).
Could this behavior be related to the problems discussed in Issue #932?
Database Preparation
For full context, the target MMseqs2 database was created from a local BLAST database (ref_prok_rep_genomes) and the NCBI taxdump, following the standard procedure outlined in the MMseqs2 User Guide.
# 1. Download NCBI taxdump
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir taxonomy && tar -xzf taxdump.tar.gz -C taxonomy
# 2. Extract FASTA and mapping file from BLAST DB
blastdbcmd \
-db ref_prok_rep_genomes \
-entry all > ref_prok_rep_genomes.fna
blastdbcmd \
-db ref_prok_rep_genomes \
-entry all \
-outfmt "%a %T" > ref_prok_rep_genomes.taxidmapping
# 3. Create the MMseqs2 sequence database
mmseqs createdb \
ref_prok_rep_genomes.fna \
ref_prok_rep_genomes_db \
--dbtype 2
# 4. Create the final taxonomically-annotated database
mmseqs createtaxdb \
ref_prok_rep_genomes_db \
tmp_taxdb \
--ncbi-tax-dump taxonomy/ \
--tax-mapping-file ref_prok_rep_genomes.taxidmapping
Steps to Reproduce
The query is a standard set of metagenomic contigs.
# Step 1: Create query database
mmseqs createdb \
path/to/contigs.fna \
path/to/queryDB \
--compressed 1 \
--dbtype 2
# Step 2: Perform nucleotide search
mmseqs search \
path/to/queryDB \
path/to/ref_prok_rep_genomes_db \
path/to/search_results.db \
path/to/tmp_dir \
--split-memory-limit 250G \
--max-seq-len 300000000 \
--search-type 3 \
-s 4.0 \
--compress 1
# Step 3: LCA
mmseqs lca \
path/to/ref_prok_rep_genomes_db \
path/to/search_results.db \
path/to/lca.db \
--tax-lineage 1
# Step 4: Create TSV report
mmseqs createtsv \
path/to/queryDB \
path/to/lca.db \
path/to/tax.tsv \
--compressed 1
# Step 5: Generate Kraken-style report
mmseqs taxonomyreport \
path/to/ref_prok_rep_genomes_db \
path/to/lca.db \
path/to/tax.report \
--report-mode 0
# Step 6: Generate Krona report
mmseqs taxonomyreport \
path/to/ref_prok_rep_genomes_db \
path/to/lca.db \
path/to/tax.html \
--report-mode 1
Observed Behavior
The workflow fails during the prefilter step. The log output shows that the process is Killed after estimating memory consumption and starting the first of three prefiltering steps.
Query database size: 19348 type: Nucleotide
Target split mode. Searching through 3 splits
Estimated memory consumption: 222G
Target database size: 1102829 type: Nucleotide
The output of the prefilter cannot be compressed during target split mode. Prefilter result will not be compressed.
Process prefiltering step 1 of 3
Index table k-mer threshold: 0 at k-mer size 15
Index table: counting k-mers
[=================================================================]
/path/to/blastp.sh: line 144: 1652760 Killed $RUNNER "$MMSEQS" prefilter "$INPUT" "$TARGET" "$TMP_PATH/pref_$STEP" $PREFILTER_PAR -s "$SENS"
Error: Prefilter died
Error: Search step died
Further Questions
Could you clarify the behavior of the --compress 1 flag? Is it safe to use this flag at every possible step (createdb, search, etc.)?
What are the best practices for nucleotide-vs-nucleotide searches?
Thank you for your help!
It seems like this process was killed by the OS due to memory exhaustion (OOM). Did you allocate enough resources for the job?
Hi @martin-steinegger,
Thank you for the quick reply!
Following up on your suggestion, I have conducted further tests regarding the memory allocation.
My initial attempts on standard batch nodes with 250 GB of RAM failed with a Segmentation fault (core dumped) immediately after the prefilter step completed. To investigate further, I ran the analysis with a single FASTQ read against the core_nt database on a large-memory node with 3 TB of RAM.
On the 3 TB node, the prefilter step again completed successfully, splitting the target database into 3 chunks as expected. However, the process then failed at the exact same point with the same Segmentation fault.
The error from the log is:
.../search_tmp/.../blastp.sh: line 144: 2380492 Segmentation fault (core dumped) $RUNNER "$MMSEQS" "${ALIGN_MODULE}" "$INPUT" "$TARGET${ALIGNMENT_DB_EXT}" "$TMP_PATH/pref_$STEP" "$3" $ALIGNMENT_PAR
Error: Alignment died
Error: Search step died
The job's peak memory usage was approximately 2 TB, which occurred during the prefilter stage. As a control, the same workflow runs successfully against the ref_prok_rep_genomes database, with a peak memory usage of ~660 GB. Changing the MMseqs2 temporary directory location did not alter the outcome.
Since the prefilter completes and the failure occurs in the alignment module across different memory configurations, I have the following questions:
- Could the segmentation fault in the alignment step be caused by an issue other than memory, such as a buffer overflow from long sequences?
- Is there a recommended method for splitting the database or the results for the alignment step to manage its resource requirements, separate from the prefilter splitting?
- Are there specific parameters for the alignment module that should be adjusted when working with a database as large as core_nt?
Thank you for your help!
Thank you for the update. The crash now appears to be caused by a bug. I assume it's due to the length of the queries. If the metagenomic contigs are too long, then you are effectively performing a genome-to-genome alignment, which we do not support. You could cut the queries into smaller pieces and try to search them.
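The suggested chunking can be done with standard tools. Below is a minimal sketch using plain awk; the demo input, the `_partN` naming, and the tiny window size `W=8` are illustrative choices only (in practice a window on the order of 100 kbp might be sensible, and dedicated tools such as seqkit offer the same functionality):

```shell
# Minimal FASTA chunker: split every sequence into non-overlapping W bp pieces.
# Demo input and W=8 are for illustration; replace with your real contigs and window.
printf '>contig1\nACGTACGTACGTACGTACGT\n' > contigs.fna

awk -v W=8 '
  function flush(  i, n) {
    if (hdr == "") return
    n = length(seq)
    # Emit each W bp window as its own record, numbered _part1, _part2, ...
    for (i = 1; i <= n; i += W)
      printf(">%s_part%d\n%s\n", hdr, int((i - 1) / W) + 1, substr(seq, i, W))
  }
  /^>/ { flush(); hdr = substr($0, 2); seq = ""; next }
  { seq = seq $0 }
  END { flush() }
' contigs.fna > contigs.chunked.fna

cat contigs.chunked.fna
```

The chunked FASTA can then be fed to mmseqs createdb in place of the original contigs.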
To really understand the current issue, we need to isolate the specific query that triggers the crash. One approach is to offset the .index file of the query database and progressively reduce its size: for example, start with the first 25% of its lines; if the error still occurs, cut it down to 12.5%, and so on, until we find the problematic entry. Since you already have a prefilter result, you can restart the computation for each run by re-issuing your original command with --force-reuse added:
mmseqs search \
/dev/shm/mmseqs_tmp_mg_22110148-c629-4085-8c6e-8f602600b7da/queryDB \
/mnt/aiongpfs/projects/shared_lih/data_transfer/mmseqs2/core_nt/core_nt \
/dev/shm/mmseqs_tmp_mg_22110148-c629-4085-8c6e-8f602600b7da/search_results.db \
/dev/shm/mmseqs_tmp_mg_22110148-c629-4085-8c6e-8f602600b7da/search_tmp \
--threads 112 \
--split-memory-limit 0 \
--max-seq-len 300000000 \
--search-type 3 \
-s 4.0 \
--compressed 1 \
-v 3 \
--force-reuse
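Mechanically, that bisection of the query .index can look like the sketch below. It runs on a synthetic 8-entry index (a real queryDB.index line holds an id, a byte offset, and a length, tab-separated); the actual rerun of the search with --force-reuse after each truncation is elided:

```shell
# Build a synthetic queryDB.index stand-in (real lines are "id<TAB>offset<TAB>length").
printf '%s\n' 0 1 2 3 4 5 6 7 | awk '{ printf "%d\t%d\t%d\n", $1, $1 * 100, 100 }' > queryDB.index

cp queryDB.index queryDB.index.bak        # keep the full index safe

# Halve the index: keep the first 50% of entries, then rerun the search.
total=$(wc -l < queryDB.index.bak)
half=$(( (total + 1) / 2 ))
head -n "$half" queryDB.index.bak > queryDB.index

# If the crash persists, halve again (head -n on the surviving half);
# if it disappears, test the other half (tail -n) instead.
wc -l < queryDB.index
```

Restoring queryDB.index from the backup afterwards returns the database to its original state.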
A quick clarifying question to ensure I'm on the right track.
You suggested the crash might be due to long metagenomic contigs in the query. However, I reproduced the exact same segmentation fault during the alignment step when my query was just a single 142 bp FASTQ read.
In contrast, the core_nt target database does contain extremely long sequences. For context, here are its stats:
- Total sequences: ~117 million
- Total length: ~894 Gbp
- Average length: ~7.6 kbp
- Maximum length: ~100 Mbp
Given this, could the bug be triggered by the length of the target sequences that are being aligned against, rather than the length of the query itself? I want to confirm whether I should focus my debugging efforts on splitting the query database, as you suggested, or if the issue might be with specific long entries in the target database.
But indeed, I do also plan to classify metagenomic contigs, which can be very long.
Thanks again for your help!