mmseqs createtaxdb unexpectedly killed

Open charlesfoster opened this issue 1 year ago • 1 comments

I'm trying to use a clustered version of the NR database for taxonomy assignment but am running into some issues. Any assistance would be appreciated.

Expected Behavior

When running mmseqs createtaxdb db_name tmp --tax-mapping-file taxid.map, I would expect to successfully create a seqTaxDB as per here.

Current Behavior

The job begins but is unexpectedly killed (see mmseqs output section below).

MMseqs Output (for bugs)

cfos@pop-os:/data/clustered_nr$ mmseqs createtaxdb nr_rep_seq_db tmp --tax-mapping-file '/data/clustered_nr/nr_rep_seq_to_taxid.map' -v 3
Create directory tmp
createtaxdb nr_rep_seq_db tmp --tax-mapping-file /data/clustered_nr/nr_rep_seq_to_taxid.map -v 3 

MMseqs Version:        	2fad714b525f1975b62c2d2b5aff28274ad57466
NCBI tax dump directory	
Taxonomy mapping file  	/data/clustered_nr/nr_rep_seq_to_taxid.map
Taxonomy mapping mode  	0
Taxonomy db mode       	1
Threads                	20
Verbosity              	3

Download taxdump.tar.gz

02/01 11:29:59 [NOTICE] Downloading 1 item(s)
[#b8b044 0B/0B CN:1 DL:0B]                                                                                                                                                          
02/01 11:30:01 [NOTICE] Allocating disk space. Use --file-allocation=none to disable it. See --file-allocation option in man page for more details.
[#b8b044 51MiB/61MiB(84%) CN:1 DL:10MiB]                                                                                                                                            
02/01 11:30:08 [NOTICE] Download complete: tmp/taxdump.tar.gz

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
b8b044|OK  |   9.1MiB/s|tmp/taxdump.tar.gz

Status Legend:
(OK):download completed.
Loading nodes file ... Done, got 2550743 nodes
Loading merged file ... Done, added 75930 merged nodes.
Loading names file ... Done
Init RMQ ...Done
Killed

Context

I want to search some query sequences locally against a clustered version of the NR database. I downloaded the clustered database and taxonomy files (nr_cluster_taxid_formatted_final.tsv.gz, nr_rep_seq.fasta.gz) from here, which was created as per: https://research.arcadiascience.com/pub/resource-nr-clustering/release/3. I then processed the files like so:

gunzip -c nr_cluster_taxid_formatted_final.tsv.gz | cut -f1,2 > nr_rep_seq_to_taxid.map
mmseqs createdb nr_rep_seq.fasta.gz nr_rep_seq_db

After these completed successfully, I tried to create the taxdb as per the above:

mmseqs createtaxdb nr_rep_seq_db tmp --tax-mapping-file '/data/clustered_nr/nr_rep_seq_to_taxid.map' -v 3

But the job was killed.

Questions:

  • Was it likely killed because of exhausting my available RAM?
  • If so, is there a way to restrict it during taxdb creation? I tried out --split-memory-limit 50G but got Unrecognized parameter "--split-memory-limit"
  • Was it killed for a different reason, e.g. disk space?
  • The disk the database is stored on has 95 GB free at the moment, and the main db file (from mmseqs createdb) is 88.7GB. I haven't tried freeing up more space yet in case this is not the issue.
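The two possibilities (RAM versus disk) can usually be told apart from the system itself. A minimal sketch, assuming a Linux host such as the one described below and that the commands are run from the directory holding the database and tmp folder; reading the kernel log may require root, and the variable names are only illustrative:

```shell
# Sketch: gather evidence for "out of memory" versus "out of disk" (Linux assumed).

# 1. The kernel logs a line whenever the OOM killer ends a process.
#    Reading the log may require root; print a note if nothing is readable.
oom_lines=$(dmesg 2>/dev/null | grep -iE 'out of memory|oom-kill' || true)
echo "${oom_lines:-no OOM entries found (or kernel log not readable)}"

# 2. Total and currently available RAM, in kB.
mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
echo "RAM total: ${mem_total_kb} kB, available: ${mem_avail_kb} kB"

# 3. Free space (kB) on the filesystem holding the current directory.
disk_avail_kb=$(df -Pk . | awk 'NR==2 {print $4}')
echo "Disk available: ${disk_avail_kb} kB"
```

An OOM entry naming mmseqs right after the "Killed" message would point to memory; a nearly full filesystem with no such entry would point to disk.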

Your Environment

  • Git commit used: 2fad714b525f1975b62c2d2b5aff28274ad57466
  • Which MMseqs version was used: static (wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz)
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
      • CPU: Intel i9-10900X (20) @ 4.500GHz (AVX2 supported)
      • GPU: NVIDIA Quadro RTX 4000
      • Memory: 64GB
  • Operating system and version: Pop!_OS 22.04 LTS x86_64

charlesfoster, Feb 01 '24 01:02

Hello again,

I've been revisiting mmseqs for taxonomic assignment, and unwittingly ran into this problem again before finding my own GitHub issue (the circle of life!). I was just wondering whether by now there is any advice on creating a taxdb when RAM is limited? I'm working with a pre-clustered version of the NR database that is currently not available directly through mmseqs databases.

After the standard createdb command, I ran the following:

mmseqs createtaxdb nr_clustered_mmseqs ~/TMP  --ncbi-tax-dump ~/.taxonkit/ --tax-mapping-file /data/clustered_nr/clustered_nr_taxmapping.tsv

I get output as per the OP in this issue, until the process dies with:

[truncated]
Loading names file ... Done
Init RMQ ...Done
Killed

I can see that the problem was most likely the RAM being exhausted (I received exit status 137). My workstation has 64GB of RAM, and accessing a server with more RAM for the creation of this database is not likely to be feasible.
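For reference, a shell reports 128 plus the signal number when a process is ended by a signal, so 137 corresponds to signal 9 (SIGKILL), which is the signal the kernel's out-of-memory killer sends. A minimal sketch of that arithmetic, with the status value hard-coded from the run above:

```shell
# Decode an exit status above 128 into the signal that ended the process.
status=137   # exit status observed after the createtaxdb run

if [ "$status" -gt 128 ]; then
    sig=$((status - 128))        # 137 - 128 = 9
    signame=$(kill -l "$sig")    # signal name without the SIG prefix
    echo "terminated by signal $sig (SIG$signame)"
fi
```

SIGKILL on its own does not prove an out-of-memory kill, since anything can send it, but together with the memory-heavy step it died in, that is the most likely explanation.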

Thanks

P.S. In case you've missed it for any reason, I would also like to point out that the current automated download of the NR/NT FASTA files from NCBI using mmseqs databases might not work as desired in future. As noted at https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/README.txt:

In April 2024, the BLAST FASTA files in this directory will no longer be
available. You can easily generate FASTA files yourself from the formatted
BLAST databases by using the BLAST utility blastdbcmd that comes with the
standalone BLAST programs. See NCBI Insights for more details
https://ncbiinsights.ncbi.nlm.nih.gov/2024/01/25/blast-fasta-unavailable-on-ftp/
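The blastdbcmd route mentioned in the README could look roughly like the following. This is only a sketch: it assumes the standalone BLAST+ programs are installed and that the preformatted database has already been downloaded and unpacked; the function name and the nr / nr.fasta names are placeholders.

```shell
# Sketch: dump every sequence of a preformatted BLAST database as FASTA.
# Assumes the database volumes are already on disk (e.g. fetched with
# update_blastdb.pl from BLAST+). blastdb_to_fasta is a hypothetical helper name.
blastdb_to_fasta() {
    db=$1    # database name or path prefix, e.g. nr
    out=$2   # FASTA file to write

    if ! command -v blastdbcmd >/dev/null 2>&1; then
        echo "blastdbcmd not found; install the standalone BLAST+ programs" >&2
        return 1
    fi

    # -entry all selects every sequence; %f is the FASTA output format.
    blastdbcmd -db "$db" -entry all -outfmt '%f' -out "$out"
}

# Usage (not run here): blastdb_to_fasta nr nr.fasta
```

The resulting FASTA file could then be passed to mmseqs createdb in the same way as the downloaded files used earlier in this issue.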

charlesfoster, Jul 01 '24 05:07

We now have a convertblastdb module, which is integrated into the databases downloader. Additionally, I added ClusteredNR and core_nt to the downloader.

milot-mirdita, Nov 08 '25 06:11