MMseqs2 Can't create index nt database

Expected Behavior

The nt database is indexed.

Current Behavior

Indexing the nt database never finishes.

Steps to Reproduce (for bugs)

mmseqs createdb nt.fa nt -v 3
mmseqs createtaxdb nt tmp --threads 8 --tax-mapping-file ${uncompress_dir}/taxidmapping --ncbi-tax-dump ${ncbi-tax-dump} -v 3
mmseqs createindex nt tmp --threads 8 --split-memory-limit 200G --search-type 2 -v 3

MMseqs Output (for bugs)

The link below contains the mmseqs output and the strace output captured while the program hangs (in the mmseqs indexdb command): https://gist.github.com/braffes/022572a4d9506f8910b281864a459ede

Context

The first and the second step work as expected, but the last step seems to never end. It is blocked on this command:

mmseqs indexdb tmp/16033012438524647487/orfs_aa nt --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --alph-size nucl:5,aa:21 --comp-bias-corr 1 --max-seq-len 65535 --max-seqs 300 --mask 1 --mask-lower-case 0 --spaced-kmer-mode 1 -s 7.5 --k-score 0 --check-compatible 0 --search-type 2 --split 0 --split-memory-limit 200G -v 3 --threads 8

One core is always at 100% CPU, but nothing happens.

I tried running this step both through the scheduler (Slurm) and locally, with the same result.

After a few attempts, I reduced the size of the FASTA file (from 385 GB to 172 GB), and it worked. Could this be a scaling problem? If so, could it be related to the DBReader type being unsigned int in indexdb.cpp?

Your Environment

The problem occurs with both of the following versions: MMseqs2/12-113e3 and MMseqs2/13-45111.
Hardware: 96 cores, 2 TB RAM.
OS: CentOS Linux release 8.3.2011.

Mar 16 '21 11:03 braffes

Could you attach gdb and see where it's stuck?

# attach to process
gdb -p PID
# interrupt the process with ctrl+c
# wait for prompt then run
bt
# copy paste output

Mar 16 '21 12:03 milot-mirdita

I hope it will help you.

(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fff53ed22c8: 
#0  0x00000000004dfe25 in DBReader<unsigned int>::sortIndex(bool) ()
Backtrace stopped: Cannot access memory at address 0x7fff53ed22c8

Mar 16 '21 13:03 braffes

I'll try to reproduce it locally. This stack trace is pretty surprising as that function should be parallelized and pretty quick to execute. Could you recompile with cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo and execute it again to see where exactly it is stuck?

Is the database on a (slow?) network share?

Mar 17 '21 14:03 milot-mirdita

I'm pretty sure I know what's going on. A tblastx-style search against nt produces more than 4B fragments (slightly over 7B). MMseqs2 can search against at most 4B (UINT_MAX) fragments at a time. You could use the --min-length parameter to increase the minimum fragment length and cut down on the number of ORFs produced. By default, it extracts fragments at least 30 codons (30*3 = 90 nucleotides) long.

Getting around this limitation is a longer-term goal; we haven't really decided how to tackle it yet.
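
The limit described above comes down to simple arithmetic. The following C++ sketch is illustrative only and is not taken from the MMseqs2 source: the helper names (fitsIn32BitKeys, minFragmentNucleotides) are hypothetical, the 7B figure is the approximate count quoted above, and it assumes the limit is inclusive of UINT_MAX.

```cpp
#include <cstdint>
#include <limits>

// Largest value of a 32-bit unsigned integer (UINT_MAX on common platforms).
constexpr std::uint64_t kMaxFragments = std::numeric_limits<std::uint32_t>::max();

// Hypothetical helper: can `fragments` entries be addressed with 32-bit keys?
// Assumption: the limit is inclusive of UINT_MAX.
constexpr bool fitsIn32BitKeys(std::uint64_t fragments) {
    return fragments <= kMaxFragments;
}

// Hypothetical helper: fragment length in nucleotides for a length given in codons.
constexpr std::uint64_t minFragmentNucleotides(std::uint64_t codons) {
    return codons * 3;
}

// Compile-time checks of the numbers mentioned in the thread.
static_assert(kMaxFragments == 4294967295ULL, "UINT_MAX for a 32-bit unsigned int");
static_assert(!fitsIn32BitKeys(7000000000ULL), "about 7B fragments do not fit");
static_assert(minFragmentNucleotides(30) == 90, "default of 30 codons is 90 nucleotides");
```

Because every check is a static_assert, compiling the file is enough to verify the arithmetic; nothing needs to be run.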

Mar 18 '21 15:03 milot-mirdita

Alternatively, you can use a nucl-nucl search with --search-type 3, so the database size doesn't explode.

Mar 18 '21 15:03 milot-mirdita

Sorry for the delay in answering. Thank you for the suggestions on how to avoid the problem.

Here is the output from a build with -DCMAKE_BUILD_TYPE=RelWithDebInfo:

#0  DBReader<unsigned int>::sortIndex (this=this@entry=0x7ffe5bd0a970, isSortedById=isSortedById@entry=false) at /path/to/MMseqs2/src/commons/DBReader.cpp:249
#1  0x00000000004fbc6b in DBReader<unsigned int>::open (this=this@entry=0x7ffe5bd0a970, accessType=accessType@entry=0) at /path/to/MMseqs2/src/commons/MemoryTracker.h:13
#2  0x00000000005c1c81 in indexdb (argc=<optimized out>, argv=<optimized out>, command=...) at /path/to//MMseqs2/src/util/indexdb.cpp:64
#3  0x0000000000471e20 in runCommand (p=0x2394310, argc=argc@entry=36, argv=argv@entry=0x7ffe5bd0af48) at /path/to/MMseqs2/src/commons/Application.cpp:40
#4  0x0000000000460455 in main (argc=38, argv=0x7ffe5bd0af38) at /path/to/MMseqs2/src/commons/Application.cpp:203

As you said, it is an issue with the maximum size: the size is greater than 4294967295 (the maximum of an unsigned int), so it becomes an infinite loop.

Do you have a solution in mind for this problem as part of that long-term goal?
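
One general way a count above 4294967295 can turn into an infinite loop is 32-bit wraparound. The sketch below is a generic illustration and not the code at DBReader.cpp:249, which is not shown in this thread; the function names (increment32, truncateTo32, loopTerminates) are hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Advance a 32-bit counter by one. Unsigned arithmetic wraps modulo 2^32,
// so the maximum value is followed by 0 instead of a larger number.
constexpr std::uint32_t increment32(std::uint32_t i) {
    return static_cast<std::uint32_t>(i + 1u);
}

// What remains of a 64-bit count once it is stored in a 32-bit variable.
constexpr std::uint32_t truncateTo32(std::uint64_t n) {
    return static_cast<std::uint32_t>(n);
}

// Would `for (std::uint32_t i = 0; i < bound; ++i)` ever stop?
// Only if the counter can reach `bound`, i.e. bound fits in 32 bits.
constexpr bool loopTerminates(std::uint64_t bound) {
    return bound <= std::numeric_limits<std::uint32_t>::max();
}

static_assert(increment32(4294967295u) == 0u, "counter wraps to zero");
static_assert(truncateTo32(7000000000ULL) == 2705032704u, "7B mod 2^32");
static_assert(!loopTerminates(7000000000ULL), "a 32-bit counter never reaches 7B");
```

With a bound of roughly 7B, the counter wraps back to 0 after 4294967295 and the loop condition stays true forever, which is consistent with one core sitting at 100% CPU.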

Nov 15 '21 10:11 braffes

Hi, I was just wondering if there has been any progress on a solution for this? I have to cluster 36B genes. I have some ideas on how to work around this problem, but it would be a lot cleaner if I didn't have to divide and conquer, as it were. If I can help in any way, I will!

Alternatively if you had any suggestions for a workaround it would be very appreciated.
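
For a rough sense of scale for a divide-and-conquer approach, the sketch below computes the minimum number of pieces needed so that each piece stays within the 32-bit limit. It is arithmetic only: chunksNeeded is a hypothetical helper, not an MMseqs2 feature, 36B is taken to mean 36 × 10^9 entries, and it says nothing about how to merge clustering results across pieces.

```cpp
#include <cstdint>
#include <limits>

// Largest entry count addressable with 32-bit unsigned keys (UINT_MAX).
constexpr std::uint64_t kMaxEntries = std::numeric_limits<std::uint32_t>::max();

// Hypothetical helper: smallest number of chunks such that no chunk holds
// more than kMaxEntries entries (ceiling division).
constexpr std::uint64_t chunksNeeded(std::uint64_t entries) {
    return (entries + kMaxEntries - 1) / kMaxEntries;
}

static_assert(chunksNeeded(36000000000ULL) == 9, "36B entries need at least 9 chunks");
static_assert(chunksNeeded(7000000000ULL) == 2, "about 7B fragments need at least 2 chunks");
```

This is a lower bound; a practical split would likely use more, smaller pieces to leave headroom for memory and runtime.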

Oct 19 '23 16:10 fullama
