MMseqs2 Wrong Type Database Error

Expected Behavior

The search module should work on the protein sequence FASTA file the same way easy-search does.

Current Behavior

Only easy-search works on the protein sequence FASTA file; search fails with a wrong-type error.

Steps to Reproduce (for bugs)

mmseqs search /mount-nfs/mydataset/single_train_sequences.fasta /mount-nfs/unierf100/uniref100.fasta /mount-nfs/mmseq_single/alRes.m8 ./tmp

MMseqs Output (for bugs)

MMseqs Version:                         14.7e284
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out
Add backtrace                           false
Alignment mode                          2
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Compositional bias                      1
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          substitution:1.100,context:1.400
Pseudo count b                          substitution:4.100,context:5.800
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Correlation score weight                0
Gap open cost                           aa:11,nucl:5
Gap extension cost                      aa:1,nucl:2
Zdrop                                   40
Threads                                 240
Compressed                              0
Verbosity                               3
Seed substitution matrix                aa:VTML80.out,nucl:nucleotide.out
Sensitivity                             5.7
k-mer length                            0
k-score                                 seq:2147483647,prof:2147483647
Alphabet size                           aa:21,nucl:5
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                0
Minimum diagonal score                  15
Selected taxa
Spaced k-mers                           1
Spaced k-mer pattern
Local temporary path
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.1
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Use filter only at N seqs               0
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0.0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Pseudo count mode                       0
Gap pseudo count                        10
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Search iterations                       1
Start sensitivity                       4
Search steps                            1
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner
Force restart with latest tmp           false
Remove temporary files                  false

Input database "/mount-nfs/mydataset/single_train_sequences.fasta" has the wrong type (Generic)
Allowed input:
- Index
- Nucleotide
- Profile
- Aminoacid

Context

I am trying to extract the PSSM for a large FASTA file following these steps: https://github.com/soedinglab/MMseqs2/issues/580 Unfortunately, only easy-search works, not search. If I use easy-search alone, I get the same message as above when I try to run "result2profile".

I also tried extracting a single sequence from my FASTA file and got the same error. Here is the single-sequence FASTA file I am testing with:

>A0A8I6GHU0 tr|A0A8I6GHU0|A0A8I6GHU0_RAT U6 snRNA-associated Sm-like protein LSm1 OS=Rattus norvegicus OX=10116 GN=Lsm1 PE=3 SV=1
HCISSLKLTAFFKRSFLLSPEKHLVLLRDGRTLIGFLRSIDQFANLVLHQTVERIHVGRK
YGDIPRGIFVVRGENVVLLGEIDLEKESDTPLQQVSIEEILEEQRVEQQSRLEAEKLKVQ
ALKDRGLSIPRADTLDEY

Your Environment

Which MMseqs version was used: latest version from conda.
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): the CPU supports both AVX2 and SSE, and the server has 450GB of memory
Operating system and version: Ubuntu 20.04.4 LTS

Jun 15 '23 19:06 agemagician

The non-easy modules are intended to be used with MMseqs2 databases only. To use search please call createdb on the FASTA files first:

mmseqs createdb /mount-nfs/mydataset/single_train_sequences.fasta qdb
mmseqs createdb /mount-nfs/unierf100/uniref100.fasta uniref100
mmseqs search qdb uniref100 res tmp ...

This is what the easy-search workflow does internally anyway.
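
A minimal sketch of the full chain from FASTA files to a PSSM file, using the module names mentioned in this thread (createdb, search, result2profile, profile2pssm). The function name `build_pssm`, its arguments, the `MMSEQS` variable, and the database naming scheme are hypothetical and only for illustration; the positional arguments follow the usual MMseqs2 pattern, but check `mmseqs <module> -h` for your version before relying on them.

```shell
#!/bin/sh
# Sketch only: FASTA -> MMseqs2 databases -> search -> profile -> PSSM.
# build_pssm and its argument names are hypothetical, not part of MMseqs2.
MMSEQS="${MMSEQS:-mmseqs}"   # path to the mmseqs binary (assumed to be on PATH)

build_pssm() {
    query_fasta=$1    # query sequences in FASTA format
    target_fasta=$2   # target sequences in FASTA format
    prefix=$3         # prefix for every database this function creates
    tmp_dir=$4        # scratch directory used by the search

    # Non-easy modules need databases, so convert both FASTA files first.
    "$MMSEQS" createdb "$query_fasta" "${prefix}_qdb" || return 1
    "$MMSEQS" createdb "$target_fasta" "${prefix}_tdb" || return 1

    # Search the query database against the target database.
    "$MMSEQS" search "${prefix}_qdb" "${prefix}_tdb" "${prefix}_res" "$tmp_dir" || return 1

    # Turn the search result into profiles, then write the PSSM file.
    "$MMSEQS" result2profile "${prefix}_qdb" "${prefix}_tdb" "${prefix}_res" "${prefix}_prof" || return 1
    "$MMSEQS" profile2pssm "${prefix}_prof" "${prefix}.pssm" || return 1
}
```

Calling `build_pssm query.fasta target.fasta run1 tmp` would then leave the PSSM output in `run1.pssm`. Each step stops the chain on failure, so a wrong-type error from an earlier step does not get buried under later ones.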

Jun 16 '23 01:06 milot-mirdita

I also recommend searching with as many queries as possible at once. Single-query searches are very slow (unless you use a very specialized setup similar to our server setups, which takes quite a bit more effort).

Jun 16 '23 01:06 milot-mirdita

Thanks a lot, @milot-mirdita, for your help. It worked. However, when I applied "result2profile" and "profile2pssm" to extract the PSSM, the output showed protein sequences that were not part of the query dataset.

If my understanding is correct, the column "Cns" should represent a protein sequence in the query dataset.

Jun 17 '23 19:06 agemagician

The Cns column stores the consensus sequence. It is not an actual sequence contained in the DB; it is computed from the profile.

Jun 18 '23 06:06 milot-mirdita

Perfect, thanks a lot for the explanation.

One last question: is the order of the sequences in the PSSM file the same as in the query file, or do I have to match the sequences using the first line of each PSSM?

This is the first few lines in the pssm file:

Query profile of sequence 133520
Pos	Cns	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
0	M	-2	-1	-3	-3	0	-1	-2	0	-3	0	8	-2	-2	-2	-2	-3	-2	0	0	-2
1	R	0	-4	-3	-2	-3	-2	-2	-4	0	-4	-3	-2	1	-1	7	-3	-3	-3	-2	-3
2	T	-2	-2	-2	-3	-3	-1	-3	-1	0	-3	0	0	-2	-1	-2	2	5	-1	-2	-3
3	V	-1	-2	-4	-5	0	-3	-4	2	-4	0	0	-4	-3	-4	-4	-4	-2	5	-2	-3
4	L	-2	-3	-4	-5	1	-3	-2	2	-4	3	0	-4	-4	-4	-3	-4	-2	0	0	5
5	A	3	5	-3	-3	-2	-1	-3	-1	-3	2	-1	-3	-2	-3	-3	1	-2	-1	-2	-4
6	E	-2	-4	3	4	-4	-2	0	-4	-2	-5	-3	-1	4	-1	-2	2	-1	-4	-3	-4
7	E	0	-4	2	3	-4	-2	-2	-4	3	-4	-3	-1	-2	0	0	1	-1	-2	-3	-4

As you can see, the first line seems to point to sequence number 133520.

Thanks again for all your help and support.

Jun 19 '23 08:06 agemagician
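
Rather than relying on output order, one option is to match each PSSM block by the number in its "Query profile of sequence" header. The sketch below rests on two assumptions that this thread does not confirm: that the number is the internal key of the query database, and that createdb wrote a tab-separated `.lookup` file next to the query database with the key in the first column and the accession in the second. The function names `pssm_ids` and `pssm_accessions` are hypothetical.

```shell
#!/bin/sh
# Print the ids from "Query profile of sequence <id>" headers, in file order.
pssm_ids() {
    awk '/^Query profile of sequence / { print $NF }' "$1"
}

# Pair each PSSM id with an accession.
# ASSUMPTION: the lookup file is tab-separated, column 1 = id, column 2 = accession.
pssm_accessions() {
    pssm_file=$1
    lookup_file=$2
    pssm_ids "$pssm_file" | awk -F '\t' '
        NR == FNR { acc[$1] = $2; next }                   # first input: lookup table
        { print $1 "\t" (($1 in acc) ? acc[$1] : "NA") }   # second input: PSSM ids
    ' "$lookup_file" -
}
```

For example, `pssm_accessions result.pssm qdb.lookup` would print one line per PSSM block with the id and the matching accession, or `NA` when the id is missing from the lookup file. Both file names here are placeholders.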
