Wrong Type Database Error

Open agemagician opened this issue 1 year ago • 5 comments

Expected Behavior

The search module should work on the protein sequence FASTA file, just as easy-search does.

Current Behavior

Only easy-search works on the protein sequence FASTA file; search stops with the error shown in the output below.

Steps to Reproduce (for bugs)

mmseqs search /mount-nfs/mydataset/single_train_sequences.fasta /mount-nfs/unierf100/uniref100.fasta /mount-nfs/mmseq_single/alRes.m8 ./tmp

MMseqs Output (for bugs)

MMseqs Version:                         14.7e284
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out
Add backtrace                           false
Alignment mode                          2
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Compositional bias                      1
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          substitution:1.100,context:1.400
Pseudo count b                          substitution:4.100,context:5.800
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Correlation score weight                0
Gap open cost                           aa:11,nucl:5
Gap extension cost                      aa:1,nucl:2
Zdrop                                   40
Threads                                 240
Compressed                              0
Verbosity                               3
Seed substitution matrix                aa:VTML80.out,nucl:nucleotide.out
Sensitivity                             5.7
k-mer length                            0
k-score                                 seq:2147483647,prof:2147483647
Alphabet size                           aa:21,nucl:5
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                0
Minimum diagonal score                  15
Selected taxa
Spaced k-mers                           1
Spaced k-mer pattern
Local temporary path
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.1
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Use filter only at N seqs               0
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0.0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Pseudo count mode                       0
Gap pseudo count                        10
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Search iterations                       1
Start sensitivity                       4
Search steps                            1
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner
Force restart with latest tmp           false
Remove temporary files                  false

Input database "/mount-nfs/mydataset/single_train_sequences.fasta" has the wrong type (Generic)
Allowed input:
- Index
- Nucleotide
- Profile
- Aminoacid

Context

I am trying to extract the PSSM for a big FASTA file following these steps: https://github.com/soedinglab/MMseqs2/issues/580 Unfortunately, only easy-search works, not search. If I use easy-search alone, I get the same message as above when I then run "result2profile".

I also tried extracting a single sequence from my FASTA file, and I got the same error. Here is the single-sequence FASTA file that I am testing with:

>A0A8I6GHU0 tr|A0A8I6GHU0|A0A8I6GHU0_RAT U6 snRNA-associated Sm-like protein LSm1 OS=Rattus norvegicus OX=10116 GN=Lsm1 PE=3 SV=1
HCISSLKLTAFFKRSFLLSPEKHLVLLRDGRTLIGFLRSIDQFANLVLHQTVERIHVGRK
YGDIPRGIFVVRGENVVLLGEIDLEKESDTPLQQVSIEEILEEQRVEQQSRLEAEKLKVQ
ALKDRGLSIPRADTLDEY

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Which MMseqs version was used: latest version from conda.
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory): the CPU supports both AVX2 and SSE, and the server has 450 GB of memory
  • Operating system and version: Ubuntu 20.04.4 LTS

agemagician, Jun 15 '23 19:06

The non-easy modules are intended to be used with MMseqs2 databases only. To use search please call createdb on the FASTA files first:

mmseqs createdb /mount-nfs/mydataset/single_train_sequences.fasta qdb
mmseqs createdb /mount-nfs/unierf100/uniref100.fasta uniref100
mmseqs search qdb uniref100 res tmp ...

This is what the easy-search workflow does internally anyway.

milot-mirdita, Jun 16 '23 01:06
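
The createdb-then-search steps above can be sketched as one shell function. This is only an illustration under stated assumptions: `search_fasta`, `run`, and the `DRY_RUN` switch are hypothetical helper names that are not part of MMseqs2, and `qdb`, `targetdb`, and `res` are placeholder database names. The final `convertalis` call is an addition that does not appear in the thread; it is included because `search` writes a result database, so a `.m8` text file like the one named in the original command needs a separate conversion step.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper: search_fasta QUERY.fasta TARGET.fasta OUT.m8 TMPDIR
search_fasta() {
  local query_fasta="$1" target_fasta="$2" out_m8="$3" tmp="$4"
  # 1. Convert both FASTA files into MMseqs2 databases (placeholder names).
  run mmseqs createdb "$query_fasta" qdb || return 1
  run mmseqs createdb "$target_fasta" targetdb || return 1
  # 2. Search the query database against the target database.
  run mmseqs search qdb targetdb res "$tmp" || return 1
  # 3. Convert the result database into a tab-separated .m8 file.
  run mmseqs convertalis qdb targetdb res "$out_m8"
}
```

Running `DRY_RUN=1 search_fasta query.fasta target.fasta hits.m8 tmp` prints the four commands without executing them, which is a cheap way to check the paths before starting a long search.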

I also recommend searching with as many queries as possible at once. Single-query searches are very slow (unless you build a very specialized setup, similar to our server setups, but those take quite a bit more effort).

milot-mirdita, Jun 16 '23 01:06

Thanks a lot, @milot-mirdita, for your help. It did work out. However, when I applied "result2profile" and "profile2pssm" to extract the pssm it showed protein sequences that were not part of the query dataset.

If my understanding is correct, the column "Cns" should represent a protein sequence in the query dataset.

agemagician, Jun 17 '23 19:06

The Cns column stores the consensus sequence. It is not an actual sequence contained in the DB; it is computed from the profile.

milot-mirdita, Jun 18 '23 06:06

Perfect, thanks a lot for the explanation.

One last question: is the order of the sequences in the PSSM file the same as in the query file, or do I have to match the sequences using the first line of each PSSM?

This is the first few lines in the pssm file:

Query profile of sequence 133520
Pos	Cns	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
0	M	-2	-1	-3	-3	0	-1	-2	0	-3	0	8	-2	-2	-2	-2	-3	-2	0	0	-2
1	R	0	-4	-3	-2	-3	-2	-2	-4	0	-4	-3	-2	1	-1	7	-3	-3	-3	-2	-3
2	T	-2	-2	-2	-3	-3	-1	-3	-1	0	-3	0	0	-2	-1	-2	2	5	-1	-2	-3
3	V	-1	-2	-4	-5	0	-3	-4	2	-4	0	0	-4	-3	-4	-4	-4	-2	5	-2	-3
4	L	-2	-3	-4	-5	1	-3	-2	2	-4	3	0	-4	-4	-4	-3	-4	-2	0	0	5
5	A	3	5	-3	-3	-2	-1	-3	-1	-3	2	-1	-3	-2	-3	-3	1	-2	-1	-2	-4
6	E	-2	-4	3	4	-4	-2	0	-4	-2	-5	-3	-1	4	-1	-2	2	-1	-4	-3	-4
7	E	0	-4	2	3	-4	-2	-2	-4	3	-4	-3	-1	-2	0	0	1	-1	-2	-3	-4

As you can see, the first line seems to point to sequence number 133520.

Thanks again for all your help and support.

agemagician, Jun 19 '23 08:06
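
One way to avoid depending on block order is to match each PSSM block by the number in its first line. The sketch below assumes that every block starts with a `Query profile of sequence <number>` line followed by a `Pos` header row, as in the excerpt above. It also assumes, which this thread does not confirm, that the number is the entry's key in the query database and that `createdb` wrote a tab-separated `.lookup` file next to it with the columns key, accession, and file number. `list_pssm_keys` and `key_to_accession` are hypothetical helper names.

```shell
# Hypothetical helper: list_pssm_keys PSSM_FILE
# Prints "<key> TAB <number of positions>" for every block in the file.
list_pssm_keys() {
  awk '
    /^Query profile of sequence / {
      if (key != "") print key "\t" n   # flush the previous block
      key = $NF; n = 0; next            # the key is the last field of the header
    }
    /^Pos/ { next }                     # skip the column header row
    NF > 0 && key != "" { n++ }         # count one row per sequence position
    END { if (key != "") print key "\t" n }
  ' "$1"
}

# Hypothetical helper: key_to_accession KEY LOOKUP_FILE
# Assumes the .lookup layout "key TAB accession TAB file number".
key_to_accession() {
  awk -F '\t' -v k="$1" '$1 == k { print $2 }' "$2"
}
```

For example, `key_to_accession 133520 qdb.lookup` would print the accession recorded for that key if the lookup file follows the assumed layout, and the position count from `list_pssm_keys` can be compared with the sequence length as a sanity check.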