MMseqs2
Wrong Type Database Error
Expected Behavior
The search module should work like easy-search on the protein sequence FASTA file
Current Behavior
Only easy-search works on the protein sequence FASTA file; search fails with a wrong-type database error
Steps to Reproduce (for bugs)
mmseqs search /mount-nfs/mydataset/single_train_sequences.fasta /mount-nfs/unierf100/uniref100.fasta /mount-nfs/mmseq_single/alRes.m8 ./tmp
MMseqs Output (for bugs)
MMseqs Version: 14.7e284
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace false
Alignment mode 2
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0
Coverage mode 0
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Threads 240
Compressed 0
Verbosity 3
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 5.7
k-mer length 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.1
Global sequence weighting false
Allow deletions false
Filter MSA 1
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Pseudo count mode 0
Gap pseudo count 10
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false
Input database "/mount-nfs/mydataset/single_train_sequences.fasta" has the wrong type (Generic)
Allowed input:
- Index
- Nucleotide
- Profile
- Aminoacid
Context
I am trying to extract the PSSM for a large FASTA file following these steps: https://github.com/soedinglab/MMseqs2/issues/580 Unfortunately, only easy-search works, not search. If I use just easy-search, I get the same message as above when I try to run the "result2profile" script.
I also tried extracting a single sequence from my FASTA file, and I got the same error. Here is the single-sequence FASTA file that I am testing with:
>A0A8I6GHU0 tr|A0A8I6GHU0|A0A8I6GHU0_RAT U6 snRNA-associated Sm-like protein LSm1 OS=Rattus norvegicus OX=10116 GN=Lsm1 PE=3 SV=1
HCISSLKLTAFFKRSFLLSPEKHLVLLRDGRTLIGFLRSIDQFANLVLHQTVERIHVGRK
YGDIPRGIFVVRGENVVLLGEIDLEKESDTPLQQVSIEEILEEQRVEQQSRLEAEKLKVQ
ALKDRGLSIPRADTLDEY
Your Environment
Include as many relevant details about the environment you experienced the bug in.
- Which MMseqs version was used: latest version from conda.
- Server specifications (especially CPU support for AVX2/SSE and amount of system memory): the CPU supports both AVX2 and SSE, and the server has 450 GB of memory
- Operating system and version: Ubuntu 20.04.4 LTS
The non-easy modules are intended to be used with MMseqs2 databases only. To use search, please call createdb on the FASTA files first:
mmseqs createdb /mount-nfs/mydataset/single_train_sequences.fasta qdb
mmseqs createdb /mount-nfs/unierf100/uniref100.fasta uniref100
mmseqs search qdb uniref100 res tmp ...
This is what the easy-search workflow does internally anyway.
Also, I recommend searching with as many queries as possible at once. Single-query searches are very slow (unless you use a very specialized setup, similar to our server setups, which takes quite a bit more effort).
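For reference, the database-based workflow above, extended with the two profile steps mentioned in this thread, can be sketched as a small Python helper that only assembles the command lines. This is a rough sketch, not an official recipe: the database names (qdb, targetdb, res, profile, profile.pssm) and the workdir layout are made up for illustration, and the argument order for result2profile and profile2pssm is an assumption that should be checked against `mmseqs result2profile -h` and `mmseqs profile2pssm -h` for the installed version.

```python
import shlex
import subprocess
from typing import List


def pssm_commands(query_fasta: str, target_fasta: str,
                  workdir: str, tmp: str = "tmp") -> List[List[str]]:
    """Build the mmseqs calls for FASTA -> databases -> search -> PSSM.

    All output names below are hypothetical placeholders.
    """
    qdb = f"{workdir}/qdb"
    tdb = f"{workdir}/targetdb"
    res = f"{workdir}/res"
    prof = f"{workdir}/profile"
    pssm = f"{workdir}/profile.pssm"
    return [
        # Non-easy modules need MMseqs2 databases, not raw FASTA files.
        ["mmseqs", "createdb", query_fasta, qdb],
        ["mmseqs", "createdb", target_fasta, tdb],
        ["mmseqs", "search", qdb, tdb, res, tmp],
        # Assumed argument order: queryDB targetDB resultDB profileDB.
        ["mmseqs", "result2profile", qdb, tdb, res, prof],
        # Assumed argument order: profileDB pssmFile.
        ["mmseqs", "profile2pssm", prof, pssm],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run: prints the five commands without calling mmseqs.
run_all(pssm_commands("queries.fasta", "targets.fasta", "work"))
```

Calling `run_all(..., dry_run=False)` would execute the steps in order and stop at the first failing command because of `check=True`.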
Thanks a lot, @milot-mirdita, for your help. It worked. However, when I applied "result2profile" and "profile2pssm" to extract the PSSM, the output showed protein sequences that were not part of the query dataset.
If my understanding is correct, the column "Cns" should represent a protein sequence in the query dataset.
The Cns column stores the consensus sequence. It is not an actual sequence contained in the DB; it is computed from the profile.
Perfect, thanks a lot for the explanation.
One last question: is the order of the sequences in the PSSM file the same as in the query file, or do I have to match the sequences using the first line of each PSSM?
These are the first few lines of the PSSM file:
Query profile of sequence 133520
Pos Cns A C D E F G H I K L M N P Q R S T V W Y
0 M -2 -1 -3 -3 0 -1 -2 0 -3 0 8 -2 -2 -2 -2 -3 -2 0 0 -2
1 R 0 -4 -3 -2 -3 -2 -2 -4 0 -4 -3 -2 1 -1 7 -3 -3 -3 -2 -3
2 T -2 -2 -2 -3 -3 -1 -3 -1 0 -3 0 0 -2 -1 -2 2 5 -1 -2 -3
3 V -1 -2 -4 -5 0 -3 -4 2 -4 0 0 -4 -3 -4 -4 -4 -2 5 -2 -3
4 L -2 -3 -4 -5 1 -3 -2 2 -4 3 0 -4 -4 -4 -3 -4 -2 0 0 5
5 A 3 5 -3 -3 -2 -1 -3 -1 -3 2 -1 -3 -2 -3 -3 1 -2 -1 -2 -4
6 E -2 -4 3 4 -4 -2 0 -4 -2 -5 -3 -1 4 -1 -2 2 -1 -4 -3 -4
7 E 0 -4 2 3 -4 -2 -2 -4 3 -4 -3 -1 -2 0 0 1 -1 -2 -3 -4
As you can see, the first line seems to point to sequence number 133520.
Thanks again for all your help and support.
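Whatever the output order turns out to be, matching on the header line instead of on position is straightforward. Below is a minimal Python sketch that splits a PSSM text dump like the one above into one entry per "Query profile of sequence" header. Two things are assumptions rather than confirmed behavior: that the number in the header is the key MMseqs2 assigned to the query when the database was created, and that the `.lookup` file written next to the query database has tab-separated key and accession columns. The function names are hypothetical.

```python
import re
from typing import Dict

HEADER_RE = re.compile(r"^Query profile of sequence\s+(\S+)")

# Two rows copied from the PSSM excerpt in this thread.
SAMPLE = """Query profile of sequence 133520
Pos Cns A C D E F G H I K L M N P Q R S T V W Y
0 M -2 -1 -3 -3 0 -1 -2 0 -3 0 8 -2 -2 -2 -2 -3 -2 0 0 -2
1 R 0 -4 -3 -2 -3 -2 -2 -4 0 -4 -3 -2 1 -1 7 -3 -3 -3 -2 -3
"""


def parse_pssm(text: str) -> Dict[str, dict]:
    """Return {header id: {"columns", "consensus", "scores"}}."""
    profiles: Dict[str, dict] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADER_RE.match(line)
        if match:
            # A new header starts a new profile entry.
            current = {"columns": [], "consensus": [], "scores": []}
            profiles[match.group(1)] = current
            continue
        if current is None:
            continue  # ignore anything before the first header
        fields = line.split()
        if fields[0] == "Pos":
            current["columns"] = fields[2:]  # residue letters A..Y
            continue
        if len(fields) < 3:
            continue
        current["consensus"].append(fields[1])
        current["scores"].append([int(x) for x in fields[2:]])
    return profiles


def read_lookup(text: str) -> Dict[str, str]:
    """Map key -> accession, assuming tab-separated .lookup lines."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


profiles = parse_pssm(SAMPLE)
print(list(profiles))
```

With the real files, `parse_pssm(open(pssm_path).read())` gives a dictionary keyed by the header number, and, if the assumption about the `.lookup` file holds, `read_lookup` turns those numbers back into the original FASTA accessions.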