MMseqs2
MMseqs search not finding exact and near-exact hits
Expected Behavior
Searching proteins against a database with similar and exact proteins (from bacterial refseq proteome) should return hits with similar and exact matches.
Current Behavior
Running mmseqs search returns few to no hits, whereas easy-search with the same parameters returns many more hits (the expected number).
Steps to Reproduce (for bugs)
For mmseqs search:
- create query and target databases with query_fasta and target_fasta
- run mmseqs search with a min-seq-id of 0.95, a coverage of 0.95, and coverage mode 0
- mmseqs convertalis
For mmseqs easy-search:
- run easy-search directly on the query and target fastas with the same search parameters
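The two workflows above can be sketched roughly as follows. This is an illustration, not the exact commands from the report: the database names (queryDB, targetDB, resultDB), temporary directories, and output file names are assumptions, and the thresholds are passed with the --min-seq-id, -c, and --cov-mode options.

```shell
# Assumed input names (taken from the attached archive); adjust paths as needed.
QUERY_FASTA="query_subset.faa"
TARGET_FASTA="406_subset.faa"
# Thresholds from the report: 0.95 identity, 0.95 coverage, coverage mode 0.
SEARCH_OPTS="--min-seq-id 0.95 -c 0.95 --cov-mode 0"

run_search() {
    # Step-by-step workflow: build both databases, search, convert to a table.
    mmseqs createdb "$QUERY_FASTA" queryDB &&
    mmseqs createdb "$TARGET_FASTA" targetDB &&
    mmseqs search queryDB targetDB resultDB tmp_search $SEARCH_OPTS &&
    mmseqs convertalis queryDB targetDB resultDB search_hits.m8
}

run_easy_search() {
    # One-step workflow directly on the FASTA files, same thresholds.
    mmseqs easy-search "$QUERY_FASTA" "$TARGET_FASTA" easy_hits.m8 tmp_easy $SEARCH_OPTS
}

# Only run when mmseqs and both input files are available.
if command -v mmseqs >/dev/null 2>&1 && [ -f "$QUERY_FASTA" ] && [ -f "$TARGET_FASTA" ]; then
    run_search && run_easy_search
    # Compare the number of reported hits per workflow.
    wc -l search_hits.m8 easy_hits.m8
else
    echo "mmseqs or input FASTA files not found; nothing was run" >&2
fi
```

Comparing the line counts of the two output tables gives a quick view of how many hits each workflow reports.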
MMseqs Output (for bugs)
MMseqs search output: https://gist.github.com/mcn3159/9a5ed05852e2e83b8656d25f0333a8f3
Context
I am searching a fasta of known bacterial proteins against the bacterial refseq WP proteome. I noticed that only half of my original virulence proteins (out of ~8000) had hits against refseq. Because the refseq proteome is large, I put together a minimal example in which there is an exact match (as well as similar matches, according to easy-search) between the target and query databases that mmseqs search does not find but easy-search does.
I can provide the larger fastas if more examples to replicate are necessary.
There are 2 fastas in the attached .zip file, each containing 4 proteins. One of those proteins is an exact match (same WP_number), and 2 proteins (WP_000633131.1 and WP_000633136.1) are very similar to the protein with the exact match.
fastas_to_search.zip (query fasta = query_subset.faa, target fasta = 406_subset.faa)
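To confirm which identifiers the two files share before searching, something like the following could be used. It is a minimal sketch: the helper names fasta_ids and shared_ids are hypothetical, and it compares header identifiers only (the first word after ">"), not the sequences themselves.

```shell
# Print the sorted, unique identifiers (first word of each header) of a FASTA file.
fasta_ids() {
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*//' | sort -u
}

# Print the identifiers present in both FASTA files.
shared_ids() {
    ids_a=$(mktemp)
    ids_b=$(mktemp)
    fasta_ids "$1" > "$ids_a"
    fasta_ids "$2" > "$ids_b"
    # comm -12 keeps only the lines common to both sorted inputs.
    comm -12 "$ids_a" "$ids_b"
    rm -f "$ids_a" "$ids_b"
}

# Run on the attached files when they are present.
if [ -f query_subset.faa ] && [ -f 406_subset.faa ]; then
    shared_ids query_subset.faa 406_subset.faa
fi
```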
Your Environment
- Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 15.6f452
- Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): conda
The trap is likely the sequence identity estimation (see https://github.com/soedinglab/MMseqs2/wiki#how-does-mmseqs2-compute-the-sequence-identity).
Adding -a or --alignment-mode 3 fixes the issue. easy-search is better at detecting when an exact sequence identity is required, whereas search estimates the sequence identity by default and only tries to detect when the exact value is needed.
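As a rough sketch of the suggested fix, the search could be rerun once with each option and the hit counts compared. This assumes queryDB and targetDB were already created with mmseqs createdb; the helper name rerun_with and the output names are hypothetical, and the thresholds are the ones from the report.

```shell
# Thresholds from the report: 0.95 identity, 0.95 coverage, coverage mode 0.
BASE_OPTS="--min-seq-id 0.95 -c 0.95 --cov-mode 0"

rerun_with() {
    # $1 = label used in output names; remaining arguments = extra search flag(s).
    label="$1"
    shift
    mmseqs search queryDB targetDB "resultDB_$label" "tmp_$label" $BASE_OPTS "$@" &&
    mmseqs convertalis queryDB targetDB "resultDB_$label" "hits_$label.m8"
}

# Only run when mmseqs and the assumed databases are available.
if command -v mmseqs >/dev/null 2>&1 && [ -e queryDB ] && [ -e targetDB ]; then
    rerun_with backtrace -a
    rerun_with mode3 --alignment-mode 3
    # Each table should now include the exact and near-exact matches.
    wc -l hits_backtrace.m8 hits_mode3.m8
else
    echo "mmseqs or the queryDB/targetDB databases not found; nothing was run" >&2
fi
```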