MMseqs search not finding exact and close-exact hits

Open · mcn3159 opened this issue 9 months ago · 1 comment

Expected Behavior

Searching proteins against a database with similar and exact proteins (from bacterial refseq proteome) should return hits with similar and exact matches.

Current Behavior

Running mmseqs search returns few to no hits, whereas easy-search returns many more hits (roughly the expected number).

Steps to Reproduce (for bugs)

For mmseqs search:

  • Create query and target databases from query_fasta and target_fasta
  • Run mmseqs search with a minimum sequence identity (min-seq-id) and a coverage of 0.95, using coverage mode 0
  • Run mmseqs convertalis

For mmseqs easy-search:

  • Run easy-search directly on the query and target fastas, with the same search parameters
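The two workflows above can be sketched as command lines. This is a minimal sketch rather than the exact commands from the run: the database names (queryDB, targetDB, resultDB), the tmp directory and the output file names are placeholders, and it assumes "coverage" means `-c 0.95` together with `--cov-mode 0`. By default the helper only prints the commands.

```python
import subprocess

# Thresholds from the report; reading "coverage" as -c 0.95 is an assumption.
FILTERS = ["--min-seq-id", "0.95", "-c", "0.95", "--cov-mode", "0"]


def search_workflow(query_fasta, target_fasta, out_tsv, tmp="tmp"):
    """Commands for createdb -> search -> convertalis (DB names are placeholders)."""
    return [
        ["mmseqs", "createdb", query_fasta, "queryDB"],
        ["mmseqs", "createdb", target_fasta, "targetDB"],
        ["mmseqs", "search", "queryDB", "targetDB", "resultDB", tmp, *FILTERS],
        ["mmseqs", "convertalis", "queryDB", "targetDB", "resultDB", out_tsv],
    ]


def easy_search_workflow(query_fasta, target_fasta, out_tsv, tmp="tmp"):
    """Single easy-search command with the same filters."""
    return [["mmseqs", "easy-search", query_fasta, target_fasta, out_tsv, tmp, *FILTERS]]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run(search_workflow("query_subset.faa", "406_subset.faa", "search_hits.m8"))
run(easy_search_workflow("query_subset.faa", "406_subset.faa", "easy_hits.m8"))
```

Calling `run(..., dry_run=False)` with mmseqs on the PATH would execute the commands, so the two output files can be compared directly.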

MMseqs Output (for bugs)

MMseqs search output: https://gist.github.com/mcn3159/9a5ed05852e2e83b8656d25f0333a8f3

Context

I am searching a fasta of known bacterial proteins against the bacterial RefSeq WP proteome. I noticed that only about half of my original virulence proteins (out of ~8000) had hits against RefSeq. The RefSeq proteome is large, so I put together a minimal example in which the target and query databases share an exact match (as well as similar proteins, according to easy-search) that mmseqs search doesn't seem to find, but easy-search does.

I can provide the larger fastas if more examples to replicate are necessary.

There are 2 fastas in the attached .zip file, each containing 4 proteins. One of those proteins is an exact match (same WP_number), and 2 others (WP_000633131.1 and WP_000633136.1) are very similar to the protein with the exact match.

fastas_to_search.zip

  • query fasta = query_subset.faa
  • target_fasta = 406_subset.faa

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 15.6f452
  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): conda

mcn3159 · May 03 '24 18:05

The culprit is likely the sequence identity estimation (see https://github.com/soedinglab/MMseqs2/wiki#how-does-mmseqs2-compute-the-sequence-identity).
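For reference, the linked wiki section describes the exact identity (the one computed with `--alignment-mode 3`) as the number of identical aligned residues divided by the number of aligned columns, gap columns included, while the default value is an estimate derived from the alignment score. Below is a minimal sketch of the exact definition only. The function name and the toy alignment are made up for illustration, and the estimation formula is not reproduced here.

```python
def exact_identity(aligned_query, aligned_target, gap="-"):
    """Identical aligned residues / aligned columns (gap columns count)."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for q, t in zip(aligned_query, aligned_target) if q == t and q != gap
    )
    return matches / len(aligned_query)


# Toy alignment (made up): 8 columns, 6 identical, 1 mismatch, 1 gap.
print(exact_identity("MKT-LLVA", "MKTALLIA"))
```

Because the default value is only an estimate, it can differ from this exact figure, which would explain a pair at or near the 0.95 threshold being dropped.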

Adding -a or --alignment-mode 3 fixes the issue. easy-search is better at detecting when an exact sequence identity is required; search estimates the sequence identity by default and only tries to detect when the exact value is needed.
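Applied to a search call, the fix might look like the following. This is a minimal sketch: the database names and the tmp directory are placeholders, and the 0.95 thresholds are taken from the report, with coverage read as `-c 0.95`.

```python
def search_command(fix="-a"):
    """Build an mmseqs search call with one of the two suggested flags."""
    if fix == "-a":
        extra = ["-a"]
    elif fix == "--alignment-mode":
        extra = ["--alignment-mode", "3"]
    else:
        raise ValueError("expected '-a' or '--alignment-mode'")
    return [
        "mmseqs", "search", "queryDB", "targetDB", "resultDB", "tmp",
        "--min-seq-id", "0.95", "-c", "0.95", "--cov-mode", "0",
        *extra,
    ]


print(" ".join(search_command("-a")))
print(" ".join(search_command("--alignment-mode")))
```

Either printed command can be pasted into a shell in place of the original search step; the convertalis step stays the same.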

milot-mirdita · May 04 '24 00:05