MMseqs2 icon indicating copy to clipboard operation
MMseqs2 copied to clipboard

Difference in evalue when doijng search against swissprot_human and swissprot

Open ahof1704 opened this issue 1 year ago • 2 comments

I would like to find the closest human protein sequence for any given query. Often the query sequences are short (10aa). To test if mmseq could help with that, I created a fasta file that contains:

  1. 10 protein sequences from the filtered swissprot dataset to include just human samples (aka, swissprot_human)
  2. 10 short sequences created through sampling one segment of 10aa from the first 10 human sequences

The resulting sequences are illustrated in this csv (I couldn't upload the csv, so I am sharing a link. I hope you can access it).

You will notice that the same query sequence has a different evalue between swissprot and swissprot_human, which I assume plays a role in the reduced number of matches against swissprot. How can I ensure the search results are consistent for the two datasets?

Thank you so much for the help!

ahof1704 avatar Mar 25 '24 11:03 ahof1704

How did you do the searches? How did you create the swissprot_human subset?

I am surprised you are getting results at all for your test 2). The default parameters of MMseqs2 are set in such a way that you should not be able to find anything shorter than 11. The default k-mer size of 6 uses 4 non-informative spaced positions and we require a double consecutive k-mer match, meaning 11 should be the shortest.

milot-mirdita avatar Mar 27 '24 05:03 milot-mirdita

In my search, I am interested only in exact matches to known human proteins, regardless of the significance (which seems to be conditioned on the length of the sequence). For that, I have changed the significance threshold, k-mer size and spacer. This is the search command I am using: mmseqs search /tmp/queryDB /root/mmseqs2_db/swissprot /tmp/resultDB tmp -a --max-seqs 1000 -s 7.5 --comp-bias-corr 0 --mask 0 --split-memory-limit 120G -e 10000 -k 7 --spaced-kmer-mode 0

To create the human subset of the swissprot, I used the following commands to download and filter the dataset: mmseqs databases UniProtKB/Swiss-Prot /root/mmseqs2_db/swissprot tmp mmseqs filtertaxseqdb /root/mmseqs2_db/swissprot /root/mmseqs2_db/swissprot_human --taxon-list 9606

ahof1704 avatar Mar 27 '24 13:03 ahof1704