Difference in evalue when doijng search against swissprot_human and swissprot
I would like to find the closest human protein sequence for any given query. Often the query sequences are short (10aa). To test if mmseq could help with that, I created a fasta file that contains:
- 10 protein sequences from the filtered swissprot dataset to include just human samples (aka, swissprot_human)
- 10 short sequences created through sampling one segment of 10aa from the first 10 human sequences
The resulting sequences are illustrated in this csv (I couldn't upload the csv, so I am sharing a link. I hope you can access it).
You will notice that the same query sequence has a different evalue between swissprot and swissprot_human, which I assume plays a role in the reduced number of matches against swissprot. How can I ensure the search results are consistent for the two datasets?
Thank you so much for the help!
How did you do the searches? How did you create the swissprot_human subset?
I am surprised you are getting results at all for your test 2). The default parameters of MMseqs2 are set in such a way that you should not be able to find anything shorter than 11. The default k-mer size of 6 uses 4 non-informative spaced positions and we require a double consecutive k-mer match, meaning 11 should be the shortest.
In my search, I am interested only in exact matches to known human proteins, regardless of the significance (which seems to be conditioned on the length of the sequence). For that, I have changed the significance threshold, k-mer size and spacer. This is the search command I am using:
mmseqs search /tmp/queryDB /root/mmseqs2_db/swissprot /tmp/resultDB tmp -a --max-seqs 1000 -s 7.5 --comp-bias-corr 0 --mask 0 --split-memory-limit 120G -e 10000 -k 7 --spaced-kmer-mode 0
To create the human subset of the swissprot, I used the following commands to download and filter the dataset:
mmseqs databases UniProtKB/Swiss-Prot /root/mmseqs2_db/swissprot tmp
mmseqs filtertaxseqdb /root/mmseqs2_db/swissprot /root/mmseqs2_db/swissprot_human --taxon-list 9606