Mash
Mash skips the output for a few protein sequences [K-mer 3 and Sketch size 1000]
Here is an example. I am trying to compare two sequences, sequence1.fa and sequence2.fa.
sequence1.fa
sequence1 VKPFQTDALVITPGQTTNVLFTANASTNVGAVQQFFIAARPFVTGGGTFDNSTVAGIMSYNISNSNNSSS IMMPKLPSLNDTAFAANFSAKLR
sequence2.fa
sequence2 MTVYNATFTINFYNEGEWGGPEPYGYIKAYLTNPDHDFEIWKQDDWGKSTPERSTYTQTIKISSDTGSPI NQMCFYGDVKEYDVGNADDILAYPSQKVCSTPGVTVRLDGDEKGSYVTIKYSLTPA
I am using a K-mer value of 3 for my protein clustering. Clustering works well at this K-mer for the remaining sequences. However, 2% of the sequences produce no output (and no error), e.g.:
mash sketch -k 3 -s 1000 -a sequence1.fa
mash sketch -k 3 -s 1000 -a sequence2.fa
I got the corresponding mash files (sequence1.msh and sequence2.msh). I am unable to attach the *.msh files here.
I have also tried K-mer 2 and K-mer 4 for the same sequences, and both work well. Only for K-mer 3 is there no output (with sketch sizes from 700 to 1000). I can reproduce this on my system. I am using Mash version 2.0 on a Mac. Let me know if you need any other information.
If you have time please have a look at it.
Thank you!
What's happening is that the distance estimate between these sequences is greater than 1, so the pair gets filtered out of the output. Theoretically, this means they are so distant that each residue has likely mutated more than once. Practically, though, this can be treated the same as a distance of 1: the sequences are unrelated.
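To illustrate how an estimate can exceed 1, here is a minimal Python sketch. It assumes the distance formula from the Mash paper, D = -(1/k) * ln(2j / (1 + j)), where j is the Jaccard estimate and k is the k-mer size. The function name `mash_distance` is made up for this example and is not part of Mash, and returning infinity for j = 0 is just the limit of the formula, not necessarily what Mash itself does.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Distance estimate from a Jaccard estimate, per the Mash paper formula.

    Hypothetical helper for illustration only; not Mash's actual code.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be between 0 and 1")
    if jaccard == 0.0:
        # Limit of the formula as j -> 0 (an assumption made for this sketch).
        return math.inf
    return -math.log(2.0 * jaccard / (1.0 + jaccard)) / k


# With k = 3, a small Jaccard estimate pushes the distance past 1.
for j in (0.5, 0.03, 0.02):
    print(f"k=3, j={j}: distance = {mash_distance(j, 3):.4f}")
```

With a small k such as 3, the 1/k factor shrinks the distance less, so a pair sharing only a small fraction of k-mers can land above 1. Under the same formula, a larger k would give a smaller distance for the same Jaccard estimate.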
@ondovb Thank you very much for the information!