Mash
Question about k-mers with ambiguous nucleotides
Are k-mers with ambiguous nucleotides (e.g. N) included in the sketch or are they thrown out?
I would imagine the best strategy is to have Mash filter these k-mers out. I suppose it could be handled by input processing: breaking fasta sequences into multiple sequences at every ambiguous nucleotide. This does not seem ideal.
Thanks.
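For reference, the input-processing workaround described above could look roughly like the sketch below. The function name `split_at_ambiguous` is made up for illustration, and upper-casing the input before matching is an assumption of this sketch, not something Mash is known to require.

```python
import re

def split_at_ambiguous(seq, k):
    """Split a sequence at every run of non-ACGT characters.

    Fragments shorter than k are dropped because they cannot
    contribute a k-mer. Hypothetical helper, not part of Mash.
    """
    # Assumption: treat lowercase and uppercase bases the same way.
    fragments = re.split(r"[^ACGT]+", seq.upper())
    return [frag for frag in fragments if len(frag) >= k]
```

Each returned fragment would then be written out as its own fasta record, so no k-mer taken from a fragment can span an ambiguous position.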
They are indeed thrown out; by default only k-mers with ACGT are used.
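To make that behaviour concrete, here is a minimal Python sketch of the idea. It is not Mash's actual code: the function name `acgt_kmers` is hypothetical, and it simply keeps k-mers made up only of A, C, G and T while counting how many were discarded.

```python
def acgt_kmers(seq, k):
    """Return (kept, dropped): k-mers containing only ACGT, and the
    number of k-mers discarded for containing any other character.

    Illustrative only; this is not how Mash is implemented.
    """
    allowed = set("ACGT")
    seq = seq.upper()  # assumption: case is ignored
    kept = []
    dropped = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= allowed:
            kept.append(kmer)
        else:
            dropped += 1
    return kept, dropped
```

With `acgt_kmers("ACGTNACG", 3)`, the three k-mers overlapping the `N` are discarded and the remaining three are kept.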
Thanks for the quick reply. It sounds like this is handled correctly. My only complaint is that it is not documented clearly here or in the paper. Perhaps this could be noted in the help or documentation. Even more obvious to the user would be to report the number of dropped k-mers along with the info.
A quick note on this. I also had this question upon reading the paper. I found this, http://mash.readthedocs.io/en/latest/sketches.html#strand-and-alphabet, though it still left me with the question of how gaps/ambiguous characters would be handled. My recommendation would be to add an http://mash.readthedocs.io/en/latest/sketches.html#ambiguous-characters section directly after #strand-and-alphabet.
Thanks for all your work on this by the way! This is a great tool.