Question about k-mers with ambiguous nucleotides

Open swamidass opened this issue 7 years ago • 3 comments

Are k-mers with ambiguous nucleotides (e.g. N) included in the sketch or are they thrown out?

I would imagine the best strategy is to have Mash filter these k-mers out. I suppose it could instead be handled by input preprocessing: breaking FASTA sequences into multiple sequences at every ambiguous nucleotide. This does not seem ideal.

Thanks.

swamidass avatar Feb 24 '17 17:02 swamidass
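
For reference, the preprocessing workaround described above (breaking a sequence at every ambiguous nucleotide) could look roughly like the following. This is only a sketch and not part of Mash: the function name `split_at_ambiguous` is hypothetical, and uppercasing the input and treating anything outside ACGT as ambiguous are assumptions made for illustration.

```python
import re


def split_at_ambiguous(seq, alphabet="ACGT"):
    """Split seq into maximal runs containing only characters in alphabet.

    Hypothetical helper for illustration; not part of Mash.
    Assumption: input is uppercased first, so lowercase bases are kept.
    """
    # One or more allowed characters in a row form a fragment; any other
    # character (N, R, Y, gaps, ...) acts as a break point.
    pattern = "[" + re.escape(alphabet) + "]+"
    return re.findall(pattern, seq.upper())


fragments = split_at_ambiguous("ACGTNNACGRTT")
print(fragments)
```

Each fragment would then be written out as its own FASTA record. Fragments shorter than the k-mer size contribute no k-mers, which is one reason this approach is clumsier than filtering inside the tool.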

They are indeed thrown out; by default only k-mers with ACGT are used.

ondovb avatar Feb 24 '17 17:02 ondovb
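
To make the behavior described in the reply concrete, here is a minimal sketch of keeping only k-mers made entirely of A, C, G, and T while counting the ones that are skipped. It is not Mash's implementation: the function name `valid_kmers`, the uppercasing step, and the returned dropped count are all assumptions chosen for illustration.

```python
def valid_kmers(seq, k, alphabet="ACGT"):
    """Return (kept, dropped): k-mers using only `alphabet`, and a skip count.

    Hypothetical helper for illustration; not Mash's actual code.
    Assumption: input is uppercased before checking the alphabet.
    """
    allowed = set(alphabet)
    seq = seq.upper()
    kept = []
    dropped = 0
    # Slide a window of length k over the sequence.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in allowed for base in kmer):
            kept.append(kmer)
        else:
            # The window contains at least one character outside the alphabet.
            dropped += 1
    return kept, dropped


kept, dropped = valid_kmers("ACGTNACGT", 3)
print(kept, dropped)
```

Note that a single ambiguous base invalidates every window that overlaps it, so one N can remove up to k k-mers.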

Thanks for the quick reply. Sounds like this is handled correctly. My only complaint is that it is not documented clearly here or in the paper. Perhaps this could be noted in the help text or documentation. Even more obvious to the user would be to report the number of dropped k-mers alongside the other info output.

swamidass avatar Feb 24 '17 20:02 swamidass

A quick note on this. I also had this question upon reading the paper. I found this, http://mash.readthedocs.io/en/latest/sketches.html#strand-and-alphabet, though it still left me with the question of how gaps/ambiguous characters would be handled. My recommendation would be to add a http://mash.readthedocs.io/en/latest/sketches.html#ambiguous-characters section directly after #strand-and-alphabet.

Thanks for all your work on this by the way! This is a great tool.

MKLau avatar Dec 21 '17 17:12 MKLau