Mash
finding similar sequences
I am trying to find the sequences most similar to a given query sequence. Assume all sequences are the same length and that I have to search a set of a million sequences. How could I use Mash to achieve this?
Assuming you have the subject set in a multi-fasta:
mash sketch -i subjects.fna
mash sketch query.fna
mash dist subjects.fna.msh query.fna.msh | sort -gk3 > out
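Other sketching parameters may be appropriate for your data, but the -i is key: it sketches each sequence in the file individually instead of treating the whole file as one sequence.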
Thanks! I want to use the code in my own program, i.e. generate the output from within my C++ program instead of using the command line. Could you please guide me? The inputs are a subjects.txt file containing all sequences (or non-genomic strings) and a query string.
Mash isn't officially encapsulated as a library, so that won't be completely straightforward, but it can be done with some copying and pasting. A good place to start would be copying and modifying CommandDistance.cpp, which handles the I/O in run() and has some global functions for the comparisons. A real API is currently a wish-list item, and related to #49.
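If copying Mash's sources is more than you need, a lower-effort alternative (not the approach described above, and not part of Mash itself) is to run the mash executable from your program and parse what it prints. Below is a minimal sketch, assuming a POSIX system (for popen), that mash is on your PATH, and that the .msh files were built with the commands earlier in this thread. The names MashHit, parseMashDistLine, sortByDistance and runMashDist are made up for illustration; the only thing taken from Mash is the tab-separated column order of mash dist (reference ID, query ID, distance, p-value, shared hashes).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One row of `mash dist` output (hypothetical struct, not a Mash type).
struct MashHit {
    std::string reference;     // subject sequence ID
    std::string query;         // query sequence ID
    double distance;           // Mash distance, 0 means identical sketches
    double pValue;
    std::string sharedHashes;  // e.g. "950/1000"
};

// Convert text to double; false if the whole string is not a number.
static bool toDouble(const std::string& text, double& out) {
    char* end = nullptr;
    out = std::strtod(text.c_str(), &end);
    return end != text.c_str() && *end == '\0';
}

// Parse one tab-separated output line; returns false if it is malformed.
bool parseMashDistLine(const std::string& line, MashHit& hit) {
    std::istringstream in(line);
    std::string distance, pValue;
    if (!std::getline(in, hit.reference, '\t') ||
        !std::getline(in, hit.query, '\t') ||
        !std::getline(in, distance, '\t') ||
        !std::getline(in, pValue, '\t') ||
        !std::getline(in, hit.sharedHashes)) {
        return false;
    }
    return toDouble(distance, hit.distance) && toDouble(pValue, hit.pValue);
}

// Order hits so the most similar subjects (smallest distance) come first.
void sortByDistance(std::vector<MashHit>& hits) {
    std::stable_sort(hits.begin(), hits.end(),
                     [](const MashHit& a, const MashHit& b) {
                         return a.distance < b.distance;
                     });
}

// Run `mash dist <subjects> <query>` and collect the parsed rows.
// ASSUMPTION: POSIX popen is available and `mash` is on PATH.
// The paths are pasted into a shell command unquoted, so only pass
// trusted file names.
std::vector<MashHit> runMashDist(const std::string& subjectsSketch,
                                 const std::string& querySketch) {
    assert(!subjectsSketch.empty() && !querySketch.empty());
    const std::string command =
        "mash dist " + subjectsSketch + " " + querySketch;
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) {
        throw std::runtime_error("could not start mash");
    }
    std::vector<MashHit> hits;
    std::string line;
    char buffer[4096];
    MashHit hit;
    while (std::fgets(buffer, sizeof buffer, pipe) != nullptr) {
        line += buffer;
        if (line.back() != '\n') continue;  // long line, keep reading
        line.pop_back();
        if (parseMashDistLine(line, hit)) hits.push_back(hit);
        line.clear();
    }
    // Handle a final line that did not end with a newline.
    if (!line.empty() && parseMashDistLine(line, hit)) hits.push_back(hit);
    pclose(pipe);
    sortByDistance(hits);
    return hits;
}
```

Calling runMashDist("subjects.fna.msh", "query.fna.msh") would then return the subjects closest to the query first, much like the sort -gk3 step in the shell pipeline. This starts a separate process for every query, so for many queries it may be worth sketching them all into one file and running mash dist once.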