
finding similar sequences

Open sbdk82 opened this issue 7 years ago • 3 comments

I am trying to find sequences similar to a given query sequence. Assume all sequences are the same length and that I have to search a set of a million sequences. How could I use Mash to achieve this?

sbdk82 avatar Mar 15 '17 14:03 sbdk82

Assuming you have the subject set in a multi-fasta:

mash sketch -i subjects.fna
mash sketch query.fna
mash dist subjects.fna.msh query.fna.msh | sort -gk3 > out

Other sketching parameters may be appropriate for your data, but -i is key: it sketches each sequence in the multi-fasta individually rather than sketching the file as a whole.

ondovb avatar Mar 15 '17 17:03 ondovb

Thanks! I want to use the code inside my own program, i.e. generate the output from my C++ program instead of from the command line. Could you please guide me? The inputs are a subjects.txt file containing all the sequences (or non-genomic strings) and a query string.

sbdk82 avatar Mar 15 '17 17:03 sbdk82

Mash isn't officially encapsulated as a library so that won't be completely straightforward, but it can be done with some copying and pasting. A good place to start would be copying and modifying CommandDistance.cpp, which handles the I/O in run() and has some global functions for the comparisons. A real API is currently a wish-list item, and related to #49.

ondovb avatar Mar 15 '17 17:03 ondovb
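
For readers who only need the comparison step inside a C++ program, the idea behind the sketch and dist commands can be illustrated without any Mash code. The following is a minimal sketch under stated assumptions, not Mash's implementation: the function names (sketchSequence, estimateJaccard, mashDistance, rankSubjects) are hypothetical, std::hash stands in for the MurmurHash3 that Mash uses, and k-mers are not canonicalized, so the numbers will not match Mash's output. It only shows bottom-s MinHash sketching, the Jaccard estimate, and the Mash distance formula D = -ln(2j / (1 + j)) / k.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: bottom-s MinHash sketch (the s smallest distinct k-mer hashes).
// Assumption: std::hash replaces Mash's MurmurHash3, so values differ from Mash.
std::vector<std::uint64_t> sketchSequence(const std::string& seq, std::size_t k, std::size_t s) {
    assert(k > 0 && s > 0);
    std::vector<std::uint64_t> hashes;
    if (seq.size() < k) return hashes;  // no complete k-mer available
    std::hash<std::string> hasher;
    for (std::size_t i = 0; i + k <= seq.size(); ++i)
        hashes.push_back(static_cast<std::uint64_t>(hasher(seq.substr(i, k))));
    std::sort(hashes.begin(), hashes.end());
    hashes.erase(std::unique(hashes.begin(), hashes.end()), hashes.end());
    if (hashes.size() > s) hashes.resize(s);
    return hashes;
}

// Estimate the Jaccard index from two sorted sketches by walking the
// s smallest hashes of their union and counting those present in both.
double estimateJaccard(const std::vector<std::uint64_t>& a,
                       const std::vector<std::uint64_t>& b, std::size_t s) {
    std::size_t i = 0, j = 0, shared = 0, seen = 0;
    while (seen < s && i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { ++shared; ++i; ++j; }
        ++seen;
    }
    // Leftover hashes in either sketch still belong to the union.
    if (seen < s) seen += std::min(s - seen, (a.size() - i) + (b.size() - j));
    return seen == 0 ? 0.0 : static_cast<double>(shared) / static_cast<double>(seen);
}

// Mash distance D = -ln(2j / (1 + j)) / k, clamped to [0, 1] in this sketch.
double mashDistance(double jaccard, std::size_t k) {
    if (jaccard <= 0.0) return 1.0;
    if (jaccard >= 1.0) return 0.0;
    double d = -std::log(2.0 * jaccard / (1.0 + jaccard)) / static_cast<double>(k);
    return std::min(d, 1.0);
}

// Rank subjects by distance to the query; returns (distance, subject index) pairs,
// closest first, mirroring `mash dist ... | sort -gk3`.
std::vector<std::pair<double, std::size_t>> rankSubjects(
        const std::vector<std::string>& subjects, const std::string& query,
        std::size_t k, std::size_t s) {
    const std::vector<std::uint64_t> q = sketchSequence(query, k, s);
    std::vector<std::pair<double, std::size_t>> ranked;
    for (std::size_t idx = 0; idx < subjects.size(); ++idx) {
        double j = estimateJaccard(sketchSequence(subjects[idx], k, s), q, s);
        ranked.emplace_back(mashDistance(j, k), idx);
    }
    std::sort(ranked.begin(), ranked.end());
    return ranked;
}
```

Calling rankSubjects(subjects, query, k, s) with the lines of subjects.txt as the subjects vector would give the closest sequences first. For results that agree with the mash command line, the route suggested above (adapting CommandDistance.cpp) remains the one to follow; this sketch is only meant to make the underlying computation concrete.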