Ambiguous IUPAC codes
These changes add functionality to accept ambiguous IUPAC codes in bwtdisk_prepare.
It handles the ambiguity by choosing a random base that is within the set of bases for that ambiguity code. For example, N can be A, C, G, or T; S can be C or G; H can be A, C, or T; etc.
This required libdbgfm to be linked to bwtdisk_prepare because it uses the IUPAC methods found in alphabet.cpp.
I'd suggest picking the lexicographically smallest possible nucleotide for that ambiguity code rather than a random one, to make the result deterministic.
I have updated the function so that it picks the lexicographically smallest possible nucleotide for each ambiguity code. Thanks for the suggestion, @sjackman.
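For readers following along, the deterministic mapping can be sketched roughly as below. This is only an illustration of the idea, not the code in this PR: `smallestBaseForIUPAC` and `disambiguate` are hypothetical names, the sketch does not use the IUPAC methods in alphabet.cpp, and it assumes a DNA alphabet ordered A < C < G < T.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not the function from alphabet.cpp).
// Returns the lexicographically smallest base (A < C < G < T) that an
// IUPAC nucleotide code can stand for, or '\0' if the code is not recognized.
// Lower-case input is accepted; the returned base is always upper-case.
char smallestBaseForIUPAC(char code)
{
    switch (std::toupper(static_cast<unsigned char>(code))) {
        // Every code whose base set contains A resolves to A.
        case 'A': case 'M': case 'R': case 'W':
        case 'D': case 'H': case 'V': case 'N':
            return 'A';
        // Codes without A whose base set contains C resolve to C.
        case 'C': case 'S': case 'Y': case 'B':
            return 'C';
        // G itself, and K (G or T), resolve to G.
        case 'G': case 'K':
            return 'G';
        case 'T':
            return 'T';
        default:
            return '\0';
    }
}

// Hypothetical helper: replaces every recognized IUPAC code in seq with its
// smallest base and leaves unrecognized characters unchanged.
std::string disambiguate(const std::string& seq)
{
    std::string out(seq);
    for (char& c : out) {
        const char base = smallestBaseForIUPAC(c);
        if (base != '\0')
            c = base;
    }
    return out;
}
```

With this mapping, N, H, and every other code that includes A become A, while S becomes C and K becomes G, so the same input always produces the same output.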
What's your use case, Cole? Is it that you have reads with Ns in them, or do you have reads with other IUPAC codes in them, or are you working with sequences other than reads?
My use case is using assembled genomes. Ideally, I would like to be able to keep Ns, but then I thought it would be helpful to accept all IUPAC codes.
Do you know how hard it would be to accept Ns? I figured it might be difficult to add another character to the alphabet due to the encoding/compression.
Jared (@jts) is in a better position to answer that question than I am.