dbgfm Ambiguous IUPAC codes

These changes add functionality to accept ambiguous IUPAC codes in bwtdisk_prepare.

It handles the ambiguity by choosing a random base that is within the set of bases for that ambiguity code. For example, N can be A, C, G, or T; S can be C or G; H can be A, C, or T; etc.

This required libdbgfm to be linked to bwtdisk_prepare because it uses the IUPAC methods found in alphabet.cpp.

Aug 03 '17 16:08 Colelyman

I'd suggest picking the lexicographically smallest possible nucleotide for that ambiguity code rather than a random one, to make the result deterministic.

Aug 03 '17 23:08 sjackman

I have updated the function so that it picks the lexicographically smallest possible nucleotide for each ambiguity code. Thanks for the suggestion, @sjackman.
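For illustration, the deterministic choice could look roughly like the sketch below. It is not the code from alphabet.cpp: the names `iupacBases`, `smallestBase`, and `resolveAmbiguity` are hypothetical, and treating U as T and leaving unrecognized characters unchanged are assumptions made for this example.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not the libdbgfm API): returns the bases that an
// IUPAC nucleotide code stands for, listed in alphabetical order.
// Returns an empty string for characters that are not IUPAC codes.
std::string iupacBases(char code)
{
    switch (std::toupper(static_cast<unsigned char>(code))) {
        case 'A': return "A";
        case 'C': return "C";
        case 'G': return "G";
        case 'T': return "T";
        case 'U': return "T";    // assumption: treat uracil as T
        case 'R': return "AG";
        case 'Y': return "CT";
        case 'S': return "CG";
        case 'W': return "AT";
        case 'K': return "GT";
        case 'M': return "AC";
        case 'B': return "CGT";
        case 'D': return "AGT";
        case 'H': return "ACT";
        case 'V': return "ACG";
        case 'N': return "ACGT";
        default:  return "";
    }
}

// Picks the lexicographically smallest base for a code, or '\0' if the
// character is not a recognized IUPAC code.
char smallestBase(char code)
{
    const std::string bases = iupacBases(code);
    return bases.empty() ? '\0' : bases.front();
}

// Replaces every recognized code in a sequence with its smallest base.
// Assumption: unrecognized characters are left unchanged.
std::string resolveAmbiguity(std::string seq)
{
    for (char& c : seq) {
        const char b = smallestBase(c);
        if (b != '\0')
            c = b;
    }
    return seq;
}
```

With this mapping, N, H, and R all resolve to A, while S and Y resolve to C, so the same input always yields the same output.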

Aug 07 '17 16:08 Colelyman

What's your use case, Cole? Is it that you have reads with Ns in them, or do you have reads with other IUPAC codes in them, or are you working with sequences other than reads?

Aug 07 '17 19:08 sjackman

My use case is using assembled genomes. Ideally, I would like to be able to keep Ns, but then I thought it would be helpful to accept all IUPAC codes.

Do you know how hard it would be to accept Ns? I figured it might be difficult to add another character to the alphabet due to the encoding/compression.

Aug 07 '17 20:08 Colelyman

Jared (@jts) is in a better position to answer that question than I am.

Aug 07 '17 21:08 sjackman