Ambiguous IUPAC codes

Open Colelyman opened this issue 8 years ago • 5 comments

These changes add functionality to accept ambiguous IUPAC codes in bwtdisk_prepare.

It handles the ambiguity by choosing a random base that is within the set of bases for that ambiguity code. For example, N can be A, C, G, or T; S can be C or G; H can be A, C, or T; etc.

This required linking libdbgfm to bwtdisk_prepare, because bwtdisk_prepare now uses the IUPAC methods found in alphabet.cpp.

Colelyman avatar Aug 03 '17 16:08 Colelyman

I'd suggest picking the lexicographically smallest possible nucleotide for that ambiguity code rather than a random one, to make the result deterministic.

sjackman avatar Aug 03 '17 23:08 sjackman

I have updated the function so that it picks the lexicographically smallest possible nucleotide for each ambiguity code. Thanks for the suggestion, @sjackman.
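
For readers following along, the deterministic rule can be sketched roughly as below. This is a minimal illustration, not the code from this change or from alphabet.cpp: the names `iupacBases` and `resolveAmbiguity` are hypothetical, and the table is simply the standard IUPAC nucleotide code set with each expansion written in sorted order, so the first character is the lexicographically smallest base.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from alphabet.cpp): returns the bases an IUPAC
// code may stand for, in sorted order, or nullptr for unrecognised symbols.
const char* iupacBases(char code) {
    switch (std::toupper(static_cast<unsigned char>(code))) {
        case 'A': return "A";
        case 'C': return "C";
        case 'G': return "G";
        case 'T': return "T";
        case 'M': return "AC";
        case 'R': return "AG";
        case 'W': return "AT";
        case 'S': return "CG";
        case 'Y': return "CT";
        case 'K': return "GT";
        case 'V': return "ACG";
        case 'H': return "ACT";
        case 'D': return "AGT";
        case 'B': return "CGT";
        case 'N': return "ACGT";
        default:  return nullptr;
    }
}

// Hypothetical helper: replaces every IUPAC code with the smallest base it
// can represent. Unrecognised symbols are left untouched (an assumption;
// the real tool may prefer to reject them).
std::string resolveAmbiguity(std::string seq) {
    for (char& c : seq) {
        const char* bases = iupacBases(c);
        if (bases != nullptr) {
            c = bases[0];  // expansions are sorted, so index 0 is the minimum
        }
    }
    return seq;
}
```

With this sketch, `resolveAmbiguity("ACGTNSHK")` yields `"ACGTACAG"`, and the same input always produces the same output, which is the point of choosing the smallest base over a random one.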

Colelyman avatar Aug 07 '17 16:08 Colelyman

What's your use case, Cole? Is it that you have reads with Ns in them, or do you have reads with other IUPAC codes in them, or are you working with sequences other than reads?

sjackman avatar Aug 07 '17 19:08 sjackman

My use case is using assembled genomes. Ideally, I would like to be able to keep Ns, but then I thought it would be helpful to accept all IUPAC codes.

Do you know how hard it would be to accept Ns? I figured it might be difficult to add another character to the alphabet due to the encoding/compression.

Colelyman avatar Aug 07 '17 20:08 Colelyman

Jared (@jts) is in a better position to answer that question than myself.

sjackman avatar Aug 07 '17 21:08 sjackman