kmers
kmers copied to clipboard
Encoding Type
I'm assuming we'd be using 2 bits per bp. There's a couple of common encodings, but I want to suggest
A -> 00
C -> 01
T -> 10
G -> 11
There's a couple of benefits:
- These are the 2nd and 3rd bits of the ASCII encoding of the corresponding base pairs. Conversion from byte strings would be easy.
- Complement by using XOR ...0101010
On a slightly unrelated note, I've worked on some sequence manipulation stuff that use SIMD (eg., here, here for a library that was abandoned). Many of these ideas could be applicable here as well. I'm assuming that we want scalar ops only here because SIMD registers are probably too wide (128 or 256 bits) for handling kmers that are relatively short.
Hi @Daniel-Liu-c0deb0t,
I agree there are some nice properties to this encoding. Given that, it sort of bothers me that we went with the in-order encoding in most of our C++ software (e.g. here). Given that we'll have to interact with that it may be useful to have a utility function for (quickly) initializing a k-mer from a bit-packed k-mer using another encoding scheme. Frankly, I don't know if there are others apart from A=0, C=1, G=2, T=3 and the one you proposed, but it's worth discussing.
Regarding SIMD capability, I actually think that that may provide some useful benefits when k-mers no longer fit in a single word. For example, in cuttlefish we use a class similar to what I imagine this one being like, and when we want to use something like k=61 or k=75, we would probably be able to fill some SIMD registers.
Hi @rob-p and @Daniel-Liu-c0deb0t ,
The proposed encoding here is a nice one. Albeit, the A = 0, C = 1, G = 2, and T = 3 encoding does make the lexicographic comparisons simple and straightforward — facilitating fast canonical k-mers computation, which are heavily used in the lab softwares e.g. in cuttlefish.
Hi @jamshed-k,
Your point is important when we consider interoperating with other encodings. However, for many purposes, all that matters about canonicalization is self-consistency. From that perspective, any bijective mapping from the nucleotides to their encoded integers will work. For example, we could easily define the canonical k-mer to be the max rather than the min of the encoding and it's reverse complement. Likewise, we could define a reversible hash for k-mers and define the canonical k-mer to be the one with the minimum (or maximum) hash value. Really, all that matters is that we chose the canonical form consistently, I think.
Hi @rob-p,
Sure, obviously we can define the canonical k-mer with self-consistency under any bijective mapping from the nucleotide-bases to integers. My point is that if defining the canonicalization function for a k-mer x
as the lexicographical minimum of x
and x_bar
(or the max) — as prevalent in the lab tools and literature — then the in-order mapping should be more efficient for the function computation.
Hi all,
I think it could be nice to support any encoding, by add an enum type in Kmer
definition we can support any encoding.
If I didn't made any mistake we have only 24 possible encoding.
Frankly, I don't know if there are others
The UCSC 2bit format uses a different encoding:
T - 00, C - 01, A - 10, G - 11
https://genome.ucsc.edu/FAQ/FAQformat.html#format7
I think it could be nice to support any encoding
This would be my preference. The encoding only matters when converting to an unpacked representation or complementing the bases.
I made some work around this. I create a trait encoding::Encoder, use by Kmer to populate storage.
I also created a struct encoding::Naive that can convert a DNA sequence to 2bit with any kind of encoding. I still need work on complement part, before create a PR for this.
The trait solution is nice because we can create any Encoder we want and the user can create one for his specific usage.