kmers Encoding Type

I'm assuming we'd be using 2 bits per bp. There's a couple of common encodings, but I want to suggest

A -> 00
C -> 01
T -> 10
G -> 11

There's a couple of benefits:

These are the 2nd and 3rd bits of the ASCII encoding of the corresponding base pairs. Conversion from byte strings would be easy.
Complement by using XOR ...0101010

On a slightly unrelated note, I've worked on some sequence manipulation stuff that use SIMD (eg., here, here for a library that was abandoned). Many of these ideas could be applicable here as well. I'm assuming that we want scalar ops only here because SIMD registers are probably too wide (128 or 256 bits) for handling kmers that are relatively short.

Apr 28 '21 22:04 Daniel-Liu-c0deb0t

Hi @Daniel-Liu-c0deb0t,

I agree there are some nice properties to this encoding. Given that, it sort of bothers me that we went with the in-order encoding in most of our C++ software (e.g. here). Given that we'll have to interact with that it may be useful to have a utility function for (quickly) initializing a k-mer from a bit-packed k-mer using another encoding scheme. Frankly, I don't know if there are others apart from A=0, C=1, G=2, T=3 and the one you proposed, but it's worth discussing.

Regarding SIMD capability, I actually think that that may provide some useful benefits when k-mers no longer fit in a single word. For example, in cuttlefish we use a class similar to what I imagine this one being like, and when we want to use something like k=61 or k=75, we would probably be able to fill some SIMD registers.

Apr 29 '21 00:04 rob-p

Hi @rob-p and @Daniel-Liu-c0deb0t ,

The proposed encoding here is a nice one. Albeit, the A = 0, C = 1, G = 2, and T = 3 encoding does make the lexicographic comparisons simple and straightforward — facilitating fast canonical k-mers computation, which are heavily used in the lab softwares e.g. in cuttlefish.

Apr 29 '21 15:04 jamshed

Hi @jamshed-k,

Your point is important when we consider interoperating with other encodings. However, for many purposes, all that matters about canonicalization is self-consistency. From that perspective, any bijective mapping from the nucleotides to their encoded integers will work. For example, we could easily define the canonical k-mer to be the max rather than the min of the encoding and it's reverse complement. Likewise, we could define a reversible hash for k-mers and define the canonical k-mer to be the one with the minimum (or maximum) hash value. Really, all that matters is that we chose the canonical form consistently, I think.

Apr 29 '21 16:04 rob-p

Hi @rob-p,

Sure, obviously we can define the canonical k-mer with self-consistency under any bijective mapping from the nucleotide-bases to integers. My point is that if defining the canonicalization function for a k-mer x as the lexicographical minimum of x and x_bar (or the max) — as prevalent in the lab tools and literature — then the in-order mapping should be more efficient for the function computation.

Apr 29 '21 17:04 jamshed

Hi all,

I think it could be nice to support any encoding, by add an enum type in Kmer definition we can support any encoding.

If I didn't made any mistake we have only 24 possible encoding.

Jul 31 '21 19:07 natir

Frankly, I don't know if there are others

The UCSC 2bit format uses a different encoding:

T - 00, C - 01, A - 10, G - 11

https://genome.ucsc.edu/FAQ/FAQformat.html#format7

I think it could be nice to support any encoding

This would be my preference. The encoding only matters when converting to an unpacked representation or complementing the bases.

Oct 10 '21 05:10 d-cameron

I made some work around this. I create a trait encoding::Encoder, use by Kmer to populate storage.

I also created a struct encoding::Naive that can convert a DNA sequence to 2bit with any kind of encoding. I still need work on complement part, before create a PR for this.

The trait solution is nice because we can create any Encoder we want and the user can create one for his specific usage.

Oct 11 '21 07:10 natir

kmers kmers copied to clipboard

Encoding Type

kmers
kmers copied to clipboard