kmers
kmers copied to clipboard
Internal storage order
So after trying to write some code interfacing kmers with some k-mers parsed using needletail, I realized we are storing the k-mers in the opposite order. In kmers the leftmost nucleotide is stored in the highest order bit, in needletail and some of the other (C++) libraries I have used in the past it's the opposite.
Are there any strong arguments for one of these schemes over the other (cc @natir, @Daniel-Liu-c0deb0t, @luizirber)? Is this something we should also consider making part of the policy of how the k-mers are constructed?
I didn't have any argument to make a choice between big endian or little endian.
We can choose one or let the user choose what they want (another generic parameter, submodule, features). But let user choose what they want can generate a huge work.
Thanks @natir - I agree it would be a pain to genericize over. How hard would it be to switch the endianness to match the existing library representations in the space?
I think change could be easy.
My question is should we made this change? Needletail and other C++ libraries choose little endian, why? can we found argument? It's really the most popular choices?
@natir — I think that's a totally fair question. As you bring up in your initial response, I think the right answer is simply that the choice is arbitrary. I can't think of a compelling (performance or simplicity) reason, offhand, to choose one representation over the other.
Ideally what I'd like to do is to have a survey of how popular the 2 different encodings are. In the lack of a compelling reason to choose one over the other, popularity is an argument in and of itself (to maximize compatibility). Of course, I think getting a widespread and accurate survey will be a challenge.
Do you know of other libraries that use big endian? Do you have any idea of the prevalence of these encodings?
Edit : made a twitter poll in case it might be useful : https://twitter.com/nomad421/status/1464259140767948810.
Having both written my own kmer encoding scheme, and been burnt using libraries with different scheme, I thought I'd weigh in. There are actually 4 independent choices one needs to make when encoding kmers:
Base encoding scheme
The first is how to encode each base. 4-bit encoding is possible but the only place I've seen it used is in htslib to encode sequence bases - never in a kmer library.
Of the two bit encoding schemes, the two most common are ACGT=0123, and TCAG=0123. The latter is the scheme used for UCSC 2bit encoding and has the advantage that the two bits are biologically meaningful and complementing is just an XOR.
Base endianness
The second choice is what you're asking here. The conceptual difference is whether we consider the bases as a string, or as a number.
In a string interpretation, the bases are placed based on their string index (kmer offset = 2 * string offset). So TCAG is:
LSbit 01 23 45 67
T C A G
In a numeric interpretation, the bases are placed by interpreting the sequence string as a number. When representing numbers, the left-most (first) digit is the most significant. Interpreting the nucleotide sequence as a base-4 number we get:
MSbit 76 54 32 01
T C A G
There are actually two more choices a kmer encoding scheme can make that you haven't addressed:
Padding endianness
If the bases don't fill the entire data type, then you need to decide where the padding goes. If you have the kmer CAG, which bits are meaningful?
These ones?
LSbit 01 23 45 67
----------
Or these one?
LSbit 01 23 45 67
----------
The padding endianness doesn't necessarily correspond to the base endianness.
Array endianness
I've found that almost every kmer library is also a 2-bit encoding sequence library. Typically they're some variant on [u8], or [u64], but if you want to convert between two libraries, then you need to know the endianness of your array: does the first array element contain the first few bases of the sequence or the last?
I don't think I've encountered a final-bases-in-first-array-element encoding scheme here: at this level the consensus seems to be a string-like representation where the first element of the array corresponds to the first N bases of the sequence.
Actual endianness
One final consideration is the on-disk format. If you're using not using a multi-byte representation format then it's not a concern, but if you're using multi-byte data type and you need to interop with other libraries, then you do need to specify the endianness - even if that answer is just 'platform-native'.
Concluding remarks
In many way, most of these choices don't matter that much. There's some performance and implementation complexity pros and cons for each but they all a few bit manipulation operations away from each other.
For what it's worth, my preferred scheme is the one that most closely corresponds to the .2bit format: USCS base encoding; numeric byte endianness; padding in LSBs; string-like arrays; big-endian on disk. This allows you to directly subset arbitrary length kmers from a .2bit encoded file (conversion to platform endianness when using multi-byte data types notwithstanding).
Edit: rephased to not explicitly state big/little endianness except for on-disk ordering. Endianness is a numeric concept and, a kmer is really a bit-packed string, not a number so the concept of 'endianness' doesn't technically apply.
The approach I'm advocating has a few nice properties that aren't present in other kmer encoding schemes.
- An encoded 2-bit stream/array of length k represents a big-endian number of length 2k
- The byte sequence can be directly interpreted as a big-endian
[u8],[u16],[u32],[u64], or[u128].- No extra transforms required to get the kmer at position i. Just grab the integer primitive that contains your key, translate to platform-endianness, and bitshift out the relevant bases in the 1-2 words the kmer overlaps.
- Works for all word sizes and alignments.
- No need for the library to chose a word size or have different representation for different word size. They all happily interop (unsafe word-alignment of pointers also works if you're happy to read a few bytes outside your array bounds).
- Appending to a long 2-bit encoded sequence is just appending to the sequence.
- If the padding bits were in the MSBs, you'd have to bit shift your entire sequence.
In short, A format compatible with the UCSC .2bit format has a number of advantages when dealing with 2-bit encoded sequences. They're not particularly helpful/concerning if purely with kmers but, in my experience, every kmer library is either secretly or explicitly a library for 2-bit encoding of sequences. It makes a lot of sense to make this explicit in the library scope and actually support arbitrary length sequences. From this perspective, a format that plays nicely with both fixed and variable k makes the most sense.
TDLR: treating the first base as the most signficant bits allows for big-endian consistency across bytes, words, and entire sequences.
Performance-wise, u64::from_be() is just bswap(x86)/rev(ARM) so I doubt it'd make any meaningful impact as it's a tiny overhead that hit with 2-bit sequence -> kmer conversions.
Thanks a lot @d-cameron for all this information.
I will try to give my opinion on all point but I probably miss some stuff.
Base encoding scheme
Kmers already support, with not the best code I write in my all life, any encoding. If user want use specificity of some encoding, user can choose faster encoder, or implement another one. So for me this trouble is solve.
Base endianness
I think we can say the pool didn't help to much, each solution seems share same popularity. But I could add another project to the MSbit list, KFF choose Big Endian.
I think Base endianess and Padding endianness choices must be taken in common.
After some reflection I can say I prefer simplify the indexing of kmer. I think Big Endian base endianness with right align Padding endianness is the best choices to simplify access of one base in kmer. As example we can take 6-mer TCAGTC with encoding ACTG = 0123 in u8 slice will produce (if I didn't made any mistake)
198 | 6
11 00 01 10 | XX XX 01 10
G A C T | _ _ C T
This not natural ordering simplify indexing. if you want access to the i-base of kmer you run a division to found position of the correct u8 block in slice, and with modulo you can get the correct two bits. Function get_nuc could look like:
fn get_nuc<P>(kmer: &[P], index: uslice) -> P
where P: bit_field::BitField
{
(kmer[index / P::BIT_LENGTH] >> (index % P::BIT_LENGTH * 2)) & 0b11
}
Array endianness
Actual user of library can choose which type of slice (u8, u16, u32, u64, u128) is used to store Kmer, so for me it's solve.
Actual endianness (can we rename it Storing endianness ?)
For me isn't the trouble of Kmers library we didn't write a serialization format. But I agree that we need to pay attention to this issue as well, but it is clearly not a priority for me.
Actual user of library can choose which type of slice (u8, u16, u32, u64, u128) is used to store Kmer, so for me it's solve.
The main advantage of having the base, byte, and padding all for big-endian is that switching between storage words is trivial. That is, a [u8,2] and a u16 are identical, similarly, [u8,4] and u64. You can subset a Vec
The fact that this only works for big-endian and both x86 and ARM are little-endian is somewhat annoying as it means that whatever choice is made, it's a trade-off of something.
The fact that this only works for big-endian and both x86 and ARM are little-endian is somewhat annoying as it means that whatever choice is made, it's a trade-off of something.
For me there is no good solution but only a less bad solution. (I'm not sure we can define a less bad solution)