sux-rs icon indicating copy to clipboard operation
sux-rs copied to clipboard

Feature Request: extract up to a full (underlying) word type from BitFieldVec

Open rob-p opened this issue 1 year ago • 7 comments

I was wondering if it would be possible to add a function to BitFieldVec that would allow extract multiple elements and returning them in the underlying word type. My specific use-case is this (but I think it's a common pattern).

I have a long string of DNA (alphabet of size 4) that I am packing into a BitFieldVec of width 2. Later, however, I want to extract entire k-mers (substrings of fixed length k). It would be great to be able to extract an entire k-mer at once, which for most common values of k (i.e. k <= 32) would either be entirely in one underlying u64, or would span at most 2. Right now, I have to extract the individual 2-bit characters one-by-one and build up the k-mer, which is quite inefficient.

I would propose something like get_slice(start, num_elem) or get_slice_in_word(start, num_elem), but am happy to leave the design aspects to your preference. Let me know if you have any questions or feedback about the request.

Thanks! Rob

rob-p avatar Sep 02 '24 22:09 rob-p

For reference, I also added this to my own 'packed' wrapper type, which I think I created specifically because sux didn't have it:

https://github.com/RagnarGrootKoerkamp/minimizers/blob/master/src/par/packed.rs#L134

In the future, I would also like to have a to_u128 version of this, but not sure what's the best API there.

RagnarGrootKoerkamp avatar Sep 02 '24 22:09 RagnarGrootKoerkamp

What about getting an iterator, rather then a slice? Then you could collect it. I realize that the method you suggest is more convenient tho.

I think there's a way to get an unchecked iterator. So get_slice should be trivial to implement using that.

vigna avatar Sep 03 '24 10:09 vigna

Hi Sebastiano,

Indeed, the solution I am currently implementing to obtain what I need given the existing interface is to call into_unchecked_iter_from starting at the beginning of the slice I need, and then to collect the next k items into a machine word, appropriately shifted. This certainly gives the correct answer, but I'm guessing it's substantially more overhead (I've not looked at the generated assembly yet). The reason is that in the original BitFieldVec, the items that constitute the underlying k-mer are already packed maximally into the storage. So, if the k-mer exists within a single storage unit (say a u64 if we're using that), then obtaining it is just a fetch, a shift and a mask. It's a bit more work if it spans two storage units, but not much. On the other hand, with the iterator approach, each 2-bit element has to be properly shifted into the output word as it is obtained. For longer k, this can be non-trivially more work than just pulling out the needed bits. I do suspect the compiler is clever enough to help some, and the iterator is pulling in adjacent elements so everything is cached. However, if this use-case is general enough, it may be worthwhile to have specific support for it?

--Rob

rob-p avatar Sep 03 '24 14:09 rob-p

Ah ok I mistakenly understood you wanted them unpacked.

I already have this code in Java (the sublist code of the list view of a bit vector). I think it shouldn't be too difficult to port it.

And yes, it can be significantly optimized.

Or, we can adapt Ragnar's code.

vigna avatar Sep 03 '24 14:09 vigna

Hey, is there still interest for this?

vigna avatar Oct 27 '24 15:10 vigna

I think I worked around it for now in my own code, but for the future, yes this is definitely still useful.

I think the API that Rob suggested, taking a start index and number of elements to cover, makes sense. Then you can shift the resulting value so that it covers C * num_elements bits, starting at the lowest bit.

This is easy when the value fits in up to 7 bytes (or up to 58 bits, to be precise, or better: not 59, 61, 62, or 63 bits), since then a single unaligned read suffices. After that you may need two reads.

RagnarGrootKoerkamp avatar Oct 27 '24 15:10 RagnarGrootKoerkamp

Yes, I would still be interested in this feature and think it would be valuable in several contexts.

rob-p avatar Oct 27 '24 23:10 rob-p