CRoaring icon indicating copy to clipboard operation
CRoaring copied to clipboard

Improve the performance of roaring bitmap under sparse cases

Open stdpain opened this issue 3 years ago • 14 comments

In a sparse roaringbitmap, the size of array_container may often be 1, 2, 3. But we can actually reuse the container pointer (sizeof(void*)) as an array of uint16_t to avoid extra memory allocation. eg:

typedef struct roaring_array_s {
    int32_t size;
    int32_t allocation_size;
    void **containers;
    uint16_t *keys;
    uint8_t *typecodes;
    uint8_t flags;
} roaring_array_t;

roaring_array_t ra; // some init code ra.typecodes[i] = SINGLE_CONTAIN_TYPE + cardinality << 6; ((uint16_t*)ra.containers[i])[1] = value1; ((uint16_t*)ra.containers[i])[2] = value2; ((uint16_t*)ra.containers[i])[3] = value3;

stdpain avatar Sep 07 '21 04:09 stdpain

Yes. But before we add complexity, we would need some motivating benchmarks showing the the extra work is worth it.

lemire avatar Sep 07 '21 18:09 lemire

@lemire Hi ,I tested this optimization briefly in my application and the improvement for deserialization is quite obvious.

This is my test data generator:

 // the same key will add to the same bitmap
 static std::default_random_engine e;
 static std::uniform_int_distribution<int64_t> u(start_range, end_range);
 // start_range = 0
 // end_range = 5000000000
 // keys == 1000
 for(int k = 0;k < keys;k++) {
 // lines == 50000
    for(int i = 0;i < lines; ++i) {
        std::cout << k << "\t" << u(e) << '\n';
    }
 }

This is the change I made to CRoaring (an earlier version):

https://github.com/stdpain/CRoaring/commit/a0f5049478cefe93592975d1c128bc487b3aadc7

I test It In my Application

before:
mysql> select bitmap_union_count(v) from bitmap_r50E_x;
+-------------------------+
| bitmap_union_count(`v`) |
+-------------------------+
|               100000000 |
+-------------------------+
1 row in set (8.63 sec)
after:
mysql> select bitmap_union_count(v) from bitmap_r50E_x;
+-------------------------+
| bitmap_union_count(`v`) |
+-------------------------+
|               100000000 |
+-------------------------+
1 row in set (3.67 sec)

stdpain avatar Sep 08 '21 07:09 stdpain

Yes. I saw this and it makes sense to me. I'd still want someone to review it independently.

lemire avatar Sep 08 '21 15:09 lemire

I proposed a similar concept a while ago for rle but never got around writing the benchmarks. I'd advice trying to use a union for the array_container_t and infer whether we're using the pointer as array from the capacity and measuring again. That way we avoid adding a new container type, which would either require breaking backwards compatibility for the frozen format or losing this optimization when loading from them (a valid option, but if it's not too expensive to keep it I think it's better to do so). Besides, that goes in line with some proposed changes to representation that involve embedding the structs in the array anyway. Other than that, it seems OK to me, some style fixes and the like would be needed for the provided patch.

Oppen avatar Oct 22 '21 12:10 Oppen

@stdpain would you mind making a PR from your changes? Otherwise, at some unspecified point in the future would you like me to create one (with proper credit)?

Oppen avatar Apr 26 '22 20:04 Oppen

@Oppen OK

stdpain avatar May 01 '22 17:05 stdpain

I have a very ugly suggestion but you may be interested in trying it. Since our arrays are always sorted and never* empty we could simply zero-terminate them for the low count case, gaining a few more elements or even eliminating the indirection altogether, not storing the sizes at all. If the first element is zero, then there's a zero, but every other zero signals end of container.

Oppen avatar Aug 11 '22 06:08 Oppen

I have a very ugly suggestion but you may be interested in trying it. Since our arrays are always sorted and never* empty we could simply zero-terminate them for the low count case, gaining a few more elements or even eliminating the indirection altogether, not storing the sizes at all. If the first element is zero, then there's a zero, but every other zero signals end of container.

sounds great.

stdpain avatar Aug 11 '22 08:08 stdpain

More than a month later, but I put that * and never what the clarification was. There's an exception for lazy operations that may leave such a state, but you have to call repair to restore the invariant lest you want undefined behavior.

Oppen avatar Sep 21 '22 16:09 Oppen

I would love the "small bitmap optimization" described in https://github.com/RoaringBitmap/CRoaring/issues/319#issuecomment-949602215 We do something like this but with a wrapper around the type which is annoying since we have to translate all the calls. The wrapper has a union of a roaring::Roaring (this is C++) and an array of 32bit ints plus a "small count" which we overlay on the empty padding at the end of the roaring_array_t. Since sizeof(roaring_array_t) is 40 bytes and we need at least one byte for the count field, that gives us (40 - 1) / 4 = 9 values we can use to hold 32bit values, before we have to convert from the "small" bitmap to a full bitmap.

We've tested this and in our environment, at least, where we can have a wide variety of sizes of bitmap and it really helps performance in our use-cases, even with the one-time overhead of the conversion. I'm sure that if this was supported directly the overhead could be reduced further.

What we do currently is only support a simple subset of operations on this special type of bitmap. If we need to do any of the more complex operations, for example if we're not sure the result will fit into the constant size, we convert it to a normal bitmap first. What we're keeping is basically an array bitmap, but without the overhead of the size (compile-time constant) or capacity (same as the size). Maybe an embedded implementation could get some synergy there.

madscientist avatar Nov 20 '22 15:11 madscientist

Would love to see this incorporated. We have running into this situation a lot when we have sparse bitmaps that need to be merged together by a parallel algorithm.

Our use case is : We have parallel lock free algorithms creating many bitmaps that may have sparse membership and non-unique membership. We chunk many very large files into multiple segments that create their own bitmaps that are merged together at the end of processing all chunks resulting in most bitmaps being sparsely populated. Profiling revealed that insertion into these as roaring bitmaps is very slow compared to a simple bitarray.

josh08287 avatar Nov 14 '23 18:11 josh08287

@josh08287 Can you contribute a benchmark?

See https://github.com/RoaringBitmap/CRoaring/tree/master/microbenchmarks

The difficulty with this issue is that people refer to internal use cases, but there is no corresponding benchmark we can refer to. It means that this issue is unlikely to make progress.

I would never merge an optimization without a corresponding credible benchmark. It is too easy to think we are optimizing when we are not.

lemire avatar Nov 14 '23 19:11 lemire

The PR by @stdpain did not match the current code, and lacked a benchmark to demonstrate the benefits. We need contributions, if only benchmarks, for this issue to make progress.

lemire avatar Nov 14 '23 19:11 lemire

I'll start by providing a benchmark for this case

stdpain avatar Dec 28 '23 07:12 stdpain