CRoaring icon indicating copy to clipboard operation
CRoaring copied to clipboard

Revisit Per Bitmap Allocators

Open Dr-Emann opened this issue 3 months ago • 4 comments

Is your feature request related to a problem? Please describe.

  • I want to be able to be able to control allocation of roaring bitmaps in a thread-safe way, in a library which may not control all uses of roaring bitmaps in the process.
  • I want to be able to e.g. use an allocator that gives out memory e.g. from a non-global slab allocator.
  • As a maintainer of the croaring-rs rust bindings, I want to be able to provide a safe rust API to control allocation in roaring bitmaps:
    • Currently the function to configure a global allocator must be unsafe because the caller must ensure there are not any objects allocated by CRoaring at the time the function to configure the global allocator is called (or races to interact with croaring in another thread)

Describe the solution you'd like I would like to have the option to specify a custom allocator for each bitmap. I would like the allocator functions to take in a 'user_data' parameter, which will also be stored with the bitmap, to allow non-static data to be used for allocation.

I'm imagining an API like:

typedef struct roaring_allocator_s {
    void* (*malloc)(size_t size, void* user_data);
    void* (*realloc)(void* ptr, size_t size, void* user_data);
    void* (*calloc)(size_t count, size_t size, void* user_data);
    void (*free)(void* ptr, void* user_data);
    void* (*aligned_malloc)(size_t alignment, size_t size, void* user_data);
    void (*aligned_free)(void* ptr, void* user_data);
} roaring_allocator_t;

roaring_bitmap_t *roaring_bitmap_create_with_allocator(const roaring_allocator_t *alloc, void *user_data);

Describe alternatives you've considered

If I control all uses of CRoaring in the process, I can come close to safely using a global allocator which relies on global thread local state to refer to non-local data

Additional context

Previous Issues, PRs, discussions:

  • #271 - A per-bitmap allocator, but each container also got a custom allocator. I will argue below that I do not believe this is necessary.
  • #284 - Issue discussing custom allocators after #271 was closed.
  • #358 - Implemented the current global allocator hooks
  • #638 - Somewhat related, the fact that allocators can't safely return null is unfortunate for custom allocators (it would be cool to make a stack-allocator with just an upper limit of memory it will give out).

Previous dismisal about per-bitmap allocators:

CRoaring has the concept of shared container, containers that can be part of several bitmaps. Given that these users would like to have different bitmaps use different memory allocators, this means that we somehow need to trace memory allocation at the container level (bitmaps are made of small independent entities called containers) while resolving the conflict for shared containers if any. link

A per-bitmap dynamically chosen allocator is not something we are going to do. It is has much too great an overhead. It is not just the allocation, but also follow-up operations (deallocation) that need to be tracked and the bitmaps interact with each others, a container allocated in one bitmap, with one memory system could end up within another bitmap that uses another memory allocator. This implies that every container, and some of them can be just a few bytes, needs to track its source. It adds memory usage, computational overhead, complexity overhead, etc. link

I argue that we do not need to track allocators per container, only per bitmap. As far as I can see, there is no way for a container to move from one bitmap to another with the operations we currently expose. If COW is enabled, containers can be shared between bitmaps, which could lead to issues if they are shared between bitmaps with different allocators. However, COW is already documented to be dangerous to use inconsistently. I believe we can just document that if when using custom allocators in combination with COW, it's up to the user to ensure that all allocators match for all bitmaps involved in operations.

Other than COW, containers are created for a specific bitmap, and never leave it, and are deallocated in the context of working with the same bitmap, so they don't need to track their allocator separately.

Performance We already have an indirect function call for every allocation operation after #358.

I envision something like the following (in an internal header):

typedef struct roaring_allocator_context_s {
    const roaring_allocator_t* alloc;
    void* alloc_data;
} roaring_allocator_context_t;

#define ALLOCATOR_CTX_GLOBAL ((roaring_allocator_context_t){NULL, NULL})

static inline void* allocator_malloc(roaring_allocator_context_t alloc_ctx,
                                     size_t size) {
    return alloc_ctx.alloc ? alloc_ctx.alloc->malloc(size, alloc_ctx.alloc_data)
                           : roaring_malloc(size);
}

static inline roaring_allocator_context_t roaring_bitmap_get_allocator_internal(roaring_bitmap_t *r) { /* ... */ }

Most internal functions (which can allocate, even transitively) would need to be updated to take a roaring_allocator_context_t parameter.

I believe this will have a negligible performance impact, but would need to benchmark to verify.

Are you willing to contribute code or documentation toward this new feature? Yes.

Dr-Emann avatar Sep 10 '25 06:09 Dr-Emann

Seems reasonable to me but I am still preoccupied by the complexity. Unavoidably, a good chunk of the container code will need to be modified to accept new parameters. This will propagate everywhere.

Currently the function to configure a global allocator must be unsafe because the caller must ensure there are not any objects allocated by CRoaring at the time the function to configure the global allocator is called (or races to interact with croaring in another thread)

If that's the motivation, why don't we make the allocator thread_local?

Or maybe we find a thread-safe way to assign allocator, but only once. First time you allocate memory, if there is no allocator, it gets assigned... so that you can only have one allocator ever in the entire life of your code (although we could add an override for experts).

Are you quite sure that you want an allocator per bitmap.

I am not going to oppose it, but I want you to be quite sure.

lemire avatar Sep 10 '25 20:09 lemire

With a lot of care, we can emulate per-bitmap allocator with a just a thread local allocator, by setting our allocator before every operation using our bitmap, then restoring the previous one after.

Would need to be a thread local override, I believe, to preserve the existing behavior of setting the allocator and having it apply globally, but I'm not against that as an option instead.

Dr-Emann avatar Sep 12 '25 01:09 Dr-Emann

@Dr-Emann

You are a member of the core team. If you want to do this, I don't think anyone should have the authority to stop you. I trust that if you do it, it will be well executed.

No matter how you square this, it will require a giant PR.

lemire avatar Sep 12 '25 19:09 lemire

Just providing a thread local for the allocator is actually not a huge diff, but it's seeming like a pretty rough performance impact (on the order of ~5% perf hit on repeated clones of a bitmap even when not used).

I think that's at least partly because thread locals when compiled in PIC require a function call, looking to see if we can at least avoid that when compiling a static library, or switch to a different TLS model (-ftls-model=initial-exec?)

Dr-Emann avatar Sep 14 '25 04:09 Dr-Emann