redesign how containers are stored: move `typecodes` into tagged pointers, replace pointers with actual container values, and so on
The roaring_array_t could be optimized by moving the typecode enum into the pointer itself. On x86-64, allocated containers are aligned, which frees the low 3 bits of the pointer. That is enough space to encode the container type in the pointer itself. The goal is to minimize memory usage (and thus improve cache friendliness).
Note that it should also be possible to store the key (prefix) in this pointer too, since only 48 bits are addressable, but this would impact the initial binary search for a prefix container.
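A minimal sketch of the low-bit tagging, assuming containers come from an allocator with at least 8-byte alignment (function names here are illustrative, not existing CRoaring API):

    #include <assert.h>
    #include <stdint.h>

    #define TYPE_MASK ((uintptr_t)0x7) /* low 3 bits carry the typecode */

    static inline void *container_tag(void *c, uint8_t typecode) {
        assert(((uintptr_t)c & TYPE_MASK) == 0); /* relies on alignment */
        return (void *)((uintptr_t)c | typecode);
    }

    static inline uint8_t container_typecode(void *tagged) {
        return (uint8_t)((uintptr_t)tagged & TYPE_MASK);
    }

    static inline void *container_untag(void *tagged) {
        return (void *)((uintptr_t)tagged & ~TYPE_MASK);
    }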
Sounds like a good idea, but I think this needs to be synced with the work that @owenkaser is doing.
Unfortunately, I don't expect to do anything more for at least a week, maybe two, so the risk of merge conflicts is low. Packing the type code in the low-order bits seems like a good idea. Avoiding the separate `uint8_t *` and `uint8_t` type-code parameters will probably pay for the bit masking.
If we borrow the high order 16 bits of the pointer for key, do we know how many years it will be until Intel decides that those bits are needed?
Also, it might be nicer (cachewise) to conduct most of the binary search over the more concise uint16_t array than the new alternative. -owen
I also really like the idea of hiding away the type-code parameters. Even better if we can find a way to do it that is elegant.
@fsaintjacques : want to tackle this optimization?
If we borrow the high order 16 bits of the pointer for key, do we know how many years it will be until Intel decides that those bits are needed? Also, it might be nicer (cachewise) to conduct most of the binary search over the more concise uint16_t array than the new alternative.
I am guessing that we can check the addressable range using the cpuid instruction. So you could do it, but add a check when you build the code. I am more concerned with the loss of code clarity and the potential for (small) performance losses.
Looks like 16 TB of RAM in Java is a thing
https://www.linkedin.com/pulse/javaone-2015-operating-16tb-jvm-antoine-chambille
If we borrow the high order 16 bits of the pointer for key, do we know how many years it will be until Intel decides that those bits are needed?
Unknown, but I don't think it will be a problem in practice. The current belief is that 48-bit virtual addresses should hold us for quite some time to come. HP's "Machine" is one of the largest memory systems currently proposed, and they have announced that they will stick with 48-bit virtual addresses: http://www.nextplatform.com/2016/01/25/the-bits-and-bytes-of-the-machines-storage/
It's worth noting that with paging, the physical address space of the whole machine can be much larger than the virtual address space of each process. It's also worth noting that while these high bits are not currently used, the AMD64 spec (the official name of what is colloquially called x64 or x86_64) requires that they all be copies of bit 47 (the "canonical address" rule), so masking will be required.
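A sketch of what borrowing the high 16 bits for a key could look like, including the canonical-address repair on extraction (illustrative names; assumes the common compiler behavior that right-shifting a signed value is arithmetic):

    #include <stdint.h>

    static inline uint64_t key_tag(void *p, uint16_t key) {
        return ((uint64_t)key << 48) | ((uint64_t)(uintptr_t)p & 0xFFFFFFFFFFFFULL);
    }

    static inline uint16_t key_of(uint64_t tagged) {
        return (uint16_t)(tagged >> 48);
    }

    static inline void *ptr_of(uint64_t tagged) {
        /* shift left then arithmetic shift right to sign-extend bit 47 */
        return (void *)(uintptr_t)((int64_t)(tagged << 16) >> 16);
    }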
Another possibility we might consider would be dropping down to 32-bit "offsets" instead of storing pointers. I'm not sure whether this is possible, but in addition to shaving some size it might allow for better integration with mmap(). That is, if instead of storing 64-bit pointers in memory we were to store offsets, with each address computed as base + offset * 16 B (or 32 B), we might be able to have the on-disk representation match the in-memory representation.
I'd be interested in talking over the potential benefit of this approach if anyone is interested.
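A sketch of the offset scheme, under the assumption of 16-byte units (names are illustrative only):

    #include <stddef.h>
    #include <stdint.h>

    /* A 32-bit offset in 16-byte units reaches 64 GB from the base address
     * while halving the size of each index entry. */
    static inline void *offset_to_ptr(void *base, uint32_t off) {
        return (char *)base + (size_t)off * 16;
    }

    static inline uint32_t ptr_to_offset(void *base, void *p) {
        return (uint32_t)(((char *)p - (char *)base) / 16);
    }

Because only offsets are stored, the structure stays valid at whatever base address mmap() happens to return, which is what would let the on-disk and in-memory representations coincide.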
Also, it might be nicer (cachewise) to conduct most of the binary search over the more concise uint16_t array than the new alternative.
I strongly agree with this: anything we are going to search over should be packed as densely as it can be. Not doing so hurts cache efficiency, wastes memory bandwidth, and reduces the gain from vectorization as SIMD continues to get wider. Splitting data into parallel dense streams is a worthy target.
I am guessing that we can check the addressable range using the cpuid instruction. So you could do it, but add a check when you build the code. I am more concerned with the loss of code clarity and the potential for (small) performance losses.
Looks like yes: CPUID with 80000008H in EAX returns the number of physical address bits in AL (EAX[7:0]) and the number of "linear" address bits in AH (EAX[15:8]). I think "linear" is the same as "virtual", but I'm not certain.
--nate
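For reference, a small sketch of that check, assuming a GCC/Clang toolchain where <cpuid.h> provides __get_cpuid:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        /* Extended leaf 0x80000008: EAX[7:0] = physical address bits,
         * EAX[15:8] = linear (virtual) address bits. */
        if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            printf("physical: %u bits, linear: %u bits\n",
                   eax & 0xFF, (eax >> 8) & 0xFF);
        }
        return 0;
    }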
As I said initially, the uint16_t keys are not worth packing. I will concentrate on packing the type tags.
What @nkurz suggests, with 32-bit offsets, is appealing, and I wonder whether this design issue (which goes beyond the current issue) shouldn't be discussed seriously.
What @owenkaser implemented is perfectly reasonable, but there are other interesting alternatives. Why don't we pack the container structs into one array, making sure that they all have the same size (possibly using padding)? This is a variation on Nate's idea, but where the offsets are just flat. Think about summing the cardinalities of all the containers or doing a select:
https://github.com/RoaringBitmap/RoaringBitmap/blob/master/src/main/java/org/roaringbitmap/buffer/ImmutableRoaringBitmap.java#L1005-L1026
If all the structs are stored consecutively, you avoid cache misses.
In the current code, computing the cardinality of a roaring bitmap would mean accessing each and every container... possibly generating cache issues.
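A sketch of that scan, assuming the fixed-size container struct proposed further down (names are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    struct container {          /* the fixed-size layout discussed below */
        int32_t cardinality;
        int32_t capacity;
        void *data;
    };

    /* With the structs stored consecutively, the sum is one linear scan:
     * no per-container pointer dereference, so hardware prefetching works. */
    static uint64_t total_cardinality(const struct container *arr, size_t n) {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += (uint64_t)arr[i].cardinality;
        return sum;
    }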
Another thing to consider is supporting copy-on-write at the container level.
Another issue is that with memory-mapped files, the pointer might not be word-aligned.
It seems to me that we would win most by "standardizing the container size". They are all small, and all made of pointers to other data elsewhere. So, really, the pointer-to-container thing is a waste.
The 3 structs are of the following form:

    struct container {
        int32_t cardinality;
        int32_t capacity; // not used in bitset
        void *pointer_to_data;
    };
This fits in 128 bits (capacity can probably be shrunk with further limitations).
The offset strategy would make it very cumbersome to resize the run and array containers at will AND to make it work with a memory-mapped backend.
@fsaintjacques
Right. Details aside, we can have the struct fit in a nice 128 bits, as you say. It is a nice round number. There will be some waste, but if you save a pointer, you save 64 bits...
We have enough space in there to have the typecodes... for example, you allocate 32 bits to cardinality, but none of the containers can store more than 1<<16 values... so there is a good margin.
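For illustration, one hypothetical way to fold the typecode into that margin without touching the pointer:

    #include <stdint.h>

    struct container {
        uint32_t cardinality : 24; /* at most 1 << 16, so 17 bits suffice */
        uint32_t typecode    :  8; /* tag rides in the spare cardinality bits */
        int32_t  capacity;         /* unused for bitsets */
        void    *data;
    };                             /* still 16 bytes on x86-64 */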
are you thinking of maintaining a containers array per bitmap (perhaps kept ordered by key), or one shared by all bitmaps (presumably disordered, and using some freelist approach)?
@owenkaser
I was thinking about one container array.
If you are to implement copy-on-write, then you'd have to copy the container struct... but it is only 128 bits.
I am almost certain it would use less memory (saving the pointer) and be slightly faster.
If we limit capacity to 2^16 - 1, we can shrink it to a uint16_t and properly use a uint8_t without bit hacks, but then the number of elements in the array might not be a multiple of the __m256i width (so many things to consider!).
By having containers in an array, what do you think of pre-allocating in slabs of 4 KB?

    struct roaring {
        uint16_t n_containers;
        uint16_t *prefixes;
        // n_slabs can be derived from n_containers
        struct container_slab *slabs;
    };

    // fits in a page.
    struct container_slab {
        struct container containers[256];
    };
Slabs have the nice property of being multiples of PAGE_SIZE, thus easily mmap-able.
Note that it doesn't change much from a standard array, but the underlying memory allocator probably allocates in chunks of 4 KB anyway.
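A compile-time sanity check for that property (C11, assuming the 16-byte container layout above and 64-bit pointers):

    #include <stdint.h>

    struct container { int32_t cardinality; int32_t capacity; void *data; };
    struct container_slab { struct container containers[256]; };

    /* 256 x 16 bytes = 4096 bytes: one slab fills exactly one 4 KB page. */
    _Static_assert(sizeof(struct container_slab) == 4096,
                   "slab must fill exactly one page");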
@lemire
Is there any reason to think that copy on write would usually be a big win? My impression is that it would have to come with awkward machinery for reference counting etc, so that we know when to free memory. Your Go implementation uses the language's GC, right?
I would skip the COW mechanism for the first implementation.
@lemire With one big containers array shared by all bitmaps, your dream of an efficient scan to find a single bitmap's cardinality may be hard to realize. Are there mmapping consequences too?
@fsaintjacques Within a slab, would you allow containers from multiple bitmaps? If not, how would we stop tiny bitmaps from causing internal fragmentation?
@owenkaser
It is not "my" Go implementation, but yes, it relies on the language's GC.
I do not plan to implement COW in CRoaring. I am just playing devil's advocate so that we do not make the design less flexible than it needs to be.
@fsaintjacques
In many applications, you only have a handful of containers per bitmap. Allocating them in blocks of 256 would be terribly wasteful. It seems much more prudent to allocate arrays in the usual manner... while keeping space for further tuning through customized memory allocation as an extra feature.
@fsaintjacques
I think that going with 128-bit containers is the wisest course of action, and that's the one you suggested.
Right now, we are keeping it close to the Java/Go implementations, but we might part ways at some point... it would be interesting then to have enough "room" to support other container types.
Shall we define only one struct and use typedefs, e.g.

    struct container_s {
        int32_t cardinality;
        uint32_t capacity;
        void *data;
    };

    typedef struct container_s bitset_container_t;
    typedef struct container_s array_container_t;
    typedef struct container_s run_container_t;
I'll open a branch for this experimentation.
@fsaintjacques
Is it possible to do the job with a union?
How would you do the union? I foresee a declaration order ambiguity if the tag is in the pointer to the data. Do you envision something like:
    /**
     * Requires an implicit ordering in the definitions of the container structs.
     * This is required to find the tag type. But the sizeof(container_t) will
     * always be correct and explicit.
     */
    union container_u {
        bitset_container_t bitset;
        array_container_t array;
        run_container_t run;
    };
@fsaintjacques
I don't know how to make it work but the piece of code is elegant, isn't it?
The original proposition is succinct and explicit but might make future expansion harder. The union version is friendly to future expansion but induces a dangerous ordering (and alignment!) dependency; that can be solved with an explicit enum/tag, but then we lose the 128-bit packing per container type.
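For concreteness, a sketch of the explicit-tag variant being weighed here (hypothetical names):

    #include <stdint.h>

    struct container_s { int32_t cardinality; uint32_t capacity; void *data; };
    typedef struct container_s bitset_container_t;
    typedef struct container_s array_container_t;
    typedef struct container_s run_container_t;

    typedef enum { BITSET_TYPE, ARRAY_TYPE, RUN_TYPE } container_type_t;

    /* The explicit tag removes the ordering/alignment dependency, but the
     * enum plus padding pushes each entry past the 128-bit target. */
    struct tagged_container {
        container_type_t type;
        union {
            bitset_container_t bitset;
            array_container_t array;
            run_container_t run;
        } u;
    };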
I'm looking over the run code, and I'll have to go over it. I spotted a few bugs.
@fsaintjacques We now have four container types, including the COW containers.
We need to revisit this point, as I think that CRoaring still leaves performance on the table.
This issue is still outstanding. Instead of using a pointer to a container, we should be using a union type. The current design works, but it is unnecessarily inefficient.
The roaring_array_t could be optimized by moving the typecode enum in the pointer itself. On x86-64 pointers must be aligned, thus freeing the last 3 bits. That's enough space to specify the type of containers in the pointer itself. The goal is to minimize memory usage (and thus cache friendliness).
Although pointer tagging is a known technique used by projects like V8, the very fact that it is commonly exploited means there is a real chance that some layer of a system will already be using those bits.
This means that less common compilers and transpilers may leverage it in their C implementation itself... or be built on a runtime stack that uses it somewhere.
So I'd be wary of doing this if your codebase is not itself the platform. A good generic C bitset library should conform to the standard, to make it usable anywhere.
That said: as long as it can be turned off with a #define somehow, it's not a bad idea if it doesn't complicate other things too much. Though I'd suggest prioritizing other reorganizations first.