
Proposal: Operator context. A new hope for memory churn

Open saik0 opened this issue 2 years ago • 4 comments

Problem statement

While performing binary operations, roaring-rs makes every effort to reuse stores from an owned bitmap that will otherwise be dropped at the end of the op. This is generally accomplished by taking the store out of the to-be-dropped bitmap, then using the assigning equivalent of the operation. A simpler version of this: given an owned a, the expression a = a & b is equivalent to a &= b.
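
To make the pattern concrete, here is a minimal sketch using a hypothetical ArrayStore type (not the actual roaring-rs internals): the owning operator takes self by value and falls through to the assigning operator, so the result reuses a's buffer instead of allocating a new store.

```rust
use std::ops::{BitAnd, BitAndAssign};

// Hypothetical stand-in for an array store: sorted, deduplicated values.
struct ArrayStore(Vec<u16>);

impl BitAnd<&ArrayStore> for ArrayStore {
    type Output = ArrayStore;
    fn bitand(mut self, rhs: &ArrayStore) -> ArrayStore {
        // `self` is owned and would be dropped anyway, so fall through to
        // the assigning form and reuse its buffer: `a = a & b` becomes `a &= b`.
        self &= rhs;
        self
    }
}

impl BitAndAssign<&ArrayStore> for ArrayStore {
    fn bitand_assign(&mut self, rhs: &ArrayStore) {
        // In-place intersection: keep only values also present in `rhs`.
        self.0.retain(|v| rhs.0.binary_search(v).is_ok());
    }
}
```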

For bitmap stores, this is always good news. For arrays however, it depends.

For scalar ops

For operators that can only remove values, this is fine. There are simple algorithms for intersection and difference that, given a mutable slice a and a slice b, can traverse both in one pass, copying items from b or from further ahead in a, then truncating at the end.
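
A sketch of that kind of single-pass algorithm, shown here for intersection (illustrative only, not the exact roaring-rs implementation): the write cursor trails the read cursor in a, and a is truncated to the number of surviving values at the end.

```rust
// In-place intersection of two sorted, deduplicated u16 sequences.
fn intersect_in_place(a: &mut Vec<u16>, b: &[u16]) {
    let (mut read, mut j, mut write) = (0, 0, 0);
    while read < a.len() && j < b.len() {
        match a[read].cmp(&b[j]) {
            std::cmp::Ordering::Less => read += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                a[write] = a[read]; // copy from further ahead in `a`
                write += 1;
                read += 1;
                j += 1;
            }
        }
    }
    a.truncate(write); // drop the stale tail
}
```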

Operators that can add values are more costly to perform in place, as they may need to insert values from b into a, meaning large portions of a may need to be copied forwards or backwards. Time complexity can be traded for space complexity by writing to a buffer rather than performing the operation in place. Surprisingly, this is how union is implemented today, but not symmetric difference: there is no impl of BitOrAssign for array stores, but there is one for BitXorAssign, with a curious // TODO improve this function.
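
For contrast, a sketch of the buffered approach for union (again illustrative, not the actual implementation): writing to a fresh output buffer keeps the merge a simple single pass, at the cost of allocating up to a.len() + b.len() values.

```rust
// Union of two sorted, deduplicated u16 slices into a fresh output buffer.
fn union_buffered(a: &[u16], b: &[u16]) -> Vec<u16> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => { out.push(a[i]); i += 1; }
            std::cmp::Ordering::Greater => { out.push(b[j]); j += 1; }
            std::cmp::Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
        }
    }
    // At least one of these tails is empty; append whatever remains of the other.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```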

To summarize so far:

  1. We want to prevent needlessly allocating when possible
  2. For array stores:
    • For intersection and difference, operating in place does not affect the time complexity
    • For union and symmetric difference it does: there are time-memory trade-offs to using an output buffer

SIMD

SIMD is merged, however it further complicates the issues outlined above, as none of the block algorithms (as they are written) can run in place. The blocks are stored in full from registers into the output buffer, but not all of the values in a block are valid. This is fine, because we truncate the len to the number of valid values, so the invalid values are overwritten by the next block write (or, for the last block, remain but are never accessed, because they are beyond the vec len). Intersection and difference could be made to operate in place, but that would require some mechanism for storing values other than a simple store to the buffer. The options are either a SIMD scatter operation, or storing to an intermediate buffer and then copying to the output. Scatter latencies can be 10-15x that of a store; storing to a small [u16; 4] and then copying whatever subslice is valid is probably better, but obviously still slower than just the store.
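
The "store the full block, keep only the valid prefix" pattern looks roughly like the following scalar stand-in (the real code stores a whole SIMD register; the block width of 8 lanes here is only for illustration):

```rust
// Append a full block to `out`, then keep only the `valid` leading values.
// The trailing garbage is either overwritten by the next block write or
// left beyond the vec len, where it is never read.
fn write_block(out: &mut Vec<u16>, block: [u16; 8], valid: usize) {
    let len = out.len();
    out.resize(len + 8, 0);                     // room for an unconditional full store
    out[len..len + 8].copy_from_slice(&block);  // store the whole block
    out.truncate(len + valid);                  // only the first `valid` lanes count
}
```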

More summary:

  1. Array-array SIMD ops are merged, but only wired up for code paths that do not operate in place. For the reasons described in the first section, these are the minority of code paths.

Note: In the experimental SIMD branch we just decided to let all array ops allocate a new vec. As far as time-space tradeoffs go, allocating a new vec, using the vectorized alg, then dropping the old one is still much faster. Unfortunately, playing this fast and loose with memory goes against much of the hard work we've already done to be conservative with it.

We also tried keeping a thread local buffer around. It worked well but @Kerollmops pointed out it's got some portability issues.

Proposal

As a solution to the above problems I propose introducing an operator context that is created at the operator entry point and passed down through each layer. Its role is to hold on to stores that can be reused, and to either hand them out when necessary or create new ones if it has none. ensure_correct_store can also add to and remove from the context as necessary, further reducing needless memory churn. The context will of course be dropped after the op has completed, along with all of its owned stores.
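
A rough sketch of what such a context could look like (names and shape are hypothetical, not a concrete API proposal):

```rust
// Hypothetical operator context: recycles array-store buffers within one op.
#[derive(Default)]
struct OpContext {
    arrays: Vec<Vec<u16>>, // buffers freed earlier in the op, ready for reuse
}

impl OpContext {
    /// Hand out a cleared buffer, reusing a recycled one when available.
    fn take_array(&mut self, capacity: usize) -> Vec<u16> {
        match self.arrays.pop() {
            Some(mut v) => {
                v.clear();
                v.reserve(capacity);
                v
            }
            None => Vec::with_capacity(capacity),
        }
    }

    /// Accept a no-longer-needed buffer (e.g. from a dropped store or from
    /// ensure_correct_store) so a later layer can reuse it.
    fn recycle_array(&mut self, array: Vec<u16>) {
        self.arrays.push(array);
    }
}
// Created at the operator entry point, threaded through each layer, and
// dropped (together with its buffers) once the op completes.
```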

I have validated the idea by implementing a naive context that only holds one array store and only wires up adding/removing for intersection (not including ensure_correct_store), and demonstrated that its time performance was at least as good as, if not better than, always allocating, across all datasets. Being able to hold many arrays, and also bitmap stores, plus wiring up ensure_correct_store, should further decrease time and memory requirements.

[Screenshot: Screen Shot 2022-02-02 at 3 01 45 PM]

criterion.zip

TL;DR: "Time or memory: pick two!" (but we pay for them with some added complexity)

saik0 avatar Feb 02 '22 23:02 saik0

This is looking very promising. After wiring everything up most SIMD operations are 8% faster, some more like 16%.

saik0 avatar Feb 05 '22 18:02 saik0

I have only glanced at this issue, but given my understanding of the description of memory access patterns, I wonder if compressing store functions would be useful?

workingjubilee avatar Feb 07 '22 00:02 workingjubilee

I wonder if https://github.com/rust-lang/portable-simd/issues/240 functions would be useful?


It is useful for targets that support compressing store operations

Yes.

Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.

And also no.
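
In scalar terms, a compressing store is roughly the following (a hypothetical helper, not the portable-simd API): lanes selected by a mask are written contiguously to the output, which is exactly the step the array kernels need. Without hardware support it lowers to this kind of branchy loop, which is why it only sometimes helps.

```rust
// Scalar illustration of a masked "compressing store".
fn compress_store_scalar(lanes: [u16; 8], mask: [bool; 8], out: &mut Vec<u16>) {
    for (value, keep) in lanes.into_iter().zip(mask) {
        if keep {
            out.push(value); // selected lanes are packed contiguously
        }
    }
}
```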

saik0 avatar Feb 07 '22 05:02 saik0

Outstanding issues to validate proof of concept:

  • Ensure the perf benefits are not only due to preserving data locality: good locality is likely in a benchmarking environment, but should not be assumed for real-world use cases. See if the benefits disappear with a shuffling allocator.
  • We might be doing a better job than the system allocator, but let's not assume the application using this library isn't using a different global allocator. Verify this is still a win with jemalloc and mimalloc. If not, it's not worth it.

Assuming we measure the above, and this scheme is still better than naively allocating, I keep seeing this image in my head:

We're basically implementing an operator-local allocator. It could make more sense to just be explicit about that and actually impl std::alloc::Allocator.
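
For illustration, a minimal sketch of what that could look like on nightly with the unstable allocator_api (hypothetical, and glossing over parts of the safety contract): freed blocks stay in the context and are handed back out for matching layouts, and everything goes back to the global allocator when the context is dropped.

```rust
#![feature(allocator_api)]
use std::alloc::{AllocError, Allocator, Global, Layout};
use std::cell::RefCell;
use std::ptr::NonNull;

// Hypothetical operator-local allocator that recycles freed blocks.
struct OpContextAlloc {
    free: RefCell<Vec<(NonNull<[u8]>, Layout)>>,
}

unsafe impl Allocator for OpContextAlloc {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        let mut free = self.free.borrow_mut();
        // Reuse a previously freed block if one matches the requested layout.
        if let Some(pos) = free.iter().position(|(_, l)| *l == layout) {
            return Ok(free.swap_remove(pos).0);
        }
        Global.allocate(layout)
    }

    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout) {
        // Keep the block for reuse instead of freeing it right away.
        self.free
            .borrow_mut()
            .push((NonNull::slice_from_raw_parts(ptr, layout.size()), layout));
    }
}

impl Drop for OpContextAlloc {
    fn drop(&mut self) {
        // Return everything to the global allocator when the op is done.
        for (ptr, layout) in self.free.borrow_mut().drain(..) {
            unsafe { Global.deallocate(ptr.cast(), layout) };
        }
    }
}
```

Stores could then be Vec<u16, &OpContextAlloc> via Vec::new_in, keeping the reuse logic out of the operator code itself.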

This sort of optimization is going to be high LoE and will add some mid-to-high-complexity code. There is no shortage of lower-hanging fruit.

saik0 avatar Feb 11 '22 22:02 saik0