roaring-rs
Proposal: Operator context. A new hope for memory churn
Problem statement
While performing binary operations, roaring-rs makes every effort to reuse stores from an owned bitmap that would otherwise be dropped at the end of the op. This is generally accomplished by taking the store out of the to-be-dropped bitmap, then using the assigning equivalent of the operation. A simpler version of this: given an owned `a`, the expression `a = a & b` is equivalent to `a &= b`.
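The delegation pattern above can be sketched as follows. This is an illustrative mock, not roaring-rs's actual types: `ArrayStore` and its impls here are hypothetical stand-ins showing how a by-value operator can reuse the owned left-hand store by delegating to the assigning operator.

```rust
use std::ops::{BitAnd, BitAndAssign};

#[derive(Debug, PartialEq)]
struct ArrayStore(Vec<u16>); // sorted, deduplicated values

impl BitAndAssign<&ArrayStore> for ArrayStore {
    fn bitand_assign(&mut self, rhs: &ArrayStore) {
        // In-place intersection: keep only values also present in `rhs`.
        self.0.retain(|v| rhs.0.binary_search(v).is_ok());
    }
}

impl BitAnd<&ArrayStore> for ArrayStore {
    type Output = ArrayStore;
    fn bitand(mut self, rhs: &ArrayStore) -> ArrayStore {
        // `self` is owned and would be dropped anyway, so reuse its Vec
        // instead of allocating a fresh one: `a = a & b` becomes `a &= b`.
        self &= rhs;
        self
    }
}

fn main() {
    let a = ArrayStore(vec![1, 2, 3, 5]);
    let b = ArrayStore(vec![2, 3, 4]);
    let c = a & &b; // result lives in `a`'s old allocation
    assert_eq!(c.0, vec![2, 3]);
}
```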
For bitmap stores, this is always good news. For arrays however, it depends.
For scalar ops
For operators that can only remove values, this is fine. There are simple algorithms for intersection and difference that, given slices `a` (mutable) and `b`, can traverse both in one pass, copying items from `b` or from further ahead in `a`, then truncating at the end.
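The one-pass, in-place scheme can be sketched like this (the function name is illustrative, not the crate's actual code): the write cursor never passes the read cursor, so kept values are only ever copied from further ahead in `a`, and the stale tail is truncated at the end.

```rust
// In-place intersection of two sorted u16 sequences, one pass, no allocation.
fn intersect_in_place(a: &mut Vec<u16>, b: &[u16]) {
    let mut write = 0; // next slot in `a` to fill with a kept value
    let mut i = 0; // read cursor into `a`
    let mut j = 0; // read cursor into `b`
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                a[write] = a[i]; // write <= i always holds, so this is safe
                write += 1;
                i += 1;
                j += 1;
            }
        }
    }
    a.truncate(write); // discard the stale tail
}

fn main() {
    let mut a = vec![1, 3, 5, 7, 9];
    intersect_in_place(&mut a, &[3, 4, 5, 6, 7]);
    assert_eq!(a, vec![3, 5, 7]);
}
```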
Operators that can add values are more costly to perform in place: they may need to insert values from `b` into `a`, meaning large portions of `a` may need to be copied forwards or backwards. Time complexity can be traded for space complexity by writing to a buffer rather than performing the operation in place. Surprisingly, this is how union is implemented today, but not symmetric difference: there is no impl of `BitOrAssign` for array stores, but there is one for `BitXorAssign`, complete with a curious `// TODO improve this function`.
To summarize so far:
- We want to prevent needlessly allocating when possible
- For array stores:
- For intersection and difference it does not affect the time complexity
- For union and symmetric difference it does, there are time-memory trade-offs to having an output buffer
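For contrast with the in-place case, here is a sketch of the buffered approach for union (the function name is illustrative): one pass over both sorted inputs, writing to a fresh output buffer, so nothing in `a` ever has to be shifted, at the cost of an allocation.

```rust
// Union of two sorted u16 sequences into a new output buffer.
fn union_buffered(a: &[u16], b: &[u16]) -> Vec<u16> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => { out.push(a[i]); i += 1; }
            std::cmp::Ordering::Greater => { out.push(b[j]); j += 1; }
            std::cmp::Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
        }
    }
    // One input is exhausted; the other's tail carries over unchanged.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

fn main() {
    assert_eq!(union_buffered(&[1, 3, 5], &[2, 3, 6]), vec![1, 2, 3, 5, 6]);
}
```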
SIMD
SIMD is merged, however it further complicates the issues outlined above, as none of the block algorithms (as written) can run in place. Each block is stored in full from registers into the output buffer, even though not all values in the block are valid. This is fine because we truncate the len to the number of valid values: the invalid values are overwritten by the next block write (or, for the last block, remain, but are never accessed because they are beyond the vec's len).

Intersection and difference could be made to operate in place, but that would require some mechanism for storing values without doing a simple store to the buffer. The options are either a SIMD scatter operation, or storing to an intermediate buffer and then copying to the output. Scatter latencies can be 10-15x that of a store, so storing to a small `[u16; 4]` and then copying whatever subslice is valid is probably better, but obviously still slower than a plain store.
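A scalar sketch of the intermediate-buffer idea (block size and names are illustrative, not the actual kernel): a SIMD kernel always produces a full block of lanes, but only the first `valid` lanes are meaningful, so copying just that prefix lets the result be written somewhere that may alias the input.

```rust
// Hypothetical block width; a real kernel's width depends on the target.
const LANES: usize = 8;

// Append only the valid prefix of a full register-width block.
fn write_block(out: &mut Vec<u16>, block: [u16; LANES], valid: usize) {
    // A plain SIMD store would write all LANES values and rely on a later
    // truncate; copying only `valid` lanes avoids clobbering unread input.
    out.extend_from_slice(&block[..valid]);
}

fn main() {
    let mut out = Vec::new();
    write_block(&mut out, [10, 20, 30, 0, 0, 0, 0, 0], 3);
    assert_eq!(out, vec![10, 20, 30]);
}
```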
More summary:
- Array-array SIMD ops are merged, but only wired up for code paths that do not operate in place. For the reasons laid out in the first section, this is the minority of code paths.
Note: In the experimental SIMD branch we simply let all array ops allocate a new vec. As far as time-space trade-offs go, allocating a new vec, running the vectorized algorithm, then dropping the old vec is still much faster. Unfortunately, playing this fast and loose with memory goes against much of the hard work we've already done to be conservative with it.
We also tried keeping a thread-local buffer around. It worked well, but @Kerollmops pointed out it has some portability issues.
Proposal
As a solution to the above problems, I propose introducing an operator context that is created at the operator entry point and passed down through each layer. Its role is to hold on to stores that can be reused, handing them out when needed and creating new ones when it has none. `ensure_correct_store` can also add to and remove from the context as necessary, further reducing needless memory churn. The context is of course dropped after the op has completed, along with all its owned stores.
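A minimal sketch of the proposed context, under the assumption that it only recycles array-store buffers (the type and method names here are hypothetical, not an API in the crate): buffers from consumed stores are recycled, and subsequent requests reuse them instead of hitting the allocator.

```rust
// Hypothetical operator context that pools array-store buffers for one op.
struct OpContext {
    spare_arrays: Vec<Vec<u16>>, // recycled array-store buffers
}

impl OpContext {
    fn new() -> Self {
        OpContext { spare_arrays: Vec::new() }
    }

    /// Hand out a cleared buffer, reusing a recycled one when available.
    fn take_array(&mut self, capacity: usize) -> Vec<u16> {
        match self.spare_arrays.pop() {
            Some(mut v) => {
                v.clear();
                v.reserve(capacity); // no-op if the old capacity suffices
                v
            }
            None => Vec::with_capacity(capacity),
        }
    }

    /// Recycle the buffer of a store that would otherwise be dropped.
    fn recycle(&mut self, buf: Vec<u16>) {
        self.spare_arrays.push(buf);
    }
}

fn main() {
    let mut ctx = OpContext::new();
    let buf = ctx.take_array(16); // freshly allocated: context is empty
    let ptr = buf.as_ptr();
    ctx.recycle(buf);
    let reused = ctx.take_array(8); // same allocation handed back out
    assert_eq!(reused.as_ptr(), ptr);
    assert!(reused.is_empty());
}
```

The context (and every buffer it still owns) is dropped when the operation returns, so nothing outlives the op.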
I have validated the idea by implementing a naive context that holds only one array store, wired up only for intersection (not including `ensure_correct_store`), and demonstrated that its time performance is at least as good as, if not better than, always allocating, across all datasets. Being able to hold many arrays as well as bitmap stores, plus wiring up `ensure_correct_store`, should further decrease time and memory requirements.
TL;DR: "Time or memory: pick two!" (but we pay for them with some added complexity)
This is looking very promising. After wiring everything up most SIMD operations are 8% faster, some more like 16%.
I have only glanced at this issue, but given my understanding of the description of memory access patterns, I wonder if compressing store functions (https://github.com/rust-lang/portable-simd/issues/240) would be useful?
> It is useful for targets that support compressing store operations

Yes.

> Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.

And also no.
Outstanding issues to validate proof of concept:
- Ensure the perf benefits are not only due to preserving data locality: good locality is likely in a benchmarking environment, but should not be assumed for real-world use cases. See if the benefits disappear with a shuffling allocator.
- We might be doing a better job than the system allocator, but we shouldn't assume the application using this library isn't using a different global allocator. Verify this is still a win with jemalloc and mimalloc. If not, it's not worth it.
Assuming we measure the above, and this scheme is still better than naively allocating, I keep seeing this image in my head:
We're basically implementing an operator-local allocator. It could make more sense to be explicit about that and actually impl `std::alloc::Allocator`.
This sort of optimization is going to be high LoE and will add some mid-to-high complexity code. There is no shortage of lower-hanging fruit.