fixedbitset
Multiple Optimizations
This PR adds the following optimizations and additions:
- Changed `Block` to be a typedef of `usize` instead of `u32`, which should need roughly 2x fewer instructions for bit sets of similar length on 64-bit machines, and lets us skip larger empty runs in sparse bitsets (see the sketch below).
- ~~Changed out the `Vec` for a `SmallVec`. Using smallvec's `union` feature, we can store the first 128 bits (on 64-bit machines) inline without allocations. The size of a `FixedBitSet` remains unchanged.~~
- `Intersection`, `Union`, and `Difference` should be faster for denser bitsets by merging at the block level instead of doing individual bit checks.
- Added a `zeroes` function that complements the `ones` iterator.
- Moved all of the tests to a `#[cfg(test)]` module so that they're not compiled during non-test builds. This seems to have significantly improved the compile time of the crate.
~~Unfortunately, the second point there requires bumping the MSRV up to 1.51. Not sure if this is an acceptable change, or if I should put it behind a feature flag.~~
This partially addresses #73, though full SIMD support is probably more desirable here.
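For illustration, here is a minimal sketch of the block-level merging idea; `Block`, `BITS`, and `union_with_blocks` are hypothetical names for this example, not the crate's exact internals:

```rust
// Illustrative sketch only: `Block`, `BITS`, and `union_with_blocks` are
// hypothetical names, not fixedbitset's exact internals.
type Block = usize;
const BITS: usize = std::mem::size_of::<Block>() * 8; // 64 on 64-bit targets, 32 on 32-bit

// Block-level union: one OR per machine word instead of one check per bit.
fn union_with_blocks(dst: &mut [Block], src: &[Block]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d |= *s;
    }
}

fn main() {
    let mut a = vec![0b1010usize, 0];
    let b = vec![0b0101usize, 1 << 3];
    union_with_blocks(&mut a, &b);
    assert_eq!(a, vec![0b1111, 1 << 3]);
    assert_eq!(BITS, usize::BITS as usize);
}
```

Intersection and difference follow the same pattern with `&` and `& !`, and with `usize` blocks a 64-bit machine touches half as many words per merge.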
Reran the benchmarks. Some insights:
- `contains` is definitely slower due to the extra inline (on-the-stack) vs. heap-allocated check that smallvec has to do.
- As expected a nearly 2x speedup on 64-bit machines when iterating over more sparse bitsets.
- Moving the smallvec check outside of the inner loop of `insert_range` keeps its performance largely the same.
- The slower time when iterating over dense bitsets is unexpected and needs more investigation.
| name | master ns/iter | usize ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| bench_insert_range | 1,217 | 1,289 | 72 | 5.92% | x 0.94 |
| bench_insert_range_using_loop | 793,870 | 1,141,606 | 347,736 | 43.80% | x 0.70 |
| bench_iter_ones_all_ones | 314,546 | 442,401 | 127,855 | 40.65% | x 0.71 |
| bench_iter_ones_all_zeros | 8,505 | 4,510 | -3,995 | -46.97% | x 1.89 |
| bench_iter_ones_using_contains_all_ones | 544,155 | 543,391 | -764 | -0.14% | x 1.00 |
| bench_iter_ones_using_contains_all_zeros | 543,840 | 516,121 | -27,719 | -5.10% | x 1.05 |
| bench_iter_ones_using_slice_directly_all_ones | 416,640 | 446,340 | 29,700 | 7.13% | x 0.93 |
| bench_iter_ones_using_slice_directly_all_zero | 8,513 | 4,267 | -4,246 | -49.88% | x 2.00 |
Feels like this should be behind a feature flag(?). For some applications, the performance of `contains` may be the main bottleneck.
@Calandiel I ended up removing the smallvec changes; it's indeed an unacceptable overhead. Reran the benchmarks, which seems to address the insert and all_ones cases.
| name | master ns/iter | usize ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| bench_insert_range | 1,260 | 1,212 | -48 | -3.81% | x 1.04 |
| bench_insert_range_using_loop | 794,435 | 962,799 | 168,364 | 21.19% | x 0.83 |
| bench_iter_ones_all_ones | 314,486 | 422,858 | 108,372 | 34.46% | x 0.74 |
| bench_iter_ones_all_zeros | 8,503 | 4,277 | -4,226 | -49.70% | x 1.99 |
| bench_iter_ones_using_contains_all_ones | 544,015 | 518,847 | -25,168 | -4.63% | x 1.05 |
| bench_iter_ones_using_contains_all_zeros | 543,870 | 518,844 | -25,026 | -4.60% | x 1.05 |
| bench_iter_ones_using_slice_directly_all_ones | 416,365 | 448,484 | 32,119 | 7.71% | x 0.93 |
| bench_iter_ones_using_slice_directly_all_zero | 8,504 | 4,280 | -4,224 | -49.67% | x 1.99 |
@jrraymond mind taking a look? fixedbitset is used as a core component in Bevy's ECS scheduler, and this PR optimizes many of the operations behind it.
Thanks James. I see about the same speedups:
| benchmark | usize ns/iter | deviation | u32 ns/iter | speedup |
|---|---|---|---|---|
| bench_insert_range | 3,862 | 416 | 3,807 | 0.99 |
| bench_insert_range_using_loop | 2,089,174 | 17,710 | 2,165,208 | 1.04 |
| bench_iter_ones_all_ones | 710,358 | 4,944 | 793,514 | 1.12 |
| bench_iter_ones_all_zeros | 4,911 | 107 | 9,790 | 1.99 |
| bench_iter_ones_using_contains_all_ones | 543,766 | 13,625 | 523,399 | 0.96 |
| bench_iter_ones_using_contains_all_zeros | 543,237 | 5,127 | 523,860 | 0.96 |
| bench_iter_ones_using_slice_directly_all_ones | 414,222 | 2,289 | 367,529 | 0.89 |
| bench_iter_ones_using_slice_directly_all_zero | 4,909 | 99 | 9,797 | 2.00 |
How do you feel about splitting this up into smaller PRs for 1) u32 -> usize, 2) adding the zeroes iterator, and 3) moving the tests to a separate module? As-is, the PR is pretty huge.
@jrraymond I've updated this PR to only include the usize changes. Please take another look.
Before I merge this, I would like to resolve the performance regressions on the contains benchmarks:
| benchmark | old | dev | new | dev | pct change |
|---|---|---|---|---|---|
| test bench_insert_range | 3,915 ns/iter | (+/- 322) | 3,828 ns/iter | (+/- 1,328) | 2.22% |
| test bench_insert_range_using_loop | 2,087,287 ns/iter | (+/- 5,895) | 2,087,487 ns/iter | (+/- 11,232) | -0.01% |
| test bench_iter_ones_all_ones | 791,916 ns/iter | (+/- 11,288) | 709,237 ns/iter | (+/- 1,731) | 10.44% |
| test bench_iter_ones_all_zeros | 9,804 ns/iter | (+/- 237) | 4,905 ns/iter | (+/- 57) | 49.97% |
| test bench_iter_ones_using_contains_all_ones | 525,739 ns/iter | (+/- 7,121) | 542,470 ns/iter | (+/- 6,671) | -3.18% |
| test bench_iter_ones_using_contains_all_zeros | 524,045 ns/iter | (+/- 10,135) | 542,270 ns/iter | (+/- 3,156) | -3.48% |
| test bench_iter_ones_using_slice_directly_all_ones | 368,727 ns/iter | (+/- 6,322) | 437,931 ns/iter | (+/- 2,819) | -18.77% |
| test bench_iter_ones_using_slice_directly_all_zero | 9,798 ns/iter | (+/- 144) | 4,906 ns/iter | (+/- 36) | 49.93% |
I've been looking at the generated assembly, and the differences are exactly what I would expect: the usize version (left column below) loads qwords instead of dwords, increments the counter by 8 instead of 4, and shifts the indexes by different amounts.
```asm
example::iter_ones_using_contains: | example::iter_ones_using_contains:
push rbx | push rbx
mov r9, qword ptr [rdi + 24] | mov r9, qword ptr [rdi + 24]
test r9, r9 | test r9, r9
je .LBB31_1 | je .LBB32_1
mov r8, qword ptr [rdi] | mov r8, qword ptr [rdi]
mov r11, qword ptr [rdi + 16] | mov r11, qword ptr [rdi + 16]
cmp r9, 1 | cmp r9, 1
jne .LBB31_8 | jne .LBB32_8
xor eax, eax | xor eax, eax
xor edx, edx | xor edx, edx
.LBB31_4: | .LBB32_4:
test r9b, 1 | test r9b, 1
je .LBB31_7 | je .LBB32_7
mov rsi, rdx | mov rsi, rdx
shr rsi, 6 | shr rsi, 5
cmp rsi, r11 | cmp rsi, r11
jae .LBB31_7 | jae .LBB32_7
mov edi, 1 | mov edi, 1
mov ecx, edx | mov ecx, edx
shl rdi, cl | shl edi, cl
and rdi, qword ptr [r8 + 8*rsi] | and edi, dword ptr [r8 + 4*rsi]
cmp rdi, 1 | cmp edi, 1
sbb rax, -1 | sbb rax, -1
.LBB31_7: | .LBB32_7:
pop rbx | pop rbx
ret | ret
.LBB31_1: | .LBB32_1:
xor eax, eax | xor eax, eax
pop rbx | pop rbx
ret | ret
.LBB31_8: | .LBB32_8:
mov r10, r9 | mov r10, r9
and r10, -2 | and r10, -2
xor eax, eax | xor eax, eax
xor esi, esi | xor esi, esi
jmp .LBB31_9 | jmp .LBB32_9
.LBB31_13: | .LBB32_13:
mov rsi, rdx | mov rsi, rdx
cmp r10, rdx | cmp r10, rdx
je .LBB31_4 | je .LBB32_4
.LBB31_9: | .LBB32_9:
mov rdi, rsi | mov rdi, rsi
shr rdi, 6 | shr rdi, 5
cmp rdi, r11 | cmp rdi, r11
jae .LBB31_11 | jae .LBB32_11
mov ecx, esi | mov ecx, esi
and cl, 62 | and cl, 30
mov edx, 1 | mov edx, 1
shl rdx, cl | shl edx, cl
and rdx, qword ptr [r8 + 8*rdi] | and edx, dword ptr [r8 + 4*rdi]
cmp rdx, 1 | cmp edx, 1
sbb rax, -1 | sbb rax, -1
.LBB31_11: | .LBB32_11:
lea rdx, [rsi + 2] | lea rdx, [rsi + 2]
cmp rdi, r11 | cmp rdi, r11
jae .LBB31_13 | jae .LBB32_13
and sil, 62 | and sil, 30
or sil, 1 | or sil, 1
mov ebx, 1 | mov ebx, 1
mov ecx, esi | mov ecx, esi
shl rbx, cl | shl ebx, cl
and rbx, qword ptr [r8 + 8*rdi] | and ebx, dword ptr [r8 + 4*rdi]
cmp rbx, 1 | cmp ebx, 1
sbb rax, -1 | sbb rax, -1
jmp .LBB31_13 | jmp .LBB32_13
```
Thanks for the deep dive into this.
What's the policy on this crate for the use of `unsafe`? We can convert a `&[usize]` into an equivalent `&[u32]` and then use that to load the target `u32`. This would retain the gains for the batch/set operations without causing a regression in `contains`, but it would necessitate more `unsafe`.
I've also, in the meantime, experimented with explicitly vectorized operations, using `__m128i`/`__m256i` instead of `usize` on targets that support them, and we can net a 2-4x speedup over the existing gains on those targets. I won't add it to this PR, but that too would add a significant amount of `unsafe` to the crate, so I wanted to ask for an opinion ahead of time (see the sketch below).
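For reference, a minimal sketch of the shape such an explicitly vectorized merge could take; this assumes an x86_64 target with SSE2 and is illustrative only, not the exact code from that experiment:

```rust
// Illustrative only: block-level union with SSE2 intrinsics, 128 bits per step.
// Assumes an x86_64 target; not the exact code from the experiment described above.
#[cfg(target_arch = "x86_64")]
unsafe fn union_simd(dst: &mut [u64], src: &[u64]) {
    use core::arch::x86_64::*;
    let n = dst.len().min(src.len());
    let mut i = 0;
    // Two 64-bit blocks (one __m128i) per iteration, with unaligned loads/stores.
    while i + 2 <= n {
        let a = _mm_loadu_si128(dst.as_ptr().add(i) as *const __m128i);
        let b = _mm_loadu_si128(src.as_ptr().add(i) as *const __m128i);
        _mm_storeu_si128(dst.as_mut_ptr().add(i) as *mut __m128i, _mm_or_si128(a, b));
        i += 2;
    }
    // Scalar tail for an odd number of blocks.
    while i < n {
        dst[i] |= src[i];
        i += 1;
    }
}
```

`__m256i` would double the width on AVX2 targets; the tail handling and runtime feature detection are where most of the extra `unsafe` surface comes from.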
I think we should use `unsafe` here if it is faster, because the scope is narrow and the correctness is easily verifiable. I have no objection to `unsafe` if there aren't alternatives, so unsafe SIMD is fine with me.
I took a stab at this to see if it makes a difference (this is the body of `contains`):

```rust
// Reinterpret the usize-backed storage as u32 sub-blocks so the shift/index
// math matches the old u32 layout. Note: this sub-block ordering within a
// block assumes a little-endian layout.
type SubBlock = u32;
const SUB_BLOCK_BITS: usize = std::mem::size_of::<SubBlock>() * 8;
const SUB_BLOCKS_PER_BLOCK: usize = BITS / SUB_BLOCK_BITS;

// Locate the sub-block holding `bit`, and the bit offset within it.
let block_ix = bit / BITS;
let block_offset = bit % BITS;
let sub_block_ix = block_ix * SUB_BLOCKS_PER_BLOCK + block_offset / SUB_BLOCK_BITS;
let offset = bit % SUB_BLOCK_BITS;

let blocks: &[usize] = &self.data;
let sublocks: &[SubBlock] = unsafe {
    let p: *const SubBlock = std::mem::transmute(blocks.as_ptr());
    std::slice::from_raw_parts(p, blocks.len() * SUB_BLOCKS_PER_BLOCK)
};
match sublocks.get(sub_block_ix) {
    None => false,
    Some(b) => (b & (1 << offset)) != 0,
}
```
The additional logic to compute the indexes only adds overhead :(. However, in my latest round of benchmarks, `contains` no longer shows a regression (I think I upgraded nightly, so maybe something changed there).
| benchmark | master | dev | usize | % change | u64 (w/ sub-blocks) | % change | u32 | % change | u16 | % change | u8 | % change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| test bench_insert_range | 2,469 | (+/- 293) | 2,859 | -15.80% | 2,766 | -12.03% | 1,978 | 19.89% | 2,805 | -13.61% | 2,764 | 0.07% |
| test bench_insert_range_using_loop | 2,202,318 | (+/- 32,273) | 2,203,539 | -0.06% | 2,206,400 | -0.19% | 2,203,542 | -0.06% | 2,202,270 | 0.00% | 2,205,833 | 0.03% |
| test bench_iter_ones_all_ones | 791,858 | (+/- 3,819) | 710,383 | 10.29% | 709,991 | 10.34% | 710,495 | 10.27% | 709,458 | 10.41% | 710,395 | -0.06% |
| test bench_iter_ones_all_zeros | 9,782 | (+/- 41) | 4,910 | 49.81% | 4,905 | 49.86% | 4,908 | 49.83% | 4,903 | 49.88% | 4,901 | 0.08% |
| test bench_iter_ones_using_contains_all_ones | 597,916 | (+/- 20,551) | 598,262 | -0.06% | 597,900 | 0.00% | 599,137 | -0.20% | 666,197 | -11.42% | 665,485 | -11.30% |
| test bench_iter_ones_using_contains_all_zeros | 597,850 | (+/- 6,897) | 598,477 | -0.10% | 597,916 | -0.01% | 599,802 | -0.33% | 666,289 | -11.45% | 666,522 | -11.47% |
| test bench_iter_ones_using_slice_directly_all_ones | 367,193 | (+/- 8,208) | 438,308 | -19.37% | 438,979 | -19.55% | 449,346 | -22.37% | 443,264 | -20.72% | 438,887 | 0.02% |
| test bench_iter_ones_using_slice_directly_all_zero | 9,794 | (+/- 611) | 4,905 | 49.92% | 4,908 | 49.89% | 4,915 | 49.82% | 4,909 | 49.88% | 4,914 | -0.12% |
I think the overhead of computing the additional offsets outweighs any benefits, although I'm sure you could improve my logic for calculating the sub-block offsets:
```asm
usize::FixedBitSet::contains:
mov rax, rsi
shr rax, 5
cmp rax, qword ptr [rdi + 16]
jae .LBB10_1
mov rcx, qword ptr [rdi]
mov eax, dword ptr [rcx + 4*rax]
bt eax, esi
setb al
ret
sublocks::FixedBitSet::contains:
mov rax, rsi
shr rax, 3
mov rcx, qword ptr [rdi + 16]
shl rcx, 3
cmp rax, rcx
jae .LBB9_1
mov rcx, qword ptr [rdi]
movzx eax, byte ptr [rcx + rax]
and esi, 7
bt eax, esi
setb al
ret
```
Given this, I'd say this LGTM.
Ultimately I think we should improve the benchmark coverage, and maybe convert the benchmarks to Criterion for more stable results.
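For example, a minimal sketch of what one ported benchmark could look like under Criterion (the file layout and benchmark names here are hypothetical; assumes `criterion` as a dev-dependency):

```rust
// Hypothetical sketch of a Criterion port (e.g. benches/iter.rs); names are illustrative.
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use fixedbitset::FixedBitSet;

fn bench_iter_ones(c: &mut Criterion) {
    let mut set = FixedBitSet::with_capacity(1_000_000);
    set.insert_range(..); // all ones
    c.bench_function("iter_ones_all_ones", |b| {
        b.iter(|| black_box(&set).ones().count())
    });
}

criterion_group!(benches, bench_iter_ones);
criterion_main!(benches);
```

Criterion's warm-up and outlier analysis should make comparisons like the `contains` and slice-iteration cases above less noisy than the raw libtest numbers.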
LGTM. Please squash the commits, then merge.