fixedbitset
Multiple Optimizations
This PR adds the following optimizations and additions:
- Changed `Block` to be a typedef of `usize` instead of `u32`, which should need roughly 2x fewer instructions for bit sets of similar length on 64-bit machines, and lets us skip larger empty runs in sparse bitsets (see the sketch below).
- ~~Changed out the `Vec` for a `SmallVec`. Using smallvec's `union` feature, we can store the first 128 bits (on 64-bit machines) inline without allocations. The size of a `FixedBitSet` remains unchanged.~~
- `Intersection`, `Union`, and `Difference` should be faster for denser bitsets by merging at the block level instead of doing individual bit checks.
- Added a `zeroes` function that complements the `ones` iterator.
- Moved all of the tests to a `#[cfg(test)]` module so that they're not compiled during non-test builds. This seems to have significantly improved the compile time of the crate.
~~Unfortunately, the second point there requires bumping the MSRV up to 1.51. Not sure if this is an acceptable change, or if I should put it behind a feature flag.~~
This partially addresses #73, though full SIMD support is probably more desirable here.
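For illustration, here is a minimal sketch of the block-level merging idea; `Block`, `BITS`, and `union_with_blocks` are hypothetical names for this example, not the crate's exact internals:

```rust
// Illustrative sketch only: `Block`, `BITS`, and `union_with_blocks` are
// hypothetical names, not fixedbitset's exact internals.
type Block = usize;
const BITS: usize = std::mem::size_of::<Block>() * 8; // 64 on 64-bit targets, 32 on 32-bit

// Block-level union: one OR per machine word instead of one check per bit.
fn union_with_blocks(dst: &mut [Block], src: &[Block]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d |= *s;
    }
}

fn main() {
    let mut a = vec![0b1010usize, 0];
    let b = vec![0b0101usize, 1 << 3];
    union_with_blocks(&mut a, &b);
    assert_eq!(a, vec![0b1111, 1 << 3]);
    assert_eq!(BITS, usize::BITS as usize);
}
```

Intersection and difference follow the same pattern with `&` and `& !`, and with `usize` blocks a 64-bit machine touches half as many words per merge.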
Reran the benchmarks. Some insights:
- `contains` is definitely slower due to the extra inline (on-the-stack) vs. heap-allocated check that smallvec has to do.
- As expected a nearly 2x speedup on 64-bit machines when iterating over more sparse bitsets.
- Moving the smallvec check outside of the inner loop of `insert_range` keeps its performance largely the same.
- The slower time when iterating over dense bitsets is unexpected and needs more investigation.
| name | master ns/iter | usize ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| bench_insert_range | 1,217 | 1,289 | 72 | 5.92% | x 0.94 |
| bench_insert_range_using_loop | 793,870 | 1,141,606 | 347,736 | 43.80% | x 0.70 |
| bench_iter_ones_all_ones | 314,546 | 442,401 | 127,855 | 40.65% | x 0.71 |
| bench_iter_ones_all_zeros | 8,505 | 4,510 | -3,995 | -46.97% | x 1.89 |
| bench_iter_ones_using_contains_all_ones | 544,155 | 543,391 | -764 | -0.14% | x 1.00 |
| bench_iter_ones_using_contains_all_zeros | 543,840 | 516,121 | -27,719 | -5.10% | x 1.05 |
| bench_iter_ones_using_slice_directly_all_ones | 416,640 | 446,340 | 29,700 | 7.13% | x 0.93 |
| bench_iter_ones_using_slice_directly_all_zero | 8,513 | 4,267 | -4,246 | -49.88% | x 2.00 |
Feels like this should be behind a feature flag(?). For some applications, the performance of `contains` may be the main bottleneck.
@Calandiel I ended up removing the smallvec changes; it's indeed an unacceptable overhead. Reran the benchmarks, which seems to address the insert and all_ones cases.
| name | master ns/iter | usize ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| bench_insert_range | 1,260 | 1,212 | -48 | -3.81% | x 1.04 |
| bench_insert_range_using_loop | 794,435 | 962,799 | 168,364 | 21.19% | x 0.83 |
| bench_iter_ones_all_ones | 314,486 | 422,858 | 108,372 | 34.46% | x 0.74 |
| bench_iter_ones_all_zeros | 8,503 | 4,277 | -4,226 | -49.70% | x 1.99 |
| bench_iter_ones_using_contains_all_ones | 544,015 | 518,847 | -25,168 | -4.63% | x 1.05 |
| bench_iter_ones_using_contains_all_zeros | 543,870 | 518,844 | -25,026 | -4.60% | x 1.05 |
| bench_iter_ones_using_slice_directly_all_ones | 416,365 | 448,484 | 32,119 | 7.71% | x 0.93 |
| bench_iter_ones_using_slice_directly_all_zero | 8,504 | 4,280 | -4,224 | -49.67% | x 1.99 |
@jrraymond mind taking a look? fixedbitset is used as a core component in Bevy's ECS scheduler, and this PR optimizes many of the operations behind it.
Thanks James. I see about the same speedups:
| benchmark | usize ns/iter | deviation | u32 ns/iter | speedup |
|---|---|---|---|---|
| bench_insert_range | 3,862 | 416 | 3,807 | 0.99 |
| bench_insert_range_using_loop | 2,089,174 | 17,710 | 2,165,208 | 1.04 |
| bench_iter_ones_all_ones | 710,358 | 4,944 | 793,514 | 1.12 |
| bench_iter_ones_all_zeros | 4,911 | 107 | 9,790 | 1.99 |
| bench_iter_ones_using_contains_all_ones | 543,766 | 13,625 | 523,399 | 0.96 |
| bench_iter_ones_using_contains_all_zeros | 543,237 | 5,127 | 523,860 | 0.96 |
| bench_iter_ones_using_slice_directly_all_ones | 414,222 | 2,289 | 367,529 | 0.89 |
| bench_iter_ones_using_slice_directly_all_zero | 4,909 | 99 | 9,797 | 2.00 |
How do you feel about splitting this up into smaller PRs for 1) u32 -> usize, 2) adding the zeroes iterator, and 3) moving the tests to a separate module? As-is, the PR is pretty huge.
@jrraymond I've updated this PR to only include the usize changes. Please take another look.
Before I merge this, I would like to resolve the performance regressions on the contains benchmarks:
| benchmark | old | dev | new | dev | pct change |
|---|---|---|---|---|---|
| test bench_insert_range | 3,915 ns/iter | (+/- 322) | 3,828 ns/iter | (+/- 1,328) | 2.22% |
| test bench_insert_range_using_loop | 2,087,287 ns/iter | (+/- 5,895) | 2,087,487 ns/iter | (+/- 11,232) | -0.01% |
| test bench_iter_ones_all_ones | 791,916 ns/iter | (+/- 11,288) | 709,237 ns/iter | (+/- 1,731) | 10.44% |
| test bench_iter_ones_all_zeros | 9,804 ns/iter | (+/- 237) | 4,905 ns/iter | (+/- 57) | 49.97% |
| test bench_iter_ones_using_contains_all_ones | 525,739 ns/iter | (+/- 7,121) | 542,470 ns/iter | (+/- 6,671) | -3.18% |
| test bench_iter_ones_using_contains_all_zeros | 524,045 ns/iter | (+/- 10,135) | 542,270 ns/iter | (+/- 3,156) | -3.48% |
| test bench_iter_ones_using_slice_directly_all_ones | 368,727 ns/iter | (+/- 6,322) | 437,931 ns/iter | (+/- 2,819) | -18.77% |
| test bench_iter_ones_using_slice_directly_all_zero | 9,798 ns/iter | (+/- 144) | 4,906 ns/iter | (+/- 36) | 49.93% |
I've been looking at the generated assembly, and the differences are exactly what I would expect: the usize version (left column below) loads qwords instead of dwords, increments the counter by 8 instead of 4, and shifts the indexes by different amounts.
```asm
example::iter_ones_using_contains: | example::iter_ones_using_contains:
push rbx | push rbx
mov r9, qword ptr [rdi + 24] | mov r9, qword ptr [rdi + 24]
test r9, r9 | test r9, r9
je .LBB31_1 | je .LBB32_1
mov r8, qword ptr [rdi] | mov r8, qword ptr [rdi]
mov r11, qword ptr [rdi + 16] | mov r11, qword ptr [rdi + 16]
cmp r9, 1 | cmp r9, 1
jne .LBB31_8 | jne .LBB32_8
xor eax, eax | xor eax, eax
xor edx, edx | xor edx, edx
.LBB31_4: | .LBB32_4:
test r9b, 1 | test r9b, 1
je .LBB31_7 | je .LBB32_7
mov rsi, rdx | mov rsi, rdx
shr rsi, 6 | shr rsi, 5
cmp rsi, r11 | cmp rsi, r11
jae .LBB31_7 | jae .LBB32_7
mov edi, 1 | mov edi, 1
mov ecx, edx | mov ecx, edx
shl rdi, cl | shl edi, cl
and rdi, qword ptr [r8 + 8*rsi] | and edi, dword ptr [r8 + 4*rsi]
cmp rdi, 1 | cmp edi, 1
sbb rax, -1 | sbb rax, -1
.LBB31_7: | .LBB32_7:
pop rbx | pop rbx
ret | ret
.LBB31_1: | .LBB32_1:
xor eax, eax | xor eax, eax
pop rbx | pop rbx
ret | ret
.LBB31_8: | .LBB32_8:
mov r10, r9 | mov r10, r9
and r10, -2 | and r10, -2
xor eax, eax | xor eax, eax
xor esi, esi | xor esi, esi
jmp .LBB31_9 | jmp .LBB32_9
.LBB31_13: | .LBB32_13:
mov rsi, rdx | mov rsi, rdx
cmp r10, rdx | cmp r10, rdx
je .LBB31_4 | je .LBB32_4
.LBB31_9: | .LBB32_9:
mov rdi, rsi | mov rdi, rsi
shr rdi, 6 | shr rdi, 5
cmp rdi, r11 | cmp rdi, r11
jae .LBB31_11 | jae .LBB32_11
mov ecx, esi | mov ecx, esi
and cl, 62 | and cl, 30
mov edx, 1 | mov edx, 1
shl rdx, cl | shl edx, cl
and rdx, qword ptr [r8 + 8*rdi] | and edx, dword ptr [r8 + 4*rdi]
cmp rdx, 1 | cmp edx, 1
sbb rax, -1 | sbb rax, -1
.LBB31_11: | .LBB32_11:
lea rdx, [rsi + 2] | lea rdx, [rsi + 2]
cmp rdi, r11 | cmp rdi, r11
jae .LBB31_13 | jae .LBB32_13
and sil, 62 | and sil, 30
or sil, 1 | or sil, 1
mov ebx, 1 | mov ebx, 1
mov ecx, esi | mov ecx, esi
shl rbx, cl | shl ebx, cl
and rbx, qword ptr [r8 + 8*rdi] | and ebx, dword ptr [r8 + 4*rdi]
cmp rbx, 1 | cmp ebx, 1
sbb rax, -1 | sbb rax, -1
jmp .LBB31_13 | jmp .LBB32_13
```
Thanks for the deep dive into this.
What's the policy on this crate for the use of `unsafe`? We can convert a `&[usize]` into an equivalent `&[u32]` and then use that to load the target `u32`. This would retain the gains for the batch/set operations without causing a regression in `contains`, but it would necessitate more `unsafe`.
I've also, in the meantime, experimented with explicitly vectorized operations, using `__m128i`/`__m256i` instead of `usize` on targets that support them, and we can net a 2-4x speedup over the existing gains on those targets. I won't add it to this PR, but that too would add a significant amount of `unsafe` to the crate, so I wanted to ask for an opinion ahead of time (see the sketch below).
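For reference, a minimal sketch of the shape such an explicitly vectorized merge could take; this assumes an x86_64 target with SSE2 and is illustrative only, not the exact code from that experiment:

```rust
// Illustrative only: block-level union with SSE2 intrinsics, 128 bits per step.
// Assumes an x86_64 target; not the exact code from the experiment described above.
#[cfg(target_arch = "x86_64")]
unsafe fn union_simd(dst: &mut [u64], src: &[u64]) {
    use core::arch::x86_64::*;
    let n = dst.len().min(src.len());
    let mut i = 0;
    // Two 64-bit blocks (one __m128i) per iteration, with unaligned loads/stores.
    while i + 2 <= n {
        let a = _mm_loadu_si128(dst.as_ptr().add(i) as *const __m128i);
        let b = _mm_loadu_si128(src.as_ptr().add(i) as *const __m128i);
        _mm_storeu_si128(dst.as_mut_ptr().add(i) as *mut __m128i, _mm_or_si128(a, b));
        i += 2;
    }
    // Scalar tail for an odd number of blocks.
    while i < n {
        dst[i] |= src[i];
        i += 1;
    }
}
```

`__m256i` would double the width on AVX2 targets; the tail handling and runtime feature detection are where most of the extra `unsafe` surface comes from.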
I think we should use `unsafe` here if it is faster, because the scope is narrow and the correctness is easily verifiable. I have no objection to `unsafe` if there aren't alternatives, so unsafe SIMD is fine with me.
I took a stab at this to see if it makes a difference (this is the body of `contains`):

```rust
// Reinterpret the usize-backed storage as u32 sub-blocks so the shift/index
// math matches the old u32 layout. Note: this sub-block ordering within a
// block assumes a little-endian layout.
type SubBlock = u32;
const SUB_BLOCK_BITS: usize = std::mem::size_of::<SubBlock>() * 8;
const SUB_BLOCKS_PER_BLOCK: usize = BITS / SUB_BLOCK_BITS;

// Locate the sub-block holding `bit`, and the bit offset within it.
let block_ix = bit / BITS;
let block_offset = bit % BITS;
let sub_block_ix = block_ix * SUB_BLOCKS_PER_BLOCK + block_offset / SUB_BLOCK_BITS;
let offset = bit % SUB_BLOCK_BITS;

let blocks: &[usize] = &self.data;
let sublocks: &[SubBlock] = unsafe {
    let p: *const SubBlock = std::mem::transmute(blocks.as_ptr());
    std::slice::from_raw_parts(p, blocks.len() * SUB_BLOCKS_PER_BLOCK)
};
match sublocks.get(sub_block_ix) {
    None => false,
    Some(b) => (b & (1 << offset)) != 0,
}
```
The additional logic to compute the indexes only adds overhead :(. However, in my latest round of benchmarks, `contains` no longer shows a regression (I think I upgraded nightly, so maybe something changed there).
| benchmark | master | dev | usize | % change | u64 (w/ sub-blocks) | % change | u32 | % change | u16 | % change | u8 | % change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| test bench_insert_range | 2,469 | (+/- 293) | 2,859 | -15.80% | 2,766 | -12.03% | 1,978 | 19.89% | 2,805 | -13.61% | 2,764 | 0.07% |
| test bench_insert_range_using_loop | 2,202,318 | (+/- 32,273) | 2,203,539 | -0.06% | 2,206,400 | -0.19% | 2,203,542 | -0.06% | 2,202,270 | 0.00% | 2,205,833 | 0.03% |
| test bench_iter_ones_all_ones | 791,858 | (+/- 3,819) | 710,383 | 10.29% | 709,991 | 10.34% | 710,495 | 10.27% | 709,458 | 10.41% | 710,395 | -0.06% |
| test bench_iter_ones_all_zeros | 9,782 | (+/- 41) | 4,910 | 49.81% | 4,905 | 49.86% | 4,908 | 49.83% | 4,903 | 49.88% | 4,901 | 0.08% |
| test bench_iter_ones_using_contains_all_ones | 597,916 | (+/- 20,551) | 598,262 | -0.06% | 597,900 | 0.00% | 599,137 | -0.20% | 666,197 | -11.42% | 665,485 | -11.30% |
| test bench_iter_ones_using_contains_all_zeros | 597,850 | (+/- 6,897) | 598,477 | -0.10% | 597,916 | -0.01% | 599,802 | -0.33% | 666,289 | -11.45% | 666,522 | -11.47% |
| test bench_iter_ones_using_slice_directly_all_ones | 367,193 | (+/- 8,208) | 438,308 | -19.37% | 438,979 | -19.55% | 449,346 | -22.37% | 443,264 | -20.72% | 438,887 | 0.02% |
| test bench_iter_ones_using_slice_directly_all_zero | 9,794 | (+/- 611) | 4,905 | 49.92% | 4,908 | 49.89% | 4,915 | 49.82% | 4,909 | 49.88% | 4,914 | -0.12% |
I think the overhead of computing the additional offsets outweighs any benefits, although I'm sure you could improve my logic for calculating the sub-block offsets:
```asm
usize::FixedBitSet::contains:
mov rax, rsi
shr rax, 5
cmp rax, qword ptr [rdi + 16]
jae .LBB10_1
mov rcx, qword ptr [rdi]
mov eax, dword ptr [rcx + 4*rax]
bt eax, esi
setb al
ret
sublocks::FixedBitSet::contains:
mov rax, rsi
shr rax, 3
mov rcx, qword ptr [rdi + 16]
shl rcx, 3
cmp rax, rcx
jae .LBB9_1
mov rcx, qword ptr [rdi]
movzx eax, byte ptr [rcx + rax]
and esi, 7
bt eax, esi
setb al
ret
```
Given this, I'd say this LGTM.
Ultimately I think we should improve the benchmark coverage, and maybe convert the benchmarks to Criterion for more stable results.
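For example, a minimal sketch of what one ported benchmark could look like under Criterion (the file layout and benchmark names here are hypothetical; assumes `criterion` as a dev-dependency):

```rust
// Hypothetical sketch of a Criterion port (e.g. benches/iter.rs); names are illustrative.
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use fixedbitset::FixedBitSet;

fn bench_iter_ones(c: &mut Criterion) {
    let mut set = FixedBitSet::with_capacity(1_000_000);
    set.insert_range(..); // all ones
    c.bench_function("iter_ones_all_ones", |b| {
        b.iter(|| black_box(&set).ones().count())
    });
}

criterion_group!(benches, bench_iter_ones);
criterion_main!(benches);
```

Criterion's warm-up and outlier analysis should make comparisons like the `contains` and slice-iteration cases above less noisy than the raw libtest numbers.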
LGTM. Please squash the commits, then merge.