
Multiple Optimizations

james7132 opened this pull request 3 years ago • 5 comments

This PR adds the following optimizations and additions:

  • Changed Block to be a type alias for usize instead of u32, which should need roughly half as many instructions for bit sets of the same length on 64-bit machines. It also lets us skip over more bits at a time in sparse bitsets.
  • ~~Changed out the Vec for a SmallVec. Using smallvec's union feature, we can store the first 128 bits (on 64-bit machines) inline without allocations. The size of a FixedBitset remains unchanged.~~
  • Intersection, Union, and Difference should be faster for denser bitsets because they now merge whole blocks instead of checking individual bits (see the sketch after this list).
  • Added a zeroes function that complements the ones iterator.
  • Moved all of the tests to a #[cfg(test)] module so that they're not compiled during non-test builds. Seems to have significantly improved the compile time of the crate.
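
To make the block-level merging concrete, here is a minimal sketch of the idea, not the crate's actual implementation; it assumes Block = usize and that a set simply stores a Vec of blocks. Intersection and difference work the same way, with `&=` and `&= !` per block.

    // Sketch only: block-level union over machine words.
    type Block = usize;

    fn union_with(dst: &mut Vec<Block>, src: &[Block]) {
        // Make sure every source block has a counterpart in the destination.
        if dst.len() < src.len() {
            dst.resize(src.len(), 0);
        }
        // One bitwise OR per block instead of one check per bit.
        for (d, s) in dst.iter_mut().zip(src) {
            *d |= *s;
        }
    }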

~~Unfortunately the second point there requires bumping the MSRV up to 1.51. Not sure if this is an acceptable change, or if I should put it behind a feature flag.~~

This partially addresses #73. Though full SIMD support is probably more desirable here.

james7132 avatar Mar 11 '22 17:03 james7132

Reran the benchmarks. Some insights:

  • Contains is definitely slower due to the extra inline-vs-heap check that smallvec has to do on every access.
  • As expected, there is a nearly 2x speedup on 64-bit machines when iterating over sparser bitsets.
  • Moving the smallvec check outside of the inner loop of insert_range keeps its performance largely the same.
  • The slower times when iterating over dense bitsets are unexpected and need more investigation.
 name                                           master ns/iter  usize ns/iter  diff ns/iter   diff %  speedup
 bench_insert_range                             1,217           1,289                    72    5.92%   x 0.94
 bench_insert_range_using_loop                  793,870         1,141,606           347,736   43.80%   x 0.70
 bench_iter_ones_all_ones                       314,546         442,401             127,855   40.65%   x 0.71
 bench_iter_ones_all_zeros                      8,505           4,510                -3,995  -46.97%   x 1.89
 bench_iter_ones_using_contains_all_ones        544,155         543,391                -764   -0.14%   x 1.00
 bench_iter_ones_using_contains_all_zeros       543,840         516,121             -27,719   -5.10%   x 1.05
 bench_iter_ones_using_slice_directly_all_ones  416,640         446,340              29,700    7.13%   x 0.93
 bench_iter_ones_using_slice_directly_all_zero  8,513           4,267                -4,246  -49.88%   x 2.00

james7132 avatar Mar 11 '22 20:03 james7132

Feels like this should be behind a feature flag (?). For some applications, the performance of "contains" may be the main bottleneck.

Calandiel avatar May 07 '22 06:05 Calandiel

@Calandiel I ended up removing the smallvec changes; the overhead is indeed unacceptable. Reran the benchmarks, and this seems to address the insert and all_ones cases.

 name                                           master ns/iter  usize ns/iter  diff ns/iter   diff %  speedup
 bench_insert_range                             1,260           1,212                   -48   -3.81%   x 1.04
 bench_insert_range_using_loop                  794,435         962,799             168,364   21.19%   x 0.83
 bench_iter_ones_all_ones                       314,486         422,858             108,372   34.46%   x 0.74
 bench_iter_ones_all_zeros                      8,503           4,277                -4,226  -49.70%   x 1.99
 bench_iter_ones_using_contains_all_ones        544,015         518,847             -25,168   -4.63%   x 1.05
 bench_iter_ones_using_contains_all_zeros       543,870         518,844             -25,026   -4.60%   x 1.05
 bench_iter_ones_using_slice_directly_all_ones  416,365         448,484              32,119    7.71%   x 0.93
 bench_iter_ones_using_slice_directly_all_zero  8,504           4,280                -4,224  -49.67%   x 1.99

james7132 avatar Jun 08 '22 12:06 james7132

@jrraymond mind taking a look? This is being used as a core component in Bevy's ECS scheduler, and this PR optimizes many of the operations behind it.

james7132 avatar Jun 08 '22 13:06 james7132

Thanks James. I see about the same speedups:

| benchmark                                     | usize ns/iter | deviation | u32 ns/iter | speedup (u32 / usize) |
|-----------------------------------------------|---------------|-----------|-------------|-----------------------|
| bench_insert_range                            | 3,862         | 416       | 3,807       | 0.99                  |
| bench_insert_range_using_loop                 | 2,089,174     | 17,710    | 2,165,208   | 1.04                  |
| bench_iter_ones_all_ones                      | 710,358       | 4,944     | 793,514     | 1.12                  |
| bench_iter_ones_all_zeros                     | 4,911         | 107       | 9,790       | 1.99                  |
| bench_iter_ones_using_contains_all_ones       | 543,766       | 13,625    | 523,399     | 0.96                  |
| bench_iter_ones_using_contains_all_zeros      | 543,237       | 5,127     | 523,860     | 0.96                  |
| bench_iter_ones_using_slice_directly_all_ones | 414,222       | 2,289     | 367,529     | 0.89                  |
| bench_iter_ones_using_slice_directly_all_zero | 4,909         | 99        | 9,797       | 2.00                  |

How do you feel about splitting this up into smaller PRs: 1) u32 -> usize, 2) adding the zeroes iterator, 3) moving the tests to a separate module? As it stands, the PR is pretty huge.

jrraymond avatar Jun 12 '22 14:06 jrraymond

@jrraymond I've updated this PR to only include the usize changes. Please take another look.

james7132 avatar Nov 27 '22 21:11 james7132

Before I merge this, I would like to resolve the performance regressions on the contains benchmarks:

| benchmark                                     | old ns/iter (± dev)    | new ns/iter (± dev)    | pct change |
|-----------------------------------------------|------------------------|------------------------|------------|
| bench_insert_range                            | 3,915 (+/- 322)        | 3,828 (+/- 1,328)      | 2.22%      |
| bench_insert_range_using_loop                 | 2,087,287 (+/- 5,895)  | 2,087,487 (+/- 11,232) | -0.01%     |
| bench_iter_ones_all_ones                      | 791,916 (+/- 11,288)   | 709,237 (+/- 1,731)    | 10.44%     |
| bench_iter_ones_all_zeros                     | 9,804 (+/- 237)        | 4,905 (+/- 57)         | 49.97%     |
| bench_iter_ones_using_contains_all_ones       | 525,739 (+/- 7,121)    | 542,470 (+/- 6,671)    | -3.18%     |
| bench_iter_ones_using_contains_all_zeros      | 524,045 (+/- 10,135)   | 542,270 (+/- 3,156)    | -3.48%     |
| bench_iter_ones_using_slice_directly_all_ones | 368,727 (+/- 6,322)    | 437,931 (+/- 2,819)    | -18.77%    |
| bench_iter_ones_using_slice_directly_all_zero | 9,798 (+/- 144)        | 4,906 (+/- 36)         | 49.93%     |

jrraymond avatar Jan 07 '23 23:01 jrraymond

I've been looking at the generated assembly, and the differences are exactly what I would expect: the usize version loads qwords instead of dwords, increments the counter by 8 instead of 4, and shifts the indexes by different amounts.
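
For reference, the shift amounts fall straight out of the block width; a minimal illustrative sketch (not the crate's code) of contains over 64-bit versus 32-bit blocks:

    // 64-bit blocks: word index = bit >> 6, mask = 1 << (bit & 63).
    fn contains_u64(blocks: &[u64], bit: usize) -> bool {
        blocks.get(bit >> 6).map_or(false, |&b| (b & (1u64 << (bit & 63))) != 0)
    }

    // 32-bit blocks: word index = bit >> 5, mask = 1 << (bit & 31).
    fn contains_u32(blocks: &[u32], bit: usize) -> bool {
        blocks.get(bit >> 5).map_or(false, |&b| (b & (1u32 << (bit & 31))) != 0)
    }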

   1  example::iter_ones_using_contains:                                                                                               |    1 example::iter_ones_using_contains:
    2         push    rbx                                                                                                              |    2         push    rbx
    3         mov     r9, qword ptr [rdi + 24]                                                                                         |    3         mov     r9, qword ptr [rdi + 24]
    4         test    r9, r9                                                                                                           |    4         test    r9, r9
    5         je      .LBB31_1                                                                                                         |    5         je      .LBB32_1
    6         mov     r8, qword ptr [rdi]                                                                                              |    6         mov     r8, qword ptr [rdi]
    7         mov     r11, qword ptr [rdi + 16]                                                                                        |    7         mov     r11, qword ptr [rdi + 16]
    8         cmp     r9, 1                                                                                                            |    8         cmp     r9, 1
    9         jne     .LBB31_8                                                                                                         |    9         jne     .LBB32_8
   10         xor     eax, eax                                                                                                         |   10         xor     eax, eax
   11         xor     edx, edx                                                                                                         |   11         xor     edx, edx
   12 .LBB31_4:                                                                                                                        |   12 .LBB32_4:
   13         test    r9b, 1                                                                                                           |   13         test    r9b, 1
   14         je      .LBB31_7                                                                                                         |   14         je      .LBB32_7
   15         mov     rsi, rdx                                                                                                         |   15         mov     rsi, rdx
   16         shr     rsi, 6                                                                                                           |   16         shr     rsi, 5
   17         cmp     rsi, r11                                                                                                         |   17         cmp     rsi, r11
   18         jae     .LBB31_7                                                                                                         |   18         jae     .LBB32_7
   19         mov     edi, 1                                                                                                           |   19         mov     edi, 1
   20         mov     ecx, edx                                                                                                         |   20         mov     ecx, edx
   21         shl     rdi, cl                                                                                                          |   21         shl     edi, cl
   22         and     rdi, qword ptr [r8 + 8*rsi]                                                                                      |   22         and     edi, dword ptr [r8 + 4*rsi]
   23         cmp     rdi, 1                                                                                                           |   23         cmp     edi, 1
   24         sbb     rax, -1                                                                                                          |   24         sbb     rax, -1
   25 .LBB31_7:                                                                                                                        |   25 .LBB32_7:
   26         pop     rbx                                                                                                              |   26         pop     rbx
   27         ret                                                                                                                      |   27         ret
   28 .LBB31_1:                                                                                                                        |   28 .LBB32_1:
   29         xor     eax, eax                                                                                                         |   29         xor     eax, eax
   30         pop     rbx                                                                                                              |   30         pop     rbx
   31         ret                                                                                                                      |   31         ret
   32 .LBB31_8:                                                                                                                        |   32 .LBB32_8:
   33         mov     r10, r9                                                                                                          |   33         mov     r10, r9
   34         and     r10, -2                                                                                                          |   34         and     r10, -2
   35         xor     eax, eax                                                                                                         |   35         xor     eax, eax
   36         xor     esi, esi                                                                                                         |   36         xor     esi, esi
   37         jmp     .LBB31_9                                                                                                         |   37         jmp     .LBB32_9
   38 .LBB31_13:                                                                                                                       |   38 .LBB32_13:
   39         mov     rsi, rdx                                                                                                         |   39         mov     rsi, rdx
   40         cmp     r10, rdx                                                                                                         |   40         cmp     r10, rdx
   41         je      .LBB31_4                                                                                                         |   41         je      .LBB32_4
   42 .LBB31_9:                                                                                                                        |   42 .LBB32_9:
   43         mov     rdi, rsi                                                                                                         |   43         mov     rdi, rsi
   44         shr     rdi, 6                                                                                                           |   44         shr     rdi, 5
   45         cmp     rdi, r11                                                                                                         |   45         cmp     rdi, r11
   46         jae     .LBB31_11                                                                                                        |   46         jae     .LBB32_11
   47         mov     ecx, esi                                                                                                         |   47         mov     ecx, esi
   48         and     cl, 62                                                                                                           |   48         and     cl, 30
   49         mov     edx, 1                                                                                                           |   49         mov     edx, 1
   50         shl     rdx, cl                                                                                                          |   50         shl     edx, cl
   51         and     rdx, qword ptr [r8 + 8*rdi]                                                                                      |   51         and     edx, dword ptr [r8 + 4*rdi]
   52         cmp     rdx, 1                                                                                                           |   52         cmp     edx, 1
   53         sbb     rax, -1                                                                                                          |   53         sbb     rax, -1
   54 .LBB31_11:                                                                                                                       |   54 .LBB32_11:
   55         lea     rdx, [rsi + 2]                                                                                                   |   55         lea     rdx, [rsi + 2]
   56         cmp     rdi, r11                                                                                                         |   56         cmp     rdi, r11
   57         jae     .LBB31_13                                                                                                        |   57         jae     .LBB32_13
   58         and     sil, 62                                                                                                          |   58         and     sil, 30
   59         or      sil, 1                                                                                                           |   59         or      sil, 1
   60         mov     ebx, 1                                                                                                           |   60         mov     ebx, 1
   61         mov     ecx, esi                                                                                                         |   61         mov     ecx, esi
   62         shl     rbx, cl                                                                                                          |   62         shl     ebx, cl
   63         and     rbx, qword ptr [r8 + 8*rdi]                                                                                      |   63         and     ebx, dword ptr [r8 + 4*rdi]
   64         cmp     rbx, 1                                                                                                           |   64         cmp     ebx, 1
   65         sbb     rax, -1                                                                                                          |   65         sbb     rax, -1
   66         jmp     .LBB31_13                                                                                                        |   66         jmp     .LBB32_13                                                 

jrraymond avatar Jan 08 '23 04:01 jrraymond

Thanks for the deep dive into this.

What's the policy on this crate for the use of unsafe? We can convert a &[usize] into an equivalent &[u32] and then use that to load the target u32. This would retain gains for the batch/set operations without causing a regression in contains, but would necessitate the use of more unsafe.

I've also, in the meantime, experimented with explicitly vectorized operations, using __m128i/__m256i over usize on targets that support it, and we can net a 2-4x speedup over the existing gains on those targets. I won't add it to this PR, but that also would add a significant amount of unsafe to the crate, so I wanted to ask for an opinion ahead of time.
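
Roughly the kind of thing I mean, as a hypothetical sketch only (none of this is in the PR; it assumes AVX2 and equal-length block slices, with `is_x86_feature_detected!("avx2")` gating it at the call site and the plain block loop as the fallback):

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn union_avx2(a: &mut [u64], b: &[u64]) {
        use std::arch::x86_64::*;
        // OR four 64-bit blocks at a time with unaligned 256-bit loads and stores.
        let chunks = a.len().min(b.len()) / 4;
        for i in 0..chunks {
            let pa = a.as_mut_ptr().add(i * 4) as *mut __m256i;
            let pb = b.as_ptr().add(i * 4) as *const __m256i;
            let va = _mm256_loadu_si256(pa as *const __m256i);
            let vb = _mm256_loadu_si256(pb);
            _mm256_storeu_si256(pa, _mm256_or_si256(va, vb));
        }
        // Scalar tail for blocks that do not fill a whole 256-bit chunk.
        for i in chunks * 4..a.len().min(b.len()) {
            a[i] |= b[i];
        }
    }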

james7132 avatar Jan 08 '23 06:01 james7132

I think we should use unsafe here if it is faster, since the scope is narrow and the correctness is easily verifiable. I have no objection to unsafe when there aren't alternatives, so unsafe SIMD is fine with me.

I took a stab at this to see if this makes a difference:

        type SubBlock = u32;
        const SUB_BLOCK_BITS: usize = std::mem::size_of::<SubBlock>() * 8;
        const SUB_BLOCKS_PER_BLOCK: usize = BITS / SUB_BLOCK_BITS;

        // Locate the u32 sub-block holding `bit`, starting from its usize block.
        let block_ix = bit / BITS;
        let block_offset = bit % BITS;
        let sub_block_ix = block_ix * SUB_BLOCKS_PER_BLOCK + block_offset / SUB_BLOCK_BITS;
        let offset = bit % SUB_BLOCK_BITS;

        // Reinterpret the usize blocks as a slice of u32 sub-blocks.
        let blocks: &[usize] = &self.data;
        let sub_blocks: &[SubBlock] = unsafe {
            let p = blocks.as_ptr() as *const SubBlock;
            std::slice::from_raw_parts(p, blocks.len() * SUB_BLOCKS_PER_BLOCK)
        };

        match sub_blocks.get(sub_block_ix) {
            None => false,
            Some(b) => (b & (1 << offset)) != 0,
        }

The additional logic to compute the indexes only adds overhead :(. However, in my latest round of benchmarks, contains doesn't show a regression any more (I think I upgraded nightly so maybe something changed there).

| benchmark                                     | master ns/iter (± var) | usize            | u64 (w/ sub-blocks) | u32              | u16              | u8               |
|-----------------------------------------------|------------------------|------------------|---------------------|------------------|------------------|------------------|
| bench_insert_range                            | 2,469 (+/- 293)        | 2,859 -15.80%    | 2,766 -12.03%       | 1,978 19.89%     | 2,805 -13.61%    | 2,764 0.07%      |
| bench_insert_range_using_loop                 | 2,202,318 (+/- 32,273) | 2,203,539 -0.06% | 2,206,400 -0.19%    | 2,203,542 -0.06% | 2,202,270 0.00%  | 2,205,833 0.03%  |
| bench_iter_ones_all_ones                      | 791,858 (+/- 3,819)    | 710,383 10.29%   | 709,991 10.34%      | 710,495 10.27%   | 709,458 10.41%   | 710,395 -0.06%   |
| bench_iter_ones_all_zeros                     | 9,782 (+/- 41)         | 4,910 49.81%     | 4,905 49.86%        | 4,908 49.83%     | 4,903 49.88%     | 4,901 0.08%      |
| bench_iter_ones_using_contains_all_ones       | 597,916 (+/- 20,551)   | 598,262 -0.06%   | 597,900 0.00%       | 599,137 -0.20%   | 666,197 -11.42%  | 665,485 -11.30%  |
| bench_iter_ones_using_contains_all_zeros      | 597,850 (+/- 6,897)    | 598,477 -0.10%   | 597,916 -0.01%      | 599,802 -0.33%   | 666,289 -11.45%  | 666,522 -11.47%  |
| bench_iter_ones_using_slice_directly_all_ones | 367,193 (+/- 8,208)    | 438,308 -19.37%  | 438,979 -19.55%     | 449,346 -22.37%  | 443,264 -20.72%  | 438,887 0.02%    |
| bench_iter_ones_using_slice_directly_all_zero | 9,794 (+/- 611)        | 4,905 49.92%     | 4,908 49.89%        | 4,915 49.82%     | 4,909 49.88%     | 4,914 -0.12%     |

I think the overhead of computing the additional offsets outweighs any benefits, although I'm sure you could improve my logic for calculating sub block offsets:

usize::FixedBitSet::contains:
        mov     rax, rsi
        shr     rax, 5
        cmp     rax, qword ptr [rdi + 16]
        jae     .LBB10_1
        mov     rcx, qword ptr [rdi]
        mov     eax, dword ptr [rcx + 4*rax]
        bt      eax, esi
        setb    al
        ret
        
sublocks::FixedBitSet::contains:
        mov     rax, rsi
        shr     rax, 3
        mov     rcx, qword ptr [rdi + 16]
        shl     rcx, 3
        cmp     rax, rcx
        jae     .LBB9_1
        mov     rcx, qword ptr [rdi]
        movzx   eax, byte ptr [rcx + rax]
        and     esi, 7
        bt      eax, esi
        setb    al
        ret

Given this, I'd say this LGTM.

jrraymond avatar Jan 08 '23 19:01 jrraymond

Ultimately I think we should improve the benchmark coverage, and maybe convert them to Criterion for more stable results.
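
As a rough illustration of what a Criterion port of one of these benchmarks could look like (a hypothetical sketch, not part of this PR; the benchmark name, sizes, and the benches/ file layout are made up):

    use criterion::{black_box, criterion_group, criterion_main, Criterion};
    use fixedbitset::FixedBitSet;

    fn bench_contains(c: &mut Criterion) {
        // A fully set bitset, mirroring the existing *_all_ones benchmarks.
        let mut set = FixedBitSet::with_capacity(1_000_000);
        set.insert_range(..);
        c.bench_function("iter_ones_using_contains_all_ones", |b| {
            b.iter(|| (0..1_000_000).filter(|&i| set.contains(black_box(i))).count())
        });
    }

    criterion_group!(benches, bench_contains);
    criterion_main!(benches);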

james7132 avatar Jan 09 '23 03:01 james7132

LGTM. Please squash the commits, then merge.

jrraymond avatar Jan 09 '23 12:01 jrraymond