bitpacking

bench: delta benchmark doesn't use {de,}compress_sorted

Open sirupsen opened this issue 1 year ago • 8 comments

Thanks for creating this library!

I was banging my head against the wall trying to reproduce the decompression numbers for delta encoding from the README. 😅 It wasn't until I disassembled the benchmark that I understood why. It turns out decompress_sorted is slower than what is currently reported (but more space efficient!), because the benchmarks accidentally do not call the sorted variant.
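
To make the "slower but more space efficient" trade-off concrete, here is a minimal scalar sketch of delta encoding for a sorted block. It is not the crate's implementation: the helper names `num_bits`, `delta_encode` and `delta_decode` are hypothetical and only illustrate the idea that gaps between sorted values need fewer bits than the values themselves, at the cost of a running sum on decode.

```rust
/// Number of bits needed to represent the largest value in `values`.
/// (Hypothetical helper for illustration, not the crate's API.)
fn num_bits(values: &[u32]) -> u8 {
    let max = values.iter().copied().max().unwrap_or(0);
    (32 - max.leading_zeros()) as u8
}

/// Replace a sorted slice by the gaps between consecutive values.
/// `initial` is the value that precedes the block.
fn delta_encode(initial: u32, sorted: &[u32]) -> Vec<u32> {
    let mut prev = initial;
    sorted
        .iter()
        .map(|&v| {
            let delta = v.wrapping_sub(prev);
            prev = v;
            delta
        })
        .collect()
}

/// Inverse of `delta_encode`: a running (prefix) sum over the gaps.
/// Each output depends on the previous one, which is the extra work
/// a sorted/delta decoder has to do compared to plain unpacking.
fn delta_decode(initial: u32, deltas: &[u32]) -> Vec<u32> {
    let mut acc = initial;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}
```

For example, the 128 values 1000, 1003, ..., 1381 need 11 bits each when packed directly, but with `initial = 997` every gap is 3 and fits in 2 bits.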

I didn't update the README since my x86 test box is an Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz, so all the numbers would change. I'd be happy to regenerate all the numbers on that box and update the README if you'd like!

Out of curiosity, I grepped in tantivy to see where this library was used. I guess it's not used anymore? Would love to know why! Cheers.

@fulmicoton

sirupsen avatar Aug 14 '22 01:08 sirupsen

@sirupsen Only the 4x bitpacker is used: https://github.com/quickwit-oss/tantivy/search?q=BitPacker4x

PSeitz avatar Aug 14 '22 07:08 PSeitz

@PSeitz 👍🏻 Perhaps then there's room for improvements? The delta variant is 5x slower on my machine for 4x Bitpacker 😱

Only ~2B ints per second on my machine versus 10B ints per second for non-sorted!

BitPacker4x/decompress-2
                        thrpt:  [9.8596 Gelem/s 9.8647 Gelem/s 9.8685 Gelem/s]

BitPacker4x/decompress-delta-2
                        thrpt:  [1.9737 Gelem/s 1.9753 Gelem/s 1.9765 Gelem/s]

sirupsen avatar Aug 14 '22 15:08 sirupsen

The optimized decompression path with delta requires the SSE3 instruction set (it uses _mm_lddqu_si128). On Xeon, it was added quite late. I suspect the problem you are experiencing is very specific to your CPU.

fulmicoton avatar Aug 19 '22 09:08 fulmicoton

On my CPU (Ryzen 7 4750U) I get the following

Benchmarking BitPacker4x/decompress-2: Collecting 100 samples in estimated 5.0007 s
BitPacker4x/decompress-2
                        time:   [156.70 ns 157.15 ns 157.70 ns]
                        thrpt:  [8.1168 Gelem/s 8.1452 Gelem/s 8.1686 Gelem/s]
                 change:
                        time:   [+0.6344% +1.0076% +1.5596%] (p = 0.00 < 0.05)
                        thrpt:  [-1.5357% -0.9975% -0.6304%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

Benchmarking BitPacker4x/decompress-delta-2: Collecting 100 samples in estimated 5.0003 s
BitPacker4x/decompress-delta-2
                        time:   [190.95 ns 191.26 ns 191.70 ns]
                        thrpt:  [6.6771 Gelem/s 6.6923 Gelem/s 6.7034 Gelem/s]
                 change:
                        time:   [+1.8717% +2.0810% +2.2612%] (p = 0.00 < 0.05)
                        thrpt:  [-2.2112% -2.0386% -1.8373%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
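
As a quick sanity check on the numbers in this thread, the following sketch computes the delta slowdown on both machines and the number of elements each benchmark iteration processes. The constants are the median values copied from the Criterion output above; the function names are made up for this example.

```rust
// Median throughputs in Gelem/s, copied from the Criterion output in this thread.
const XEON_PLAIN: f64 = 9.8647; // Xeon E-2288G, decompress-2
const XEON_DELTA: f64 = 1.9753; // Xeon E-2288G, decompress-delta-2
const RYZEN_PLAIN: f64 = 8.1452; // Ryzen 7 4750U, decompress-2
const RYZEN_DELTA: f64 = 6.6923; // Ryzen 7 4750U, decompress-delta-2

/// How many times slower the delta path is than the plain path.
fn slowdown(plain_gelem_s: f64, delta_gelem_s: f64) -> f64 {
    plain_gelem_s / delta_gelem_s
}

/// Elements processed per iteration, from the time per iteration (ns)
/// and the throughput (Gelem/s): ns * 1e-9 s * 1e9 elem/s = elements.
fn elems_per_iter(time_ns: f64, gelem_per_s: f64) -> f64 {
    time_ns * gelem_per_s
}
```

With these inputs, `slowdown(XEON_PLAIN, XEON_DELTA)` is roughly 5.0 while `slowdown(RYZEN_PLAIN, RYZEN_DELTA)` is roughly 1.2, and `elems_per_iter(157.15, 8.1452)` comes out at about 1280 elements per iteration.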

fulmicoton avatar Aug 19 '22 09:08 fulmicoton

If you are on Linux, can you run the following command? lscpu | grep pni

(SSE3 is weirdly named pni on Linux)
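
The same check can be done from Rust with the standard library's runtime feature detection, which also works outside Linux. This is a minimal sketch: `simd_features` is a hypothetical helper name, and the list of features is just a selection relevant to this thread.

```rust
/// Report whether the running CPU supports a few SIMD feature sets.
/// (Hypothetical helper; only the `is_x86_feature_detected!` macro is std.)
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn simd_features() -> Vec<(&'static str, bool)> {
    vec![
        ("sse2", is_x86_feature_detected!("sse2")),
        // SSE3 is the feature that lscpu lists as `pni`.
        ("sse3", is_x86_feature_detected!("sse3")),
        ("ssse3", is_x86_feature_detected!("ssse3")),
        ("avx2", is_x86_feature_detected!("avx2")),
    ]
}

/// On other architectures there is nothing to report.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn simd_features() -> Vec<(&'static str, bool)> {
    Vec::new()
}

/// Print one line per feature, e.g. from a small diagnostic binary.
fn print_simd_features() {
    for (name, supported) in simd_features() {
        println!("{name}: {supported}");
    }
}
```

Calling `print_simd_features()` at the start of a benchmark run is one way to record which features were available on the machine that produced the numbers.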

fulmicoton avatar Aug 19 '22 12:08 fulmicoton

It's there 🤔

napkin:bitpacking $ lscpu | grep pni
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

sirupsen avatar Aug 19 '22 12:08 sirupsen

It might be interesting to investigate:

  • can you double-check that the code path is the one using SIMD?

If it is not the case, it could be a problem in the rustc compiler (!?). If it is the case, it could be this CPU being very inefficient on some specific instruction.
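
For readers who want a feel for what a SIMD delta path has to do on top of plain unpacking, here is a sketch of a running sum over four 32-bit lanes using SSE2 intrinsics, next to a scalar reference. This is one common way to write the integration step and is an assumption for illustration; it is not taken from the crate's source, and the function names are hypothetical.

```rust
/// Scalar reference: running sum starting from `initial`.
fn prefix_sum_scalar(initial: u32, deltas: &[u32]) -> Vec<u32> {
    let mut acc = initial;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

/// SSE2 running sum, four lanes at a time (illustrative sketch only).
#[cfg(target_arch = "x86_64")]
fn prefix_sum_sse2(initial: u32, deltas: &[u32]) -> Vec<u32> {
    use std::arch::x86_64::*;
    assert!(deltas.len() % 4 == 0, "sketch handles whole 4-lane groups only");
    let mut out = vec![0u32; deltas.len()];
    // SAFETY: SSE2 is part of the x86_64 baseline, and each unaligned
    // load/store touches exactly one 4-element chunk of its slice.
    unsafe {
        let mut carry = _mm_set1_epi32(initial as i32);
        for (src, dst) in deltas.chunks_exact(4).zip(out.chunks_exact_mut(4)) {
            let mut v = _mm_loadu_si128(src.as_ptr() as *const __m128i);
            // [a, b, c, d] -> [a, a+b, b+c, c+d]
            v = _mm_add_epi32(v, _mm_slli_si128::<4>(v));
            // -> [a, a+b, a+b+c, a+b+c+d]
            v = _mm_add_epi32(v, _mm_slli_si128::<8>(v));
            // Add the last value of the previous group to every lane.
            v = _mm_add_epi32(v, carry);
            _mm_storeu_si128(dst.as_mut_ptr() as *mut __m128i, v);
            // Broadcast lane 3 so the next group continues the sum.
            carry = _mm_shuffle_epi32::<0xFF>(v);
        }
    }
    out
}

/// Fallback so the sketch still compiles on non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
fn prefix_sum_sse2(initial: u32, deltas: &[u32]) -> Vec<u32> {
    prefix_sum_scalar(initial, deltas)
}
```

Note the `carry` dependency between consecutive groups: each group must wait for the previous one, so how well a given CPU executes this chain of shifts, adds and shuffles can matter a lot for throughput.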

fulmicoton avatar Aug 19 '22 16:08 fulmicoton

Yep, I verified that with perf(1): [screenshot: CleanShot 2022-08-19 at 17 10 47@2x]

Here is the assembly (sorry, I wasn't able to quickly get it via objdump due to the mangled symbols, so it's a screenshot from perf-report(1)):

[screenshot: CleanShot 2022-08-19 at 17 14 19@2x]

sirupsen avatar Aug 19 '22 21:08 sirupsen