string-interner icon indicating copy to clipboard operation
string-interner copied to clipboard

Faster encoding for lengths in BufferBackend.

Open reinerp opened this issue 2 years ago • 3 comments

Instead of varlen encoding, which requires up to 8 conditional branches in a loop, we use a simpler encoding which is just 4 instructions without branching.

New encoding:

  • length<=254: <1-byte length>
  • otherwise: <0xFF> <usize length> <0xFF>

Improves throughputs by 10-80%, with the biggest improvement in resolve().

get_or_intern/fill-empty/new/BufferBackend
                        time:   [6.4562 ms 6.5068 ms 6.5684 ms]
                        thrpt:  [15.224 Melem/s 15.369 Melem/s 15.489 Melem/s]
                 change:
                        time:   [-8.4289% -7.3761% -6.1844%] (p = 0.00 < 0.05)
                        thrpt:  [+6.5920% +7.9635% +9.2048%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

get_or_intern/fill-empty/with_capacity/BufferBackend
                        time:   [3.8182 ms 3.8467 ms 3.8796 ms]
                        thrpt:  [25.776 Melem/s 25.996 Melem/s 26.190 Melem/s]
                 change:
                        time:   [-18.750% -18.024% -17.246%] (p = 0.00 < 0.05)
                        thrpt:  [+20.840% +21.986% +23.077%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

get_or_intern/already-filled/BufferBackend
                        time:   [2.7638 ms 2.8026 ms 2.8477 ms]
                        thrpt:  [35.116 Melem/s 35.681 Melem/s 36.182 Melem/s]
                 change:
                        time:   [-7.9910% -6.4121% -4.8432%] (p = 0.00 < 0.05)
                        thrpt:  [+5.0897% +6.8514% +8.6851%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

get_or_intern_static/BufferBackend/get_or_intern
                        time:   [2.8467 µs 2.8632 µs 2.8807 µs]
                        thrpt:  [18.051 Melem/s 18.162 Melem/s 18.267 Melem/s]
                 change:
                        time:   [-1.7629% +0.8260% +4.7452%] (p = 0.71 > 0.05)
                        thrpt:  [-4.5302% -0.8193% +1.7945%]
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
get_or_intern_static/BufferBackend/get_or_intern_static
                        time:   [2.8259 µs 2.8456 µs 2.8634 µs]
                        thrpt:  [18.160 Melem/s 18.274 Melem/s 18.401 Melem/s]
                 change:
                        time:   [-6.7386% -3.1051% -0.1965%] (p = 0.06 > 0.05)
                        thrpt:  [+0.1969% +3.2046% +7.2255%]
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

resolve/already-filled/BufferBackend
                        time:   [239.71 µs 242.17 µs 245.19 µs]
                        thrpt:  [407.85 Melem/s 412.93 Melem/s 417.16 Melem/s]
                 change:
                        time:   [-46.443% -45.470% -44.561%] (p = 0.00 < 0.05)
                        thrpt:  [+80.377% +83.386% +86.719%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

get/already-filled/BufferBackend
                        time:   [2.3035 ms 2.3222 ms 2.3427 ms]
                        thrpt:  [42.686 Melem/s 43.062 Melem/s 43.411 Melem/s]
                 change:
                        time:   [-2.6436% -1.3829% -0.0703%] (p = 0.03 < 0.05)
                        thrpt:  [+0.0703% +1.4023% +2.7154%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

iter/already-filled/BufferBackend
                        time:   [278.13 µs 279.55 µs 281.19 µs]
                        thrpt:  [355.64 Melem/s 357.72 Melem/s 359.54 Melem/s]
                 change:
                        time:   [-46.368% -42.429% -39.796%] (p = 0.00 < 0.05)
                        thrpt:  [+66.102% +73.699% +86.454%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

reinerp avatar Feb 05 '23 20:02 reinerp

@reinerp Sorry for taking so long. (a whole year ...) I want to release a new version including your improvements soon. Could you resolve the conflicts and rebase on the current master? That would be of great help. Given the benchmark results I am even confident that we could make the BufferBackend the new default for the string-interner crate.

Robbepop avatar Feb 06 '24 15:02 Robbepop

On my Macbook Pro M2 these are the speed-ups I see:

get_or_intern/fill-empty/new/BufferBackend
                        time:   [2.2334 ms 2.2388 ms 2.2447 ms]
                        thrpt:  [44.549 Melem/s 44.666 Melem/s 44.775 Melem/s]
                 change:
                        time:   [+0.7617% +1.2424% +1.7823%] (p = 0.00 < 0.05)
                        thrpt:  [-1.7510% -1.2271% -0.7559%]
                        Change within noise threshold.

get_or_intern/fill-empty/with_capacity/BufferBackend
                        time:   [1.2190 ms 1.2209 ms 1.2225 ms]
                        thrpt:  [81.801 Melem/s 81.905 Melem/s 82.031 Melem/s]
                 change:
                        time:   [+2.1111% +3.3640% +4.6335%] (p = 0.00 < 0.05)
                        thrpt:  [-4.4283% -3.2545% -2.0675%]
                        Performance has regressed.

get_or_intern/already-filled/BufferBackend
                        time:   [731.74 µs 733.87 µs 735.69 µs]
                        thrpt:  [135.93 Melem/s 136.26 Melem/s 136.66 Melem/s]
                 change:
                        time:   [-1.4299% -0.5519% +0.4034%] (p = 0.25 > 0.05)
                        thrpt:  [-0.4018% +0.5550% +1.4506%]
                        No change in performance detected.

get_or_intern_static/BufferBackend/get_or_intern
                        time:   [1.1811 µs 1.1842 µs 1.1879 µs]
                        thrpt:  [43.775 Melem/s 43.913 Melem/s 44.027 Melem/s]
                 change:
                        time:   [+1.5763% +1.9736% +2.3511%] (p = 0.00 < 0.05)
                        thrpt:  [-2.2971% -1.9354% -1.5519%]
                        Performance has regressed.

get_or_intern_static/BufferBackend/get_or_intern_static
                        time:   [1.1873 µs 1.1909 µs 1.1948 µs]
                        thrpt:  [43.521 Melem/s 43.665 Melem/s 43.798 Melem/s]
                 change:
                        time:   [+1.8508% +2.2119% +2.5798%] (p = 0.00 < 0.05)
                        thrpt:  [-2.5149% -2.1640% -1.8172%]
                        Performance has regressed.

resolve/already-filled/BufferBackend
                        time:   [129.64 µs 129.76 µs 129.90 µs]
                        thrpt:  [769.85 Melem/s 770.66 Melem/s 771.39 Melem/s]
                 change:
                        time:   [-18.586% -18.438% -18.296%] (p = 0.00 < 0.05)
                        thrpt:  [+22.393% +22.606% +22.828%]
                        Performance has improved.

get/already-filled/BufferBackend
                        time:   [622.42 µs 623.86 µs 625.20 µs]
                        thrpt:  [159.95 Melem/s 160.29 Melem/s 160.66 Melem/s]
                 change:
                        time:   [-6.4554% -6.0793% -5.7250%] (p = 0.00 < 0.05)
                        thrpt:  [+6.0727% +6.4728% +6.9008%]
                        Performance has improved.

iter/already-filled/BufferBackend
                        time:   [97.685 µs 99.065 µs 100.63 µs]
                        thrpt:  [993.74 Melem/s 1.0094 Gelem/s 1.0237 Gelem/s]
                 change:
                        time:   [-86.868% -86.658% -86.434%] (p = 0.00 < 0.05)
                        thrpt:  [+637.13% +649.50% +661.49%]
                        Performance has improved.

So no speed-ups for me for get_or_intern but visible speed-ups for get, resolve and in particular iter which is nice.

Robbepop avatar May 01 '24 06:05 Robbepop

With https://github.com/Robbepop/string-interner/pull/64 we already have some heavy gains while keeping the old var-len encoding and I am not entirely sure if this PR is still adding a lot of improvements on top. @reinerp if you have time please feel free to rebase and re-evaluate. Otherwise I'd like to close this.

Robbepop avatar May 01 '24 07:05 Robbepop

Closed due to inactivity. @reinerp feel free to re-open if you want to work on this.

Robbepop avatar May 27 '24 07:05 Robbepop