containers icon indicating copy to clipboard operation
containers copied to clipboard

Optimize IntMap.Bin

Open meooow25 opened this issue 1 year ago • 12 comments

Replaces the separate Prefix and Mask fields in the Bin constructor with a single field that contains both.

This reduces the memory required by a Bin from 5 to 4 words, at the cost of more computations (which are cheap bitwise ops) being necessary for certain operations.

Implements #991 for IntMap only. The same can be done for IntSet if this is accepted.

Memory

Concretely, this reduces the memory required by an IntMap by ~12.5% (including the keys and values).

Calculations: For a map with n key-values, there are n Tips costing 3 words each and n-1 Bins costing 5 words each before this change and 4 words after. So we save about 1 out of 8 words per key-value.

Compatibility

This PR, in its current state, makes breaking changes to IntMap internals which is reflected in the exports of the internal module Data.IntMap.Internal. There is no change in the exports of any other module.

Benchmarks

Benchmarks done with GHC 9.6.3. Last updated at 4b708ba. The output is from a hacky script I put together that compares tasty-bench's csv files.

Benchmark command: cabal run <target> -- --csv <csv> +RTS -T

intmap-benchmarks

Name                          Time - - - - - - - -    Allocated - - - - -     Copied - - - - - - -
                                   A       B     %         A       B     %         A       B     %
alter                         484 μs  428 μs  -11%    2.2 MB  1.8 MB  -18%     69 KB   50 KB  -27%
delete                        109 μs   94 μs  -13%    956 KB  763 KB  -20%    148 B   115 B   -22%
foldlWithKey                  485 μs  433 μs  -10%    2.1 MB  1.7 MB  -19%     65 KB   48 KB  -26%
foldlWithKey'                  12 μs   12 μs   +0%      0 B     0 B             0 B     0 B
foldrWithKey                   24 ns   24 ns   -1%    463 B   463 B    +0%      0 B     0 B
fromAscList                    51 μs   46 μs   -9%    256 KB  223 KB  -12%     14 KB  6.1 KB  -55%
fromDistinctAscList            47 μs   43 μs   -7%    256 KB  223 KB  -12%     14 KB  5.9 KB  -58%
fromList                      122 μs   99 μs  -18%    1.0 MB  861 KB  -18%     36 KB   26 KB  -28%
insert                        122 μs   99 μs  -18%    1.0 MB  861 KB  -18%     36 KB   26 KB  -28%
insertLookupWithKey empty     519 μs  537 μs   +3%    4.0 MB  3.7 MB   -8%    158 KB  112 KB  -29%
insertLookupWithKey update    1.7 ms  1.7 ms   +1%    9.6 MB  8.8 MB   -9%    1.1 MB  990 KB  -11%
insertWith empty              127 μs  109 μs  -14%    1.0 MB  863 KB  -17%     35 KB   26 KB  -27%
insertWith update             508 μs  459 μs   -9%    2.3 MB  1.9 MB  -17%    123 KB   93 KB  -24%
insertWith' empty             129 μs  113 μs  -12%    1.0 MB  861 KB  -18%     36 KB   25 KB  -28%
insertWith' update            500 μs  440 μs  -11%    2.2 MB  1.8 MB  -17%     89 KB   59 KB  -33%
insertWithKey empty           126 μs  108 μs  -13%    1.0 MB  863 KB  -17%     35 KB   26 KB  -27%
insertWithKey update          501 μs  459 μs   -8%    2.3 MB  1.9 MB  -17%    123 KB   96 KB  -21%
insertWithKey' empty          126 μs  110 μs  -12%    1.0 MB  861 KB  -18%     36 KB   26 KB  -29%
insertWithKey' update         499 μs  436 μs  -12%    2.2 MB  1.8 MB  -17%     90 KB   61 KB  -31%
lookup_half                   159 μs  111 μs  -30%      0 B     0 B             0 B     0 B
lookup_hits                   162 μs  163 μs   +0%      0 B     0 B             0 B     0 B
lookup_misses                 162 μs  160 μs   +0%      0 B     0 B             0 B     0 B
lookup_mixed                   18 μs   17 μs   -3%      0 B     0 B             0 B     0 B
lookup_most                   155 μs  152 μs   -2%      0 B     0 B             0 B     0 B
map                            34 μs   29 μs  -14%    351 KB  319 KB   -9%     15 KB   12 KB  -17%
mapMaybe                       60 μs   56 μs   -5%    275 KB  255 KB   -6%    5.9 KB  4.9 KB  -16%
mapMaybeWithKey                59 μs   56 μs   -6%    275 KB  255 KB   -6%    5.6 KB  5.1 KB   -9%
mapWithKey                     38 μs   35 μs   -8%    414 KB  383 KB   -7%     21 KB   18 KB  -13%
minView                        40 ns   35 ns  -14%    119 B   103 B   -13%      0 B     0 B
spanAntitone                  115 ns  126 ns   +9%    749 B   655 B   -12%      0 B     0 B
split                          51 ns   48 ns   -6%    486 B   399 B   -17%      0 B     0 B
splitLookup                    51 ns   47 ns   -7%    511 B   422 B   -17%      0 B     0 B
update                        455 μs  400 μs  -12%    2.1 MB  1.7 MB  -17%     68 KB   48 KB  -28%
updateLookupWithKey           900 μs  832 μs   -7%    4.4 MB  3.6 MB  -16%    180 KB  100 KB  -44%

set-operations-intmap

Name                           Time - - - - - - - -    Allocated - - - - -     Copied - - - - - - -
                                    A       B     %         A       B     %         A       B     %
difference-block_nn             26 μs   22 μs  -14%     88 KB   71 KB  -19%    1.0 KB  660 B   -34%
difference-block_nn_swap        26 μs   22 μs  -13%     89 KB   71 KB  -20%    1.0 KB  650 B   -38%
difference-block_ns            2.3 μs  2.1 μs  -10%     13 KB   11 KB  -20%     27 B    16 B   -40%
difference-block_sn_swap       1.9 μs  1.7 μs   -9%    8.8 KB  7.1 KB  -19%     12 B     8 B   -33%
difference-common_nn           1.1 ms  898 μs  -18%    1.9 MB  1.5 MB  -20%    440 KB  300 KB  -31%
difference-common_nn_swap      555 μs  442 μs  -20%      0 B     0 B             0 B     0 B
difference-common_ns           382 μs  309 μs  -19%    1.2 MB  941 KB  -21%    175 KB  108 KB  -38%
difference-common_nt            34 μs   28 μs  -17%    102 KB   81 KB  -20%    1.3 KB  848 B   -36%
difference-common_sn_swap      160 μs  131 μs  -18%      0 B     0 B             0 B     0 B
difference-common_tn_swap       15 μs   14 μs   -3%      0 B     0 B             0 B     0 B
difference-disj_nn              57 ns   51 ns  -10%    294 B   247 B   -15%      0 B     0 B
difference-disj_nn_swap         73 ns   61 ns  -17%    471 B   383 B   -18%      0 B     0 B
difference-disj_ns              54 ns   48 ns  -11%    294 B   247 B   -15%      0 B     0 B
difference-disj_nt              50 ns   41 ns  -17%    294 B   247 B   -15%      0 B     0 B
difference-disj_sn_swap         67 ns   55 ns  -18%    390 B   319 B   -18%      0 B     0 B
difference-disj_tn_swap         47 ns   42 ns  -10%    271 B   223 B   -17%      0 B     0 B
difference-mix_nn              4.9 ms  4.9 ms   +0%    6.0 MB  5.3 MB  -11%    6.8 MB  6.6 MB   -3%
difference-mix_nn_swap         6.3 ms  4.8 ms  -23%    6.0 MB  5.3 MB  -11%    9.6 MB  6.6 MB  -31%
difference-mix_ns              343 μs  285 μs  -16%    1.3 MB  1.0 MB  -20%    205 KB  138 KB  -32%
difference-mix_nt               26 μs   22 μs  -14%    102 KB   81 KB  -20%    1.3 KB  1.0 KB  -24%
difference-mix_sn_swap         213 μs  178 μs  -16%    623 KB  543 KB  -12%     47 KB   35 KB  -24%
difference-mix_tn_swap          13 μs  9.3 μs  -27%     19 KB   17 KB  -11%     50 B    41 B   -18%
intersection-block_nn           13 μs   11 μs  -15%      0 B     0 B             0 B     0 B
intersection-block_nn_swap      13 μs   11 μs  -14%      0 B     0 B             0 B     0 B
intersection-block_ns          1.3 μs  1.2 μs   -6%      0 B     0 B             0 B     0 B
intersection-block_sn_swap     1.3 μs  1.2 μs   -7%      0 B     0 B             0 B     0 B
intersection-common_nn         1.0 ms  831 μs  -16%    1.9 MB  1.5 MB  -19%    415 KB  297 KB  -28%
intersection-common_nn_swap    956 μs  815 μs  -14%    1.9 MB  1.5 MB  -21%    456 KB  299 KB  -34%
intersection-common_ns         184 μs  183 μs   +0%    355 KB  278 KB  -21%     15 KB   10 KB  -35%
intersection-common_nt          13 μs   15 μs  +16%     12 KB  9.8 KB  -17%     23 B    13 B   -43%
intersection-common_sn_swap    202 μs  189 μs   -6%    354 KB  278 KB  -21%     15 KB  9.9 KB  -32%
intersection-common_tn_swap     14 μs   15 μs   +5%     12 KB  9.8 KB  -17%     21 B    13 B   -38%
intersection-disj_nn            35 ns   34 ns   -4%     31 B    31 B    +0%      0 B     0 B
intersection-disj_nn_swap       36 ns   37 ns   +2%     31 B    31 B    +0%      0 B     0 B
intersection-disj_ns            33 ns   32 ns   -3%     31 B    31 B    +0%      0 B     0 B
intersection-disj_nt            30 ns   29 ns   +0%     31 B    31 B    +0%      0 B     0 B
intersection-disj_sn_swap       35 ns   32 ns   -8%     31 B    31 B    +0%      0 B     0 B
intersection-disj_tn_swap       31 ns   28 ns  -10%     31 B    31 B    +0%      0 B     0 B
intersection-mix_nn            1.0 ms  837 μs  -16%      0 B     0 B             0 B     0 B
intersection-mix_nn_swap       984 μs  888 μs   -9%      0 B     0 B             0 B     0 B
intersection-mix_ns            133 μs  117 μs  -11%      0 B     0 B             0 B     0 B
intersection-mix_nt            8.7 μs  6.8 μs  -22%      0 B     0 B             0 B     0 B
intersection-mix_sn_swap       135 μs  121 μs  -10%      0 B     0 B             0 B     0 B
intersection-mix_tn_swap        12 μs  7.5 μs  -34%      0 B    25 B             0 B     0 B
union-block_nn                  33 μs   29 μs  -13%    178 KB  141 KB  -20%    3.9 KB  2.5 KB  -36%
union-block_nn_swap             32 μs   29 μs  -11%    177 KB  142 KB  -19%    3.9 KB  2.5 KB  -35%
union-block_ns                 2.7 μs  2.4 μs   -8%     22 KB   18 KB  -20%     69 B    42 B   -39%
union-block_sn_swap            2.6 μs  2.4 μs   -6%     22 KB   18 KB  -20%     69 B    42 B   -39%
union-common_nn                1.6 ms  1.4 ms  -13%    3.8 MB  3.0 MB  -21%    1.7 MB  1.2 MB  -32%
union-common_nn_swap           1.6 ms  1.4 ms  -13%    3.8 MB  3.0 MB  -21%    1.7 MB  1.2 MB  -32%
union-common_ns                393 μs  337 μs  -14%    1.5 MB  1.2 MB  -20%    288 KB  188 KB  -34%
union-common_nt                 32 μs   27 μs  -13%    114 KB   91 KB  -20%    1.6 KB  1.0 KB  -35%
union-common_sn_swap           400 μs  336 μs  -16%    1.5 MB  1.2 MB  -20%    286 KB  188 KB  -34%
union-common_tn_swap            32 μs   26 μs  -18%    114 KB   91 KB  -19%    1.6 KB  1.0 KB  -35%
union-disj_nn                   75 ns   60 ns  -19%    748 B   607 B   -18%      0 B     0 B
union-disj_nn_swap              83 ns   68 ns  -18%    748 B   607 B   -18%      0 B     0 B
union-disj_ns                   69 ns   56 ns  -19%    671 B   541 B   -19%      0 B     0 B
union-disj_nt                   58 ns   45 ns  -22%    550 B   447 B   -18%      0 B     0 B
union-disj_sn_swap              83 ns   62 ns  -25%    671 B   541 B   -19%      0 B     0 B
union-disj_tn_swap              72 ns   50 ns  -31%    549 B   447 B   -18%      0 B     0 B
union-mix_nn                    10 ms  6.9 ms  -31%    7.6 MB  6.1 MB  -20%     18 MB  9.8 MB  -47%
union-mix_nn_swap              9.3 ms  6.6 ms  -29%    7.6 MB  6.0 MB  -20%     17 MB  9.3 MB  -44%
union-mix_ns                   432 μs  446 μs   +3%    1.7 MB  1.3 MB  -20%    352 KB  372 KB   +5%
union-mix_nt                    29 μs   24 μs  -14%    114 KB   91 KB  -20%    1.6 KB  1.0 KB  -34%
union-mix_sn_swap              415 μs  361 μs  -12%    1.7 MB  1.3 MB  -19%    352 KB  268 KB  -23%
union-mix_tn_swap               27 μs   24 μs  -12%    114 KB   91 KB  -20%    1.6 KB  1.0 KB  -36%

Allocations have gone down in every benchmark, which is expected but quite pleasant to see. Runtime has gone down in most benchmarks and gone up in a few.

Notable time regressions

The only significant regressions (>10%) I see are in the union-disj and intersection-disj set operations. However I don't consider this a problem because these operations on the specific disjoint maps used in the tests wrap up in $O(W)$, due to one the left map being [1..n] and the right being [n+1...n+x]. This is measuring the cost of one set of new bitwise ops vs old bitwise ops, which is unsurprisingly a little slower.
Edit (58cbc6a): union-disj is better now. Edit (4b708ba): intersection-disj is comparable now.

Let me know if anything else stands out as troublesome.

meooow25 avatar Mar 10 '24 02:03 meooow25

This is most impressive. I will do my best to review it promptly. In the mean time, I have a big favor to ask: I was struggling the other day to merge the branch for version 0.7 into master. Do you think you could try to figure out how to do that right?

treeowl avatar Mar 10 '24 04:03 treeowl

(prefix + mask) is just a part of the key with an additional mask bit set, could this be abused for performance?

With a key and a compatible prefix, xor only yields a value at or below the level of the masking bit, whether or not the value is at the masking bit determines the direction (0 is right).

    1800 | 0000011100001000
xor 1807 | 0000011100001111
---------+-----------------
         | 0000000000000111
       
    1800 | 0000011100001000
xor 1795 | 0000011100001000
---------+-----------------
         | 0000000000001011

So the two comparison patterns could be reduced to the following:

-- Single-key operation with eager failure
trimatch p k =
  let o = xor p k
  in case () of
       () | o .&. p == 0                = goRight
          | unsafeShiftR o 1 .&. p == 0 = goLeft
          | otherwise                   = goFail

-- Merge
quadbranch p1 p2 =
  let o = xor p k `unsafeShiftR` 1
  in case () of
       () | p1 == p2      = goEqual
          | o .&. p1 == 0 = goBelowP1
          | o .&. p2 == 0 = goBelowP2
          | otherwise     = goDiffer

BurningWitness avatar Mar 10 '24 05:03 BurningWitness

@treeowl I assume you mean #994. I don't know much about the CI setup but I'll take a look.

With a key and a compatible prefix, xor only yields a value at or below the level of the masking bit, whether or not the value is at the masking bit determines the direction (0 is right).

@BurningWitness it is unclear to me how that would work. For instance, xor p k .&. p == 0, which is the same as k .&. p == p, only tells us whether p's 1-bits agree with k. But we also need to know if it the 0-bits agree.
Probably worth thinking about it though, it would be great to find a more efficient set of bitwise ops to do the job.

meooow25 avatar Mar 10 '24 08:03 meooow25

Oh, true, xor only has that guarantee if prefix matches the key, so I was wrong.

BurningWitness avatar Mar 10 '24 09:03 BurningWitness

Oh, true, xor only has that guarantee if prefix matches the key, so I was wrong.

Ah well if the prefix already matches, the branch into left or right is done with one comparison so it's hard to imagine something faster than that.
What I'd like is to find a better way for the nomatch stuff.

meooow25 avatar Mar 10 '24 14:03 meooow25

I believe nomatch should be pretty much as fast as the old one now! (thanks gcc and godbolt)

Previous: i .&. (m `xor` (-m)) == p — 1 and, 1 xor, 1 negate, 1 compare New: (i `xor` p) .&. (p `xor` (-p)) — 1 and, 2 xor, 1 negate

Correct me if I'm wrong, I know little about things at this low a level.


Now, if only shorter could be faster or avoided in some manner... I suspect that's the only thing causing minor slowdowns for the intersection_disj case.

meooow25 avatar Mar 11 '24 15:03 meooow25

I doubt the comparison costs significantly over the test, but benchmarking is the way to know for sure.

treeowl avatar Mar 11 '24 16:03 treeowl

Okay I found a better way to branch that avoids shorter. Almost every benchmark is faster or the same now! 🎉

@treeowl would you like to review? I think it's in a good shape now.

meooow25 avatar Mar 12 '24 02:03 meooow25

It would be nice to wrap p .|. (p - 1) and p .&. (p - 1) into functions named upper and lower respectively for clarity.


Also this should mean precise lookups can be expressed as

trimatch p k =
  if k < p
    then if k >= lower p
           then goLeft
           else goFail
           
    else if k <= upper p
           then goRight
           else goFail

(I think this is faster than the algorithm listed in the paper, four operations in any direction instead of four for failure plus one for picking a side)

BurningWitness avatar Mar 12 '24 05:03 BurningWitness

@BurningWitness

It would be nice to wrap p .|. (p - 1) and p .&. (p - 1) into functions named upper and lower respectively for clarity.

If that becomes the predominant way of viewing Prefix, then perhaps. Currently we also use nomatch so that is not true.

Also this should mean precise lookups can be expressed as...

I like that, since it uses the same view as mapMapBranch, but it makes no difference on benchmarks for me. A small concern (maybe?) is that this duplicates the failure branch.
So we could switch to it, but I don't really see a good reason to do so. If it is demonstrated to be better in some way, then sure I would favor it.

meooow25 avatar Mar 13 '24 17:03 meooow25

A small concern (maybe?) is that this duplicates the failure branch.

Indeed that is the case, the failure branch needs to be let-bound with a NOINLINE to avoid it. Probably won't be visible on the benchmarks this way either though.

BurningWitness avatar Mar 14 '24 14:03 BurningWitness

I made a local benchmark on a Word-based tree clone and indeed the nomatch solution is ahead performance-wise by a remarkably narrow margin, 1-2% difference.

BurningWitness avatar Mar 14 '24 15:03 BurningWitness

Hi @treeowl, please review when you get the time.

meooow25 avatar Mar 23 '24 12:03 meooow25

I'm ready to merge, but I want to make sure the introduction of the helpers in your last commit didn't hurt performance. Did you check the performance effect of that, or (alternatively) verify that the Core, unfolding, and unfolding guidance are the same?

treeowl avatar Mar 29 '24 02:03 treeowl

Awesome 🎉 I peeked at the Core for union and it seemed alright. But I will run the benchmarks a final time. I also want to go over the PR once to check for bugs, since it touches so many functions. I will update when I'm done, should be some time later today.

meooow25 avatar Mar 29 '24 02:03 meooow25

Ok I think this is good to go.

Updated the benchmarks. I found some odd increases in a few intersection tests that seemed to have originated with 82f4852, a completely unrelated change. Adding a -fproc-alignment=64 flag (as suggested by tasty-bench) fixed that.

I also added the "-with-rtsopts=-A32m" flag from tasty-bench while at it. This greatly reduced the Copied stats however, for both before and after. I think that column is less indicative of anything now, but I've kept it for the sake of it.

Let me know if things look good and I'll squash it into a few commits (benchmark flag, tests, main IntMap changes).

meooow25 avatar Mar 30 '24 03:03 meooow25

Why would you set -A32m here? That seems to change what the benchmarks are measuring quite a lot. The alignment thing seems reasonable, although it's unfortunate we have to do that.

treeowl avatar Mar 30 '24 03:03 treeowl

tasty-bench claims it makes the individual benchmarks less likely to affect one another. I thought that would be good to have as a general improvement (and not to address by any observed issue).

meooow25 avatar Mar 30 '24 03:03 meooow25

Let's wait with that one for now. I'd like to hear opinions from others on a separate PR.

treeowl avatar Mar 30 '24 03:03 treeowl

Sounds good. Removed the flag and re-ran the benchmarks.

meooow25 avatar Mar 30 '24 04:03 meooow25

Aaaaand we're merged! Fantastic work!

treeowl avatar Mar 30 '24 04:03 treeowl

Thanks for reviewing! Next up... IntSet 😄

meooow25 avatar Mar 30 '24 04:03 meooow25