Optimize IntMap.Bin
Replaces the separate Prefix and Mask fields in the Bin constructor with a single field that contains both.
This reduces the memory required by a Bin from 5 words to 4, at the cost of a few extra (but cheap) bitwise operations in certain functions.
Implements #991 for IntMap only. The same can be done for IntSet if this is accepted.
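To make the change concrete, here is a minimal sketch of the two representations (constructor and field names are illustrative, not necessarily the exact ones in Data.IntMap.Internal):

-- Before: prefix and mask stored as two separate strict Int fields.
data IntMap a
  = Bin !Int          -- shared prefix of all keys in this subtree
        !Int          -- mask: the single bit on which the two subtrees differ
        !(IntMap a)   -- keys with that bit unset
        !(IntMap a)   -- keys with that bit set
  | Tip !Int a
  | Nil

-- After: one strict field holding the shared prefix with the mask bit also
-- set; since all prefix bits at and below the mask are zero, the mask can be
-- recovered as the lowest set bit of this field.
data IntMap' a
  = Bin' !Int !(IntMap' a) !(IntMap' a)
  | Tip' !Int a
  | Nil'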
Memory
Concretely, this reduces the memory required by an IntMap by ~12.5% (including the keys and values).
Calculations: For a map with n key-values, there are n Tips costing 3 words each and n-1 Bins costing 5 words each before this change and 4 words after. That is 3n + 5(n-1) ≈ 8n words before versus 3n + 4(n-1) ≈ 7n words after, so we save about 1 out of every 8 words per key-value.
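As a back-of-the-envelope check (function names here are just for illustration), the per-map word counts work out as:

wordsBefore, wordsAfter :: Int -> Int
wordsBefore n = 3 * n + 5 * (n - 1)   -- n Tips at 3 words + (n-1) old Bins at 5 words = 8n - 5
wordsAfter  n = 3 * n + 4 * (n - 1)   -- n Tips at 3 words + (n-1) new Bins at 4 words = 7n - 4
-- saving: (n - 1) / (8n - 5), which approaches 1/8 = 12.5% for large n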
Compatibility
This PR, in its current state, makes breaking changes to IntMap internals, which are reflected in the exports of the internal module Data.IntMap.Internal. There is no change in the exports of any other module.
Benchmarks
Benchmarks done with GHC 9.6.3. Last updated at 4b708ba. The output is from a hacky script I put together that compares tasty-bench's csv files.
Benchmark command: cabal run <target> -- --csv <csv> +RTS -T
intmap-benchmarks
Name | Time: A, B, % | Allocated: A, B, % | Copied: A, B, %
(A = before this change, B = after this change, % = relative change; the % column is omitted where A is 0 B)
alter 484 μs 428 μs -11% 2.2 MB 1.8 MB -18% 69 KB 50 KB -27%
delete 109 μs 94 μs -13% 956 KB 763 KB -20% 148 B 115 B -22%
foldlWithKey 485 μs 433 μs -10% 2.1 MB 1.7 MB -19% 65 KB 48 KB -26%
foldlWithKey' 12 μs 12 μs +0% 0 B 0 B 0 B 0 B
foldrWithKey 24 ns 24 ns -1% 463 B 463 B +0% 0 B 0 B
fromAscList 51 μs 46 μs -9% 256 KB 223 KB -12% 14 KB 6.1 KB -55%
fromDistinctAscList 47 μs 43 μs -7% 256 KB 223 KB -12% 14 KB 5.9 KB -58%
fromList 122 μs 99 μs -18% 1.0 MB 861 KB -18% 36 KB 26 KB -28%
insert 122 μs 99 μs -18% 1.0 MB 861 KB -18% 36 KB 26 KB -28%
insertLookupWithKey empty 519 μs 537 μs +3% 4.0 MB 3.7 MB -8% 158 KB 112 KB -29%
insertLookupWithKey update 1.7 ms 1.7 ms +1% 9.6 MB 8.8 MB -9% 1.1 MB 990 KB -11%
insertWith empty 127 μs 109 μs -14% 1.0 MB 863 KB -17% 35 KB 26 KB -27%
insertWith update 508 μs 459 μs -9% 2.3 MB 1.9 MB -17% 123 KB 93 KB -24%
insertWith' empty 129 μs 113 μs -12% 1.0 MB 861 KB -18% 36 KB 25 KB -28%
insertWith' update 500 μs 440 μs -11% 2.2 MB 1.8 MB -17% 89 KB 59 KB -33%
insertWithKey empty 126 μs 108 μs -13% 1.0 MB 863 KB -17% 35 KB 26 KB -27%
insertWithKey update 501 μs 459 μs -8% 2.3 MB 1.9 MB -17% 123 KB 96 KB -21%
insertWithKey' empty 126 μs 110 μs -12% 1.0 MB 861 KB -18% 36 KB 26 KB -29%
insertWithKey' update 499 μs 436 μs -12% 2.2 MB 1.8 MB -17% 90 KB 61 KB -31%
lookup_half 159 μs 111 μs -30% 0 B 0 B 0 B 0 B
lookup_hits 162 μs 163 μs +0% 0 B 0 B 0 B 0 B
lookup_misses 162 μs 160 μs +0% 0 B 0 B 0 B 0 B
lookup_mixed 18 μs 17 μs -3% 0 B 0 B 0 B 0 B
lookup_most 155 μs 152 μs -2% 0 B 0 B 0 B 0 B
map 34 μs 29 μs -14% 351 KB 319 KB -9% 15 KB 12 KB -17%
mapMaybe 60 μs 56 μs -5% 275 KB 255 KB -6% 5.9 KB 4.9 KB -16%
mapMaybeWithKey 59 μs 56 μs -6% 275 KB 255 KB -6% 5.6 KB 5.1 KB -9%
mapWithKey 38 μs 35 μs -8% 414 KB 383 KB -7% 21 KB 18 KB -13%
minView 40 ns 35 ns -14% 119 B 103 B -13% 0 B 0 B
spanAntitone 115 ns 126 ns +9% 749 B 655 B -12% 0 B 0 B
split 51 ns 48 ns -6% 486 B 399 B -17% 0 B 0 B
splitLookup 51 ns 47 ns -7% 511 B 422 B -17% 0 B 0 B
update 455 μs 400 μs -12% 2.1 MB 1.7 MB -17% 68 KB 48 KB -28%
updateLookupWithKey 900 μs 832 μs -7% 4.4 MB 3.6 MB -16% 180 KB 100 KB -44%
set-operations-intmap
Name | Time: A, B, % | Allocated: A, B, % | Copied: A, B, %
difference-block_nn 26 μs 22 μs -14% 88 KB 71 KB -19% 1.0 KB 660 B -34%
difference-block_nn_swap 26 μs 22 μs -13% 89 KB 71 KB -20% 1.0 KB 650 B -38%
difference-block_ns 2.3 μs 2.1 μs -10% 13 KB 11 KB -20% 27 B 16 B -40%
difference-block_sn_swap 1.9 μs 1.7 μs -9% 8.8 KB 7.1 KB -19% 12 B 8 B -33%
difference-common_nn 1.1 ms 898 μs -18% 1.9 MB 1.5 MB -20% 440 KB 300 KB -31%
difference-common_nn_swap 555 μs 442 μs -20% 0 B 0 B 0 B 0 B
difference-common_ns 382 μs 309 μs -19% 1.2 MB 941 KB -21% 175 KB 108 KB -38%
difference-common_nt 34 μs 28 μs -17% 102 KB 81 KB -20% 1.3 KB 848 B -36%
difference-common_sn_swap 160 μs 131 μs -18% 0 B 0 B 0 B 0 B
difference-common_tn_swap 15 μs 14 μs -3% 0 B 0 B 0 B 0 B
difference-disj_nn 57 ns 51 ns -10% 294 B 247 B -15% 0 B 0 B
difference-disj_nn_swap 73 ns 61 ns -17% 471 B 383 B -18% 0 B 0 B
difference-disj_ns 54 ns 48 ns -11% 294 B 247 B -15% 0 B 0 B
difference-disj_nt 50 ns 41 ns -17% 294 B 247 B -15% 0 B 0 B
difference-disj_sn_swap 67 ns 55 ns -18% 390 B 319 B -18% 0 B 0 B
difference-disj_tn_swap 47 ns 42 ns -10% 271 B 223 B -17% 0 B 0 B
difference-mix_nn 4.9 ms 4.9 ms +0% 6.0 MB 5.3 MB -11% 6.8 MB 6.6 MB -3%
difference-mix_nn_swap 6.3 ms 4.8 ms -23% 6.0 MB 5.3 MB -11% 9.6 MB 6.6 MB -31%
difference-mix_ns 343 μs 285 μs -16% 1.3 MB 1.0 MB -20% 205 KB 138 KB -32%
difference-mix_nt 26 μs 22 μs -14% 102 KB 81 KB -20% 1.3 KB 1.0 KB -24%
difference-mix_sn_swap 213 μs 178 μs -16% 623 KB 543 KB -12% 47 KB 35 KB -24%
difference-mix_tn_swap 13 μs 9.3 μs -27% 19 KB 17 KB -11% 50 B 41 B -18%
intersection-block_nn 13 μs 11 μs -15% 0 B 0 B 0 B 0 B
intersection-block_nn_swap 13 μs 11 μs -14% 0 B 0 B 0 B 0 B
intersection-block_ns 1.3 μs 1.2 μs -6% 0 B 0 B 0 B 0 B
intersection-block_sn_swap 1.3 μs 1.2 μs -7% 0 B 0 B 0 B 0 B
intersection-common_nn 1.0 ms 831 μs -16% 1.9 MB 1.5 MB -19% 415 KB 297 KB -28%
intersection-common_nn_swap 956 μs 815 μs -14% 1.9 MB 1.5 MB -21% 456 KB 299 KB -34%
intersection-common_ns 184 μs 183 μs +0% 355 KB 278 KB -21% 15 KB 10 KB -35%
intersection-common_nt 13 μs 15 μs +16% 12 KB 9.8 KB -17% 23 B 13 B -43%
intersection-common_sn_swap 202 μs 189 μs -6% 354 KB 278 KB -21% 15 KB 9.9 KB -32%
intersection-common_tn_swap 14 μs 15 μs +5% 12 KB 9.8 KB -17% 21 B 13 B -38%
intersection-disj_nn 35 ns 34 ns -4% 31 B 31 B +0% 0 B 0 B
intersection-disj_nn_swap 36 ns 37 ns +2% 31 B 31 B +0% 0 B 0 B
intersection-disj_ns 33 ns 32 ns -3% 31 B 31 B +0% 0 B 0 B
intersection-disj_nt 30 ns 29 ns +0% 31 B 31 B +0% 0 B 0 B
intersection-disj_sn_swap 35 ns 32 ns -8% 31 B 31 B +0% 0 B 0 B
intersection-disj_tn_swap 31 ns 28 ns -10% 31 B 31 B +0% 0 B 0 B
intersection-mix_nn 1.0 ms 837 μs -16% 0 B 0 B 0 B 0 B
intersection-mix_nn_swap 984 μs 888 μs -9% 0 B 0 B 0 B 0 B
intersection-mix_ns 133 μs 117 μs -11% 0 B 0 B 0 B 0 B
intersection-mix_nt 8.7 μs 6.8 μs -22% 0 B 0 B 0 B 0 B
intersection-mix_sn_swap 135 μs 121 μs -10% 0 B 0 B 0 B 0 B
intersection-mix_tn_swap 12 μs 7.5 μs -34% 0 B 25 B 0 B 0 B
union-block_nn 33 μs 29 μs -13% 178 KB 141 KB -20% 3.9 KB 2.5 KB -36%
union-block_nn_swap 32 μs 29 μs -11% 177 KB 142 KB -19% 3.9 KB 2.5 KB -35%
union-block_ns 2.7 μs 2.4 μs -8% 22 KB 18 KB -20% 69 B 42 B -39%
union-block_sn_swap 2.6 μs 2.4 μs -6% 22 KB 18 KB -20% 69 B 42 B -39%
union-common_nn 1.6 ms 1.4 ms -13% 3.8 MB 3.0 MB -21% 1.7 MB 1.2 MB -32%
union-common_nn_swap 1.6 ms 1.4 ms -13% 3.8 MB 3.0 MB -21% 1.7 MB 1.2 MB -32%
union-common_ns 393 μs 337 μs -14% 1.5 MB 1.2 MB -20% 288 KB 188 KB -34%
union-common_nt 32 μs 27 μs -13% 114 KB 91 KB -20% 1.6 KB 1.0 KB -35%
union-common_sn_swap 400 μs 336 μs -16% 1.5 MB 1.2 MB -20% 286 KB 188 KB -34%
union-common_tn_swap 32 μs 26 μs -18% 114 KB 91 KB -19% 1.6 KB 1.0 KB -35%
union-disj_nn 75 ns 60 ns -19% 748 B 607 B -18% 0 B 0 B
union-disj_nn_swap 83 ns 68 ns -18% 748 B 607 B -18% 0 B 0 B
union-disj_ns 69 ns 56 ns -19% 671 B 541 B -19% 0 B 0 B
union-disj_nt 58 ns 45 ns -22% 550 B 447 B -18% 0 B 0 B
union-disj_sn_swap 83 ns 62 ns -25% 671 B 541 B -19% 0 B 0 B
union-disj_tn_swap 72 ns 50 ns -31% 549 B 447 B -18% 0 B 0 B
union-mix_nn 10 ms 6.9 ms -31% 7.6 MB 6.1 MB -20% 18 MB 9.8 MB -47%
union-mix_nn_swap 9.3 ms 6.6 ms -29% 7.6 MB 6.0 MB -20% 17 MB 9.3 MB -44%
union-mix_ns 432 μs 446 μs +3% 1.7 MB 1.3 MB -20% 352 KB 372 KB +5%
union-mix_nt 29 μs 24 μs -14% 114 KB 91 KB -20% 1.6 KB 1.0 KB -34%
union-mix_sn_swap 415 μs 361 μs -12% 1.7 MB 1.3 MB -19% 352 KB 268 KB -23%
union-mix_tn_swap 27 μs 24 μs -12% 114 KB 91 KB -20% 1.6 KB 1.0 KB -36%
Allocations have gone down in every benchmark, which is expected but quite pleasant to see. Runtime has gone down in most benchmarks and gone up in a few.
Notable time regressions
The only significant regressions (>10%) I see are in the union-disj and intersection-disj set operations. However, I don't consider this a problem: on the specific disjoint maps used in the tests these operations finish in $O(W)$, because the left map is [1..n] and the right is [n+1..n+x]. So they mostly measure one set of new bitwise ops against one set of old bitwise ops, and the new set is unsurprisingly a little slower.
Edit (58cbc6a): union-disj is better now.
Edit (4b708ba): intersection-disj is comparable now.
Let me know if anything else stands out as troublesome.
This is most impressive. I will do my best to review it promptly. In the meantime, I have a big favor to ask: I was struggling the other day to merge the branch for version 0.7 into master. Do you think you could try to figure out how to do that right?
Since (prefix + mask) is just part of the key with an additional mask bit set, could this be exploited for performance?
With a key and a compatible prefix, xor only yields bits at or below the masking bit; whether the result has the masking bit set determines the direction (0 is right).
1800 | 0000011100001000
xor 1807 | 0000011100001111
---------+-----------------
| 0000000000000111
1800 | 0000011100001000
xor 1795 | 0000011100000011
---------+-----------------
| 0000000000001011
So the two comparison patterns could be reduced to the following:
-- Single-key operation with eager failure
trimatch p k =
  let o = xor p k
  in case () of
       () | o .&. p == 0                -> goRight
          | unsafeShiftR o 1 .&. p == 0 -> goLeft
          | otherwise                   -> goFail

-- Merge
quadbranch p1 p2 =
  let o = xor p1 p2 `unsafeShiftR` 1
  in case () of
       () | p1 == p2      -> goEqual
          | o .&. p1 == 0 -> goBelowP1
          | o .&. p2 == 0 -> goBelowP2
          | otherwise     -> goDiffer
@treeowl I assume you mean #994. I don't know much about the CI setup but I'll take a look.
With a key and a compatible prefix, xor only yields bits at or below the masking bit; whether the result has the masking bit set determines the direction (0 is right).
@BurningWitness it is unclear to me how that would work. For instance, xor p k .&. p == 0, which is the same as k .&. p == p, only tells us whether p's 1-bits agree with k. But we also need to know whether the 0-bits agree.
It's probably worth thinking about though; it would be great to find a more efficient set of bitwise ops for the job.
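A made-up counterexample (not from the PR) of why checking only the 1-bits is insufficient: take p = 12 (binary 01100, mask bit 2, so it covers keys 8..15) and k = 28 (binary 11100), which lies outside that range:

{-# LANGUAGE BinaryLiterals #-}
import Data.Bits (xor, (.&.))

p, k :: Int
p = 0b01100   -- combined prefix 12: covers keys 8..15
k = 0b11100   -- key 28: differs from p in a 0-bit above the mask bit

onesOnlyTest :: Bool
onesOnlyTest = (p `xor` k) .&. p == 0   -- True: all of p's 1-bits agree with k

main :: IO ()
main = print onesOnlyTest   -- prints True even though k is not under this prefix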
Oh, true, xor only has that guarantee if prefix matches the key, so I was wrong.
Oh, true, xor only has that guarantee if prefix matches the key, so I was wrong.
Ah well if the prefix already matches, the branch into left or right is done with one comparison so it's hard to imagine something faster than that.
What I'd like is to find a better way for the nomatch stuff.
I believe nomatch should be pretty much as fast as the old one now! (thanks gcc and godbolt)
Previous: i .&. (m `xor` (-m)) == p — 1 and, 1 xor, 1 negate, 1 compare
New: (i `xor` p) .&. (p `xor` (-p)) — 1 and, 2 xor, 1 negate
Correct me if I'm wrong, I know little about things at this low a level.
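Spelled out as Haskell nomatch tests for comparison (a sketch with illustrative names, not the exact library code; pfx and m are the old separate prefix and mask, p is the new combined field whose lowest set bit is the mask):

import Data.Bits (xor, (.&.))

-- Old: keep the bits of i above the mask bit and compare against the prefix
-- (which, by the old invariant, has all bits at and below the mask set to 0).
nomatchOld :: Int -> Int -> Int -> Bool
nomatchOld i pfx m = i .&. (m `xor` negate m) /= pfx

-- New: (p `xor` negate p) keeps exactly the bits above p's lowest set bit
-- (the mask bit), so this checks whether i and p disagree above the mask.
nomatchNew :: Int -> Int -> Bool
nomatchNew i p = (i `xor` p) .&. (p `xor` negate p) /= 0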
Now, if only shorter could be faster or avoided in some manner...
I suspect that's the only thing causing minor slowdowns for the intersection-disj case.
I doubt that comparison adds significant cost to the test, but benchmarking is the way to know for sure.
Okay I found a better way to branch that avoids shorter. Almost every benchmark is faster or the same now! 🎉
@treeowl would you like to review? I think it's in good shape now.
It would be nice to wrap p .|. (p - 1) and p .&. (p - 1) into functions named upper and lower respectively for clarity.
Also this should mean precise lookups can be expressed as
trimatch p k =
  if k < p
    then if k >= lower p
           then goLeft
           else goFail
    else if k <= upper p
           then goRight
           else goFail
(I think this is faster than the algorithm listed in the paper, four operations in any direction instead of four for failure plus one for picking a side)
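Those helpers, written out (a sketch; this glosses over negative keys, which need extra care because of the sign bit):

import Data.Bits ((.|.), (.&.))

-- Helpers as suggested above: the smallest and largest keys that can live
-- under a Bin whose combined prefix (mask bit set) is p.
lower, upper :: Int -> Int
lower p = p .&. (p - 1)   -- clear the mask bit; every bit below it is already 0
upper p = p .|. (p - 1)   -- set every bit below the mask bit

-- For p = 12 (binary 01100): lower p = 8, upper p = 15, so the Bin covers 8..15.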
@BurningWitness
It would be nice to wrap p .|. (p - 1) and p .&. (p - 1) into functions named upper and lower respectively for clarity.
If that becomes the predominant way of viewing Prefix, then perhaps. Currently we also use nomatch, so that is not the case yet.
Also this should mean precise lookups can be expressed as...
I like that, since it uses the same view as mapMapBranch, but it makes no difference on benchmarks for me. A small concern (maybe?) is that this duplicates the failure branch.
So we could switch to it, but I don't really see a good reason to do so. If it is demonstrated to be better in some way, then sure I would favor it.
A small concern (maybe?) is that this duplicates the failure branch.
Indeed that is the case; the failure branch needs to be let-bound with a NOINLINE to avoid it. It probably won't be visible in the benchmarks this way either, though.
I made a local benchmark on a Word-based tree clone, and indeed the nomatch solution is ahead performance-wise, by a remarkably narrow margin of 1-2%.
Hi @treeowl, please review when you get the time.
I'm ready to merge, but I want to make sure the introduction of the helpers in your last commit didn't hurt performance. Did you check the performance effect of that, or (alternatively) verify that the Core, unfolding, and unfolding guidance are the same?
Awesome 🎉
I peeked at the Core for union and it seemed alright, but I will run the benchmarks a final time. I also want to go over the PR once more to check for bugs, since it touches so many functions. I'll update when I'm done, which should be some time later today.
Ok I think this is good to go.
Updated the benchmarks. I found some odd increases in a few intersection tests that seemed to have originated with 82f4852, a completely unrelated change. Adding a -fproc-alignment=64 flag (as suggested by tasty-bench) fixed that.
I also added the "-with-rtsopts=-A32m" flag from tasty-bench while at it. However, this greatly reduced the Copied stats for both before and after, so I think that column is less meaningful now; I've kept it anyway for reference.
Let me know if things look good and I'll squash it into a few commits (benchmark flag, tests, main IntMap changes).
Why would you set -A32m here? That seems to change what the benchmarks are measuring quite a lot. The alignment thing seems reasonable, although it's unfortunate we have to do that.
tasty-bench claims it makes the individual benchmarks less likely to affect one another. I thought that would be good to have as a general improvement (not to address any observed issue).
Let's wait with that one for now. I'd like to hear opinions from others on a separate PR.
Sounds good. Removed the flag and re-ran the benchmarks.
Aaaaand we're merged! Fantastic work!
Thanks for reviewing! Next up... IntSet 😄