containers
Map performance - use pattern match on size to reduce recursive function calls
This change incorporates the idea already used for foldMap and traverse, where we
pattern match on the size of "leaf nodes". If the size is 1, we already know both sides
are tips, and we don't need to recurse.
Of particular interest is mapMaybe, where we can also omit the extra function calls to link.
Benchmarks show improvements around 10 to 20% for most operations affected,
though foldl' seems closer to 30%.
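As a concrete illustration, here is a sketch of the size-1 special case applied to map, using a simplified stand-in for the internal Map type (field order and strictness mirror containers, but this is not the library's actual code):

```haskell
-- Simplified stand-in for containers' internal Map type.
data Map k a = Bin !Int !k a !(Map k a) !(Map k a) | Tip
  deriving (Eq, Show)

-- Plain recursive map: recurses into both children unconditionally,
-- so every leaf Bin still makes two calls that merely match Tip.
mapSimple :: (a -> b) -> Map k a -> Map k b
mapSimple _ Tip = Tip
mapSimple f (Bin sz k v l r) =
  Bin sz k (f v) (mapSimple f l) (mapSimple f r)

-- With the size-1 special case: a node of size 1 must have two Tip
-- children, so both recursive calls can be skipped.
mapBin1 :: (a -> b) -> Map k a -> Map k b
mapBin1 _ Tip = Tip
mapBin1 f (Bin 1 k v _ _) = Bin 1 k (f v) Tip Tip
mapBin1 f (Bin sz k v l r) =
  Bin sz k (f v) (mapBin1 f l) (mapBin1 f r)
```

Both versions compute the same result; the second just avoids two calls per leaf Bin.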
❯ ./map-benchmarks-with-these-changes
All
map: OK
53.6 μs ± 3.8 μs
map really: OK
128 μs ± 7.5 μs
mapWithKey: OK
66.4 μs ± 4.1 μs
foldlWithKey: OK
796 μs ± 36 μs
foldlWithKey': OK
17.7 μs ± 909 ns
foldrWithKey: OK
61.1 ns ± 4.7 ns
foldrWithKey': OK
34.9 μs ± 3.4 μs
mapMaybe: OK
112 μs ± 7.9 μs
mapMaybeWithKey: OK
111 μs ± 2.4 μs
filter: OK
49.5 μs ± 4.3 μs
filter really: OK
70.5 μs ± 4.1 μs
❯ ./map-benchmarks-as-per-master
All
map: OK
65.0 μs ± 5.5 μs
map really: OK
122 μs ± 10 μs
mapWithKey: OK
76.1 μs ± 6.6 μs
foldlWithKey: OK
821 μs ± 30 μs
foldlWithKey': OK
24.9 μs ± 1.5 μs
foldrWithKey: OK
64.5 ns ± 4.0 ns
foldrWithKey': OK
37.2 μs ± 1.8 μs
mapMaybe: OK
134 μs ± 8.1 μs
mapMaybeWithKey: OK
131 μs ± 7.3 μs
filter: OK
62.0 μs ± 5.7 μs
filter really: OK
81.0 μs ± 4.7 μs
Some (very) cursory analysis says that about 1/3 of Bin nodes in a tree should have no children, hence changes up to 30% do seem plausible.
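A fraction like this can be checked empirically by counting node shapes through Data.Map.Internal (a sketch; that module is exposed by containers but its API is explicitly unstable):

```haskell
import qualified Data.Map.Internal as MI
import qualified Data.Map.Strict as M

-- Count all Bin nodes, the Bins matching the size-1 special case,
-- and the Tips, by walking the internal tree directly.
countBins, countBin1, countTips :: MI.Map k a -> Int
countBins MI.Tip = 0
countBins (MI.Bin _ _ _ l r) = 1 + countBins l + countBins r

countBin1 MI.Tip = 0
countBin1 (MI.Bin 1 _ _ _ _) = 1
countBin1 (MI.Bin _ _ _ l r) = countBin1 l + countBin1 r

countTips MI.Tip = 1
countTips (MI.Bin _ _ _ l r) = countTips l + countTips r
```

Note that countTips always equals countBins + 1, which is why eliding Tip visits matters: there are slightly more Tips than elements.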
Pattern matching on a strict and unboxed Int should be very fast.
Thanks for testing this out. The improvements seem decent, but I've always found this undocumented special case strange, and I'm not sure we want to proliferate this pattern. If we really want to avoid visiting Tips, why not do it properly? For foldMap this can look like
foldMap :: Monoid m => (a -> m) -> Set a -> m
foldMap _ Tip = mempty
foldMap f (Bin _ x l r) = go x l r
  where
    go x l r = case (l, r) of
      (Bin _ lx ll lr, Bin _ rx rl rr) -> go lx ll lr <> f x <> go rx rl rr
      (Bin _ lx ll lr, Tip) -> go lx ll lr <> f x
      (Tip, Bin _ rx rl rr) -> f x <> go rx rl rr
      (Tip, Tip) -> f x
But I haven't measured the effects of such a change.
The version you posted is about 10 to 15% slower than what's there already. See this commit, which adds the change to foldMap.
It's worth asking why it's slower. Pattern matching on the strict packed integer should be very fast, as there's no pointer chasing or indirection involved, whereas pattern matching on the Tips does require some indirection.
From the comment in the link above, eliding 2/3 of the recursive calls to Tip is a big deal, because there are a lot of Tips, effectively about the same number as the number of items in the map.
Maybe we should add a GHC style note about this optimisation.
Edit: I've written one.
The version you posted is about 10 to 15% slower than what's there already.
Can you share more on this? This cannot be universally true; there are trade-offs here. It also disagrees with my benchmarks below.
...pattern matching on the tips does require some indirection.
It does not, thanks to GHC's pointer tagging. There should be little difference, if any, between checking if an unboxed int is 1 and checking the pointer tag.
Let's reason about the implementations to see what we can expect.
Analysis
We consider three implementations:
- simple: the simple recursive implementation
- notip: the one that never recurses on Tip (my code above)
- bin1: the one matching on Bin 1 _ Tip Tip (current foldMap and this PR)
Let's consider two classes of trees, each an extreme case for leaves.
- With balanced leaf Bins. Assume there are n leaf Bins. Then there are n-1 internal Bins and 2n Tips.

                ┌───────────B───────────┐
          ┌─────B─────┐             ┌─────B─────┐
       ┌──B──┐     ┌──B──┐       ┌──B──┐     ┌──B──┐
      ┌─B─┐ ┌─B─┐ ┌─B─┐ ┌─B─┐   ┌─B─┐ ┌─B─┐ ┌─B─┐ ┌─B─┐
      T   T T   T T   T T   T   T   T T   T T   T T   T

- With skewed leaf Bins. Assume there are two layers of n Bins at the lowest levels. Then there are n-1 internal Bins and 3n Tips.

                ┌───────────B───────────┐
          ┌─────B─────┐             ┌─────B─────┐
       ┌──B──┐     ┌──B──┐       ┌──B──┐     ┌──B──┐
      ┌─B─┐   T   ┌─B─┐   T     ┌─B─┐   T   ┌─B─┐   T
      T   T       T   T         T   T       T   T
Now we track metrics that might contribute to the cost.
- Number of recursive function calls (nrec)
- Number of times we match on Tip (ntip)
- Number of times we compare an Int (nint)
- Only for foldMap: number of _ <> mempty or mempty <> _ (nmty)
For case 1 we have
| | nrec | ntip | nint | nmty |
|---|---|---|---|---|
| simple | 4n-1 | 2n | 0 | 2n |
| notip | 2n-1 | 2n | 0 | 0 |
| bin1 | 2n-1 | 0 | 2n-1 | 0 |
For case 2 we have
| | nrec | ntip | nint | nmty |
|---|---|---|---|---|
| simple | 6n-1 | 3n | 0 | 3n |
| notip | 3n-1 | 3n | 0 | 0 |
| bin1 | 4n-1 | n | 3n-1 | n |
We can see:
- bin1 costs 0 (case 1) to 33% (case 2) more nrec compared to notip.
- bin1 costs 0 (case 1) to n (case 2) nmty, whereas notip always costs 0.
- bin1 reduces ntip but costs nint. For this to pay off, nint must be sufficiently cheaper than ntip. But if we assume ntip and nint cost the same, bin1 costs 0 (case 1) to 33% (case 2) more ntip+nint compared to notip.
We definitely cannot conclude that bin1 is always the best strategy.
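The nrec column can be sanity-checked with instrumented variants that return a call count instead of a fold result (a sketch; perfect is a made-up helper building the case-1 shape, where every leaf Bin has size 1):

```haskell
-- Minimal Set mirroring the benchmark type.
data Set a = Bin !Int !a !(Set a) !(Set a) | Tip

-- Perfectly balanced tree of the given depth: depth d has 2^d - 1 Bins
-- and 2^(d-1) leaf Bins (case 1 of the analysis).
perfect :: Int -> Set ()
perfect 0 = Tip
perfect d = let s = perfect (d - 1) in Bin (2 ^ d - 1) () s s

-- Calls made by the simple strategy: one per node, Tips included.
callsSimple :: Set a -> Int
callsSimple Tip = 1
callsSimple (Bin _ _ l r) = 1 + callsSimple l + callsSimple r

-- Calls made by the bin1 strategy: size-1 nodes never recurse,
-- so no call ever lands on a Tip in a perfectly balanced tree.
callsBin1 :: Set a -> Int
callsBin1 Tip = 1
callsBin1 (Bin 1 _ _ _) = 1
callsBin1 (Bin _ _ l r) = 1 + callsBin1 l + callsBin1 r
```

For depth 4 (n = 8 leaf Bins) this gives 4n-1 = 31 calls for simple and 2n-1 = 15 for bin1, matching the case-1 row of the table.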
Benchmarks
I benchmarked the implementations on a simple fold (equivalent to length).
Note: I'm using unboxed ints because I want to avoid heap allocations and GHC doesn't seem smart enough to unbox automatically in this case.
simple :: Set a -> Int#
simple (Bin _ _ l r) = simple l +# (1# +# simple r)
simple Tip = 0#

notip :: Set a -> Int#
notip Tip = 0#
notip (Bin sz x l r) = go sz x l r
  where
    go !_sz !_x l r = case l of
      Bin lsz lx ll lr -> case r of
        Bin rsz rx rl rr -> go lsz lx ll lr +# (1# +# go rsz rx rl rr)
        Tip -> go lsz lx ll lr +# 1#
      Tip -> case r of
        Bin rsz rx rl rr -> 1# +# go rsz rx rl rr
        Tip -> 1#

bin1 :: Set a -> Int#
bin1 (Bin 1 _ _ _) = 1#
bin1 (Bin _ _ l r) = bin1 l +# (1# +# bin1 r)
bin1 Tip = 0#
Timings on my machine with GHC 9.10.1:
depth6
case1
simple: OK
283 ns ± 25 ns
notip: OK
159 ns ± 10 ns
bin1: OK
176 ns ± 12 ns
case2
simple: OK
232 ns ± 22 ns
notip: OK
115 ns ± 11 ns
bin1: OK
164 ns ± 11 ns
depth12
case1
simple: OK
17.9 μs ± 381 ns
notip: OK
9.93 μs ± 765 ns
bin1: OK
11.0 μs ± 802 ns
case2
simple: OK
14.6 μs ± 1.3 μs
notip: OK
6.51 μs ± 253 ns
bin1: OK
10.4 μs ± 746 ns
depth18
case1
simple: OK
1.31 ms ± 55 μs
notip: OK
828 μs ± 52 μs
bin1: OK
817 μs ± 53 μs
case2
simple: OK
1.00 ms ± 85 μs
notip: OK
466 μs ± 32 μs
bin1: OK
695 μs ± 66 μs
Full source code
{- cabal:
build-depends: base, deepseq, tasty-bench
-}
{-# LANGUAGE BangPatterns, MagicHash #-}
{-# OPTIONS_GHC -O2 -fproc-alignment=64 #-}
{-# OPTIONS_GHC -ddump-simpl -ddump-cmm -ddump-to-file #-}
import Control.DeepSeq
import System.IO
import Data.Bits
import Test.Tasty.Bench
import GHC.Exts
main :: IO ()
main = do
  defaultMain
    [ bgroup ("depth" <> show d)
        [ env (pure (make1 d)) $ \t -> bgroup "case1"
            [ bench "simple" $ whnf simple' t
            , bench "notip" $ whnf notip' t
            , bench "bin1" $ whnf bin1' t
            ]
        , env (pure (make2 d)) $ \t -> bgroup "case2"
            [ bench "simple" $ whnf simple' t
            , bench "notip" $ whnf notip' t
            , bench "bin1" $ whnf bin1' t
            ]
        ]
    | d <- [6, 12, 18]
    ]
---------------
data Set a = Bin !Int !a !(Set a) !(Set a) | Tip deriving Show
instance NFData a => NFData (Set a) where
  rnf = rwhnf -- good enough for Set Int

make1 :: Int -> Set Int
make1 d = go 0 d
  where
    -- d >= 1
    go !off 1 = Bin 1 off Tip Tip
    go off d = Bin sz x (go off (d-1)) (go (x+1) (d-1))
      where
        lsz = bit (d-1) - 1
        x = off + lsz + 1
        sz = 1 + lsz + lsz

make2 :: Int -> Set Int
make2 d = go 0 d
  where
    -- d >= 2
    go !off 2 = Bin 2 (off+1) (Bin 1 off Tip Tip) Tip
    go off d = Bin sz x (go off (d-1)) (go (x+1) (d-1))
      where
        lsz = bit (d-3) * 3 - 1
        x = off + lsz + 1
        sz = 1 + lsz + lsz
---------------
wrap :: (Set a -> Int#) -> Set a -> ()
wrap f s = case f s of _ -> ()

simple' = wrap simple

simple :: Set a -> Int#
simple (Bin _ _ l r) = simple l +# (1# +# simple r)
simple Tip = 0#

notip' = wrap notip

notip :: Set a -> Int#
notip Tip = 0#
notip (Bin sz x l r) = go sz x l r
  where
    go !_sz !_x l r = case l of
      Bin lsz lx ll lr -> case r of
        Bin rsz rx rl rr -> go lsz lx ll lr +# (1# +# go rsz rx rl rr)
        Tip -> go lsz lx ll lr +# 1#
      Tip -> case r of
        Bin rsz rx rl rr -> 1# +# go rsz rx rl rr
        Tip -> 1#

bin1' = wrap bin1

bin1 :: Set a -> Int#
bin1 (Bin 1 _ _ _) = 1#
bin1 (Bin _ _ l r) = bin1 l +# (1# +# bin1 r)
bin1 Tip = 0#
The results don't disagree with the analysis.
- For case 1, notip and bin1 are pretty much equally fast.
- For case 2, bin1 takes 40-50% more time than notip.
I haven't yet thought about what we should do based on this information, but I hope the analysis and benchmarks are reasonable.
You need to add an UNPACK pragma.
data Set a = Bin {-# UNPACK #-} !Int !a !(Set a) !(Set a) | Tip deriving Show
Edit: removed incorrect numbers.
I'm definitely in agreement with the diagrams; I've considered the same things.
With regards to benchmarking, I wrote up different versions of foldMap for Data.Map proper and used the benchmark suite included.
Without an UNPACK pragma, there's an extra pointer chase.
(I think the last line in the table should be 4n-1, n, 3n-1, n).
You need to add an UNPACK pragma.
It's not necessary, because -funbox-small-strict-fields is enabled by -O.
(I think the last line in the table should be 4n-1, n, 3n-1, n).
True, fixed!
With regards to benchmarking, I wrote up different versions of foldMap for
Data.Mapproper and used the benchmark suite included.
But there are no foldMap benchmarks in https://github.com/haskell/containers/blob/master/containers-tests/benchmarks/Map.hs ?
Regarding your numbers, the allocations metric is surprising. There should be zero allocations, which I can confirm when I run my file above with +RTS -T. Do check that you're running with -O or -O2.
Oh yeah, I added one for testing; I just imported Data.Monoid and used Sum.
Hmm, my numbers above weren't run with all optimisations on.
Checking if the size is 1 takes one comparison and one conditional branch. Checking if both children are Tip requires two comparisons and two conditional branches. I would assume the comparisons are essentially free. Does the extra branch tie up branch prediction circuits?
Hmm. I'm getting significant differences with different versions of GHC.
I'm going to take some time putting together some more numbers (and maybe try to stop doing them on a laptop).
For posterity:
defaultMain
  [ bench "foldMap" $ nf (foldMap Sum) m
  , bench "foldMapWithKey" $ nf (M.foldMapWithKey (const Sum)) m
  ]
So I performed my initial analysis using GHC 8.10.7, where the bin 1 trick was the fastest.
GHC 8.10.7
All
foldMapWithKeySimple: OK
24.4 μs ± 2.2 μs
foldMapWithKeyBin1: OK
18.6 μs ± 1.7 μs
foldMapWithKeyNoTip: OK
21.9 μs ± 1.2 μs
But, in the 9 series, it looks like it has reversed (and the naïve case has improved too; it's closer to the bin 1 check case).
GHC 9.10.1
All
foldMapWithKeySimple: OK
19.4 μs ± 848 ns
foldMapWithKeyBin1: OK
18.8 μs ± 1.6 μs
foldMapWithKeyNoTip: OK
13.1 μs ± 766 ns
This is using the benchmark suite with the following additions:
foldMapWithKey :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKey f = go
  where
    go Tip = mempty
    go (Bin 1 k v _ _) = f k v
    go (Bin _ k v l r) = go l `mappend` (f k v `mappend` go r)
{-# INLINE foldMapWithKey #-}

foldMapWithKeySimple :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeySimple f = go
  where
    go Tip = mempty
    go (Bin _ k v l r) = go l `mappend` (f k v `mappend` go r)
{-# INLINE foldMapWithKeySimple #-}

foldMapWithKeyNoTip :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeyNoTip _ Tip = mempty
foldMapWithKeyNoTip f (Bin _ k0 x0 l0 r0) = go k0 x0 l0 r0
  where
    go k x l r = case (l, r) of
      (Bin _ lk lx ll lr, Bin _ rk rx rl rr) -> go lk lx ll lr <> f k x <> go rk rx rl rr
      (Bin _ lk lx ll lr, Tip) -> go lk lx ll lr <> f k x
      (Tip, Bin _ rk rx rl rr) -> f k x <> go rk rx rl rr
      (Tip, Tip) -> f k x
{-# INLINE foldMapWithKeyNoTip #-}

defaultMain
  [ bench "foldMapWithKeySimple" $ nf (M.foldMapWithKeySimple (const Sum)) m
  , bench "foldMapWithKeyBin1" $ nf (M.foldMapWithKey (const Sum)) m
  , bench "foldMapWithKeyNoTip" $ nf (M.foldMapWithKeyNoTip (const Sum)) m
  ]
I'll redo the filter, map, mapMaybe benchmarks on a later GHC and report back.
These are my numbers with GHC 9.10.
The "simple" cases are those in master currently, which don't use the bin 1 optimisation.
For foldMap, I think we should probably go for using the "no tip" version. It's much faster with a modern GHC.
For map, filter, foldl', and mapMaybe, I think using the Bin 1 optimisation is worth it: it's a small amount of additional (and simple) code, and the improvements are pretty substantial.
What do you think?
foldMapWithKeySimple: OK
19.9 μs ± 1.8 μs
foldMapWithKeyBin1: OK
18.4 μs ± 1.4 μs
foldMapWithKeyNoTip: OK
13.6 μs ± 742 ns
map: OK
32.3 μs ± 2.9 μs
map simple: OK
41.6 μs ± 2.0 μs
filter: OK
47.4 μs ± 2.0 μs
filterSimple: OK
58.5 μs ± 5.4 μs
foldlWithKey': OK
11.5 μs ± 898 ns
foldlWithKey'simple: OK
15.8 μs ± 1.5 μs
mapMaybeWithKey: OK
99.5 μs ± 5.2 μs
mapMaybeWithKeySimple: OK
109 μs ± 7.4 μs
Which is the "no tip" version?
No Tip is Soumik's alternate traversal above.
foldMapWithKeyNoTip :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeyNoTip _ Tip = mempty
foldMapWithKeyNoTip f (Bin _ k0 x0 l0 r0) = go k0 x0 l0 r0
  where
    go k x l r = case (l, r) of
      (Bin _ lk lx ll lr, Bin _ rk rx rl rr) -> go lk lx ll lr <> f k x <> go rk rx rl rr
      (Bin _ lk lx ll lr, Tip) -> go lk lx ll lr <> f k x
      (Tip, Bin _ rk rx rl rr) -> f k x <> go rk rx rl rr
      (Tip, Tip) -> f k x
{-# INLINE foldMapWithKeyNoTip #-}
Prior to GHC 9.4, the Bin 1 technique is faster for foldMap.
It seems to me that GHC 9.4's improvements to pointer tagging (tag inference in particular) make a huge improvement in this case.
GHC 9.2.4
foldMapWithKeySimple: OK
24.6 μs ± 2.0 μs
foldMapWithKeyBin1: OK
17.7 μs ± 1.0 μs
foldMapWithKeyNoTip: OK
19.4 μs ± 607 ns
GHC 9.4.7
foldMapWithKeySimple: OK
19.7 μs ± 924 ns
foldMapWithKeyBin1: OK
18.4 μs ± 870 ns
foldMapWithKeyNoTip: OK
13.0 μs ± 715 ns
It seems plausible that rethinking more traversals along these lines could work well, but they are more complex, and my attempts with foldl' were not fruitful (2x slower than the naïve case, let alone the bin 1 version).
So I think just taking the 25% improvements with the Bin 1 trick for map and filter and getting full benefit of GHC's improvements on foldMap seems like a good compromise.
In terms of this PR, all functions which are touched are improvements (and it doesn't touch foldMap anyway).
I did some analysis of the maps used in the benchmark suite. Because they use the optimised ascending functions, they produce extremely balanced trees... only 1 node has a single child and 50% match the Bin 1 condition.
I'm going to test these metrics against shuffled inputs. But it might be all bullshit (including the initial use of it?).
I'm sorry guys, this has been a real roundabout.
I'm sorry guys, this has been a real roundabout.
More experimentation is good! Don't worry about it.
I did some analysis of the maps used in the benchmark suite. Because they use the optimised ascending functions, they produce extremely balanced trees... only 1 node has a single child and 50% match the Bin 1 condition.
I'm going to test these metrics against shuffled inputs.
Sounds good. I think we should update our Set/Map benchmarks to run on randomly structured trees instead of the very specific fromList [1..n] wherever appropriate. It would be more representative of an "average" case.
Creating the Set/Map from a shuffled list should do it, as you said. I can make a PR for it this weekend. If you want to beat me to it, go for it. I added a few random tests in #1056, some of that code will be useful.
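A deterministic sketch of building such a randomly structured map, using a cheap multiplicative hash as a stand-in for a real shuffle (pseudoShuffle and shuffledMap are made-up names, not benchmark-suite code):

```haskell
import qualified Data.Map.Strict as M
import Data.List (sortOn)

-- Order keys by a multiplicative hash so fromList sees an unsorted
-- sequence and builds the tree by ordinary inserts rather than the
-- optimised ascending-input path. A real benchmark would use a proper RNG.
pseudoShuffle :: [Int] -> [Int]
pseudoShuffle = sortOn (\k -> (k * 2654435761) `mod` 4294967296)

shuffledMap :: Int -> M.Map Int Int
shuffledMap n = M.fromList [(k, k) | k <- pseudoShuffle [1 .. n]]
```

The resulting map has the same contents as fromList [1..n], but its internal tree shape comes from rebalancing during unordered inserts, which is closer to an "average" case.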
Just a drive-by comment: if this general path ends up being a win overall, then it may be worth considering a redesign of the datatype to avoid or reduce the occurrence of Tip in the first place. (Of course, doing so would require a whole lot of benchmarking to make sure to avoid regressions elsewhere.) I did a lot of that sort of datatype redesign when developing my bytestring-trie, and it can yield significant wins so long as the algorithms can avoid a proliferation of special-case branching.
That's a good idea, I've opened #1073 to track it.