containers
Map performance - use pattern match on size to reduce recursive function calls
This change incorporates the idea already used for foldMap and traverse, where we
pattern match on the size of "leaf nodes". If the size is 1, we already know both sides
are tips, and we don't need to recurse.
Of particular interest is mapMaybe, where we can also omit the extra function calls to link.
Benchmarks show improvements around 10 to 20% for most operations affected,
though foldl' seems closer to 30%.
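As a concrete illustration, here is a sketch of the size-1 special case applied to map, using a simplified stand-in for the internal Map type (field order and strictness mirror containers, but this is not the library's actual code):

```haskell
-- Simplified stand-in for containers' internal Map type.
data Map k a = Bin !Int !k a !(Map k a) !(Map k a) | Tip
  deriving (Eq, Show)

-- Plain recursive map: recurses into both children unconditionally,
-- so every leaf Bin still makes two calls that merely match Tip.
mapSimple :: (a -> b) -> Map k a -> Map k b
mapSimple _ Tip = Tip
mapSimple f (Bin sz k v l r) =
  Bin sz k (f v) (mapSimple f l) (mapSimple f r)

-- With the size-1 special case: a node of size 1 must have two Tip
-- children, so both recursive calls can be skipped.
mapBin1 :: (a -> b) -> Map k a -> Map k b
mapBin1 _ Tip = Tip
mapBin1 f (Bin 1 k v _ _) = Bin 1 k (f v) Tip Tip
mapBin1 f (Bin sz k v l r) =
  Bin sz k (f v) (mapBin1 f l) (mapBin1 f r)
```

Both versions compute the same result; the second just avoids two calls per leaf Bin.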
❯ ./map-benchmarks-with-these-changes
All
map: OK
53.6 μs ± 3.8 μs
map really: OK
128 μs ± 7.5 μs
mapWithKey: OK
66.4 μs ± 4.1 μs
foldlWithKey: OK
796 μs ± 36 μs
foldlWithKey': OK
17.7 μs ± 909 ns
foldrWithKey: OK
61.1 ns ± 4.7 ns
foldrWithKey': OK
34.9 μs ± 3.4 μs
mapMaybe: OK
112 μs ± 7.9 μs
mapMaybeWithKey: OK
111 μs ± 2.4 μs
filter: OK
49.5 μs ± 4.3 μs
filter really: OK
70.5 μs ± 4.1 μs
❯ ./map-benchmarks-as-per-master
All
map: OK
65.0 μs ± 5.5 μs
map really: OK
122 μs ± 10 μs
mapWithKey: OK
76.1 μs ± 6.6 μs
foldlWithKey: OK
821 μs ± 30 μs
foldlWithKey': OK
24.9 μs ± 1.5 μs
foldrWithKey: OK
64.5 ns ± 4.0 ns
foldrWithKey': OK
37.2 μs ± 1.8 μs
mapMaybe: OK
134 μs ± 8.1 μs
mapMaybeWithKey: OK
131 μs ± 7.3 μs
filter: OK
62.0 μs ± 5.7 μs
filter really: OK
81.0 μs ± 4.7 μs
Some (very) cursory analysis says that about 1/3 of Bin nodes in a tree should have no children, hence changes up to 30% do seem plausible.
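A fraction like this can be checked empirically by counting node shapes through Data.Map.Internal (a sketch; that module is exposed by containers but its API is explicitly unstable):

```haskell
import qualified Data.Map.Internal as MI
import qualified Data.Map.Strict as M

-- Count all Bin nodes, the Bins matching the size-1 special case,
-- and the Tips, by walking the internal tree directly.
countBins, countBin1, countTips :: MI.Map k a -> Int
countBins MI.Tip = 0
countBins (MI.Bin _ _ _ l r) = 1 + countBins l + countBins r

countBin1 MI.Tip = 0
countBin1 (MI.Bin 1 _ _ _ _) = 1
countBin1 (MI.Bin _ _ _ l r) = countBin1 l + countBin1 r

countTips MI.Tip = 1
countTips (MI.Bin _ _ _ l r) = countTips l + countTips r
```

Note that countTips always equals countBins + 1, which is why eliding Tip visits matters: there are slightly more Tips than elements.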
Pattern matching on a strict and unboxed Int should be very fast.
Thanks for testing this out. The improvements seem decent, but I've always found this undocumented special case strange, and I'm not sure we want to proliferate this pattern. If we really want to avoid visiting Tips, why not do it properly? For foldMap this can look like
foldMap :: Monoid m => (a -> m) -> Set a -> m
foldMap _ Tip = mempty
foldMap f (Bin _ x l r) = go x l r
  where
    go x l r = case (l, r) of
      (Bin _ lx ll lr, Bin _ rx rl rr) -> go lx ll lr <> f x <> go rx rl rr
      (Bin _ lx ll lr, Tip) -> go lx ll lr <> f x
      (Tip, Bin _ rx rl rr) -> f x <> go rx rl rr
      (Tip, Tip) -> f x
But I haven't measured the effects of such a change.
The version you posted is about 10 to 15% slower than what's there already. See this commit, which adds the change to foldMap.
It's worth asking why it's slower. Pattern matching on the strict packed integer should be very fast, as there's no pointer chasing or indirection involved, whereas pattern matching on the Tips does require some indirection.
From the comment in the link above, eliding 2/3 of the recursive calls to Tip is a big deal, because there are a lot of Tips, effectively about the same number as the number of items in the map.
Maybe we should add a GHC style note about this optimisation.
Edit: I've written one.
The version you posted is about 10 to 15% slower than what's there already.
Can you share more on this? This cannot be universally true; there are trade-offs here. It also disagrees with my benchmarks below.
...pattern matching on the tips does require some indirection.
It does not, thanks to GHC's pointer tagging. There should be little difference, if any, between checking if an unboxed int is 1 and checking the pointer tag.
Let's reason about the implementations to see what we can expect.
Analysis
We consider three implementations:
- simple: the simple recursive implementation
- notip: the one that never recurses on Tip (my code above)
- bin1: the one matching on Bin 1 _ Tip Tip (current foldMap and this PR)
Let's consider two classes of trees, each an extreme case for leaves.
- With balanced leaf Bins. Assume there are n leaf Bins. Then there are n-1 internal Bins and 2n Tips.

                ┌───────────B───────────┐
          ┌─────B─────┐             ┌─────B─────┐
       ┌──B──┐     ┌──B──┐       ┌──B──┐     ┌──B──┐
      ┌─B─┐ ┌─B─┐ ┌─B─┐ ┌─B─┐   ┌─B─┐ ┌─B─┐ ┌─B─┐ ┌─B─┐
      T   T T   T T   T T   T   T   T T   T T   T T   T

- With skewed leaf Bins. Assume there are two layers of n Bins at the lowest levels. Then there are n-1 internal Bins and 3n Tips.

                ┌───────────B───────────┐
          ┌─────B─────┐             ┌─────B─────┐
       ┌──B──┐     ┌──B──┐       ┌──B──┐     ┌──B──┐
      ┌─B─┐   T   ┌─B─┐   T     ┌─B─┐   T   ┌─B─┐   T
      T   T       T   T         T   T       T   T
Now we track metrics that might contribute to the cost.
- Number of recursive function calls (nrec)
- Number of times we match on Tip (ntip)
- Number of times we compare an Int (nint)
- Only for foldMap: number of _ <> mempty or mempty <> _ (nmty)
For case 1 we have
| | nrec | ntip | nint | nmty |
|---|---|---|---|---|
| simple | 4n-1 | 2n | 0 | 2n |
| notip | 2n-1 | 2n | 0 | 0 |
| bin1 | 2n-1 | 0 | 2n-1 | 0 |
For case 2 we have
| | nrec | ntip | nint | nmty |
|---|---|---|---|---|
| simple | 6n-1 | 3n | 0 | 3n |
| notip | 3n-1 | 3n | 0 | 0 |
| bin1 | 4n-1 | n | 3n-1 | n |
We can see:
- bin1 costs 0 (case 1) to 33% (case 2) more nrec compared to notip.
- bin1 costs 0 (case 1) to n (case 2) nmty, whereas notip always costs 0.
- bin1 reduces ntip but costs nint. For this to pay off, nint must be sufficiently cheaper than ntip. But if we assume ntip and nint cost the same, bin1 costs 0 (case 1) to 33% (case 2) more ntip+nint compared to notip.
We definitely cannot conclude that bin1 is always the best strategy.
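The nrec column can be sanity-checked with instrumented variants that return a call count instead of a fold result (a sketch; perfect is a made-up helper building the case-1 shape, where every leaf Bin has size 1):

```haskell
-- Minimal Set mirroring the benchmark type.
data Set a = Bin !Int !a !(Set a) !(Set a) | Tip

-- Perfectly balanced tree of the given depth: depth d has 2^d - 1 Bins
-- and 2^(d-1) leaf Bins (case 1 of the analysis).
perfect :: Int -> Set ()
perfect 0 = Tip
perfect d = let s = perfect (d - 1) in Bin (2 ^ d - 1) () s s

-- Calls made by the simple strategy: one per node, Tips included.
callsSimple :: Set a -> Int
callsSimple Tip = 1
callsSimple (Bin _ _ l r) = 1 + callsSimple l + callsSimple r

-- Calls made by the bin1 strategy: size-1 nodes never recurse,
-- so no call ever lands on a Tip in a perfectly balanced tree.
callsBin1 :: Set a -> Int
callsBin1 Tip = 1
callsBin1 (Bin 1 _ _ _) = 1
callsBin1 (Bin _ _ l r) = 1 + callsBin1 l + callsBin1 r
```

For depth 4 (n = 8 leaf Bins) this gives 4n-1 = 31 calls for simple and 2n-1 = 15 for bin1, matching the case-1 row of the table.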
Benchmarks
I benchmarked the implementations on a simple fold (equivalent to length).
Note: I'm using unboxed ints because I want to avoid heap allocations and GHC doesn't seem smart enough to unbox automatically in this case.
simple :: Set a -> Int#
simple (Bin _ _ l r) = simple l +# (1# +# simple r)
simple Tip = 0#

notip :: Set a -> Int#
notip Tip = 0#
notip (Bin sz x l r) = go sz x l r
  where
    go !_sz !_x l r = case l of
      Bin lsz lx ll lr -> case r of
        Bin rsz rx rl rr -> go lsz lx ll lr +# (1# +# go rsz rx rl rr)
        Tip -> go lsz lx ll lr +# 1#
      Tip -> case r of
        Bin rsz rx rl rr -> 1# +# go rsz rx rl rr
        Tip -> 1#

bin1 :: Set a -> Int#
bin1 (Bin 1 _ _ _) = 1#
bin1 (Bin _ _ l r) = bin1 l +# (1# +# bin1 r)
bin1 Tip = 0#
Timings on my machine with GHC 9.10.1:
depth6
case1
simple: OK
283 ns ± 25 ns
notip: OK
159 ns ± 10 ns
bin1: OK
176 ns ± 12 ns
case2
simple: OK
232 ns ± 22 ns
notip: OK
115 ns ± 11 ns
bin1: OK
164 ns ± 11 ns
depth12
case1
simple: OK
17.9 μs ± 381 ns
notip: OK
9.93 μs ± 765 ns
bin1: OK
11.0 μs ± 802 ns
case2
simple: OK
14.6 μs ± 1.3 μs
notip: OK
6.51 μs ± 253 ns
bin1: OK
10.4 μs ± 746 ns
depth18
case1
simple: OK
1.31 ms ± 55 μs
notip: OK
828 μs ± 52 μs
bin1: OK
817 μs ± 53 μs
case2
simple: OK
1.00 ms ± 85 μs
notip: OK
466 μs ± 32 μs
bin1: OK
695 μs ± 66 μs
Full source code
{- cabal:
build-depends: base, deepseq, tasty-bench
-}
{-# LANGUAGE BangPatterns, MagicHash #-}
{-# OPTIONS_GHC -O2 -fproc-alignment=64 #-}
{-# OPTIONS_GHC -ddump-simpl -ddump-cmm -ddump-to-file #-}
import Control.DeepSeq
import System.IO
import Data.Bits
import Test.Tasty.Bench
import GHC.Exts
main :: IO ()
main = do
  defaultMain
    [ bgroup ("depth" <> show d)
        [ env (pure (make1 d)) $ \t -> bgroup "case1"
            [ bench "simple" $ whnf simple' t
            , bench "notip" $ whnf notip' t
            , bench "bin1" $ whnf bin1' t
            ]
        , env (pure (make2 d)) $ \t -> bgroup "case2"
            [ bench "simple" $ whnf simple' t
            , bench "notip" $ whnf notip' t
            , bench "bin1" $ whnf bin1' t
            ]
        ]
    | d <- [6, 12, 18]
    ]
---------------
data Set a = Bin !Int !a !(Set a) !(Set a) | Tip deriving Show
instance NFData a => NFData (Set a) where
  rnf = rwhnf -- good enough for Set Int

make1 :: Int -> Set Int
make1 d = go 0 d
  where
    -- d >= 1
    go !off 1 = Bin 1 off Tip Tip
    go off d = Bin sz x (go off (d-1)) (go (x+1) (d-1))
      where
        lsz = bit (d-1) - 1
        x = off + lsz + 1
        sz = 1 + lsz + lsz

make2 :: Int -> Set Int
make2 d = go 0 d
  where
    -- d >= 2
    go !off 2 = Bin 2 (off+1) (Bin 1 off Tip Tip) Tip
    go off d = Bin sz x (go off (d-1)) (go (x+1) (d-1))
      where
        lsz = bit (d-3) * 3 - 1
        x = off + lsz + 1
        sz = 1 + lsz + lsz
---------------
wrap :: (Set a -> Int#) -> Set a -> ()
wrap f s = case f s of _ -> ()

simple' = wrap simple

simple :: Set a -> Int#
simple (Bin _ _ l r) = simple l +# (1# +# simple r)
simple Tip = 0#

notip' = wrap notip

notip :: Set a -> Int#
notip Tip = 0#
notip (Bin sz x l r) = go sz x l r
  where
    go !_sz !_x l r = case l of
      Bin lsz lx ll lr -> case r of
        Bin rsz rx rl rr -> go lsz lx ll lr +# (1# +# go rsz rx rl rr)
        Tip -> go lsz lx ll lr +# 1#
      Tip -> case r of
        Bin rsz rx rl rr -> 1# +# go rsz rx rl rr
        Tip -> 1#

bin1' = wrap bin1

bin1 :: Set a -> Int#
bin1 (Bin 1 _ _ _) = 1#
bin1 (Bin _ _ l r) = bin1 l +# (1# +# bin1 r)
bin1 Tip = 0#
The results don't disagree with the analysis.
- For case 1, notip and bin1 are pretty much equally fast.
- For case 2, bin1 takes 40-50% more time than notip.
I haven't yet thought about what we should do based on this information, but I hope the analysis and benchmarks are reasonable.
You need to add an UNPACK pragma.
data Set a = Bin {-# UNPACK #-} !Int !a !(Set a) !(Set a) | Tip deriving Show
Edit: removed incorrect numbers.
I'm definitely in agreement with the diagrams; I've considered the same things.
With regards to benchmarking, I wrote up different versions of foldMap for Data.Map proper and used the benchmark suite included.
Without an UNPACK pragma, there's an extra pointer chase.
(I think the last line in the table should be 4n-1, n, 3n-1, n).
You need to add an UNPACK pragma.
It's not necessary, because -funbox-small-strict-fields is enabled by -O.
(I think the last line in the table should be 4n-1, n, 3n-1, n).
True, fixed!
With regards to benchmarking, I wrote up different versions of foldMap for
Data.Mapproper and used the benchmark suite included.
But there are no foldMap benchmarks in https://github.com/haskell/containers/blob/master/containers-tests/benchmarks/Map.hs ?
Regarding your numbers, the allocations metric is surprising. There should be zero allocations, which I can confirm when I run my file above with +RTS -T. Do check that you're running with -O or -O2.
Oh yeah, I added one for testing; I just imported Data.Monoid and used Sum.
Hmm, my numbers above weren't run with all optimisations on.
Checking if the size is 1 takes one comparison and one conditional branch. Checking if both children are Tip requires two comparisons and two conditional branches. I would assume the comparisons are essentially free. Does the extra branch tie up branch prediction circuits?
Hmm. I'm getting significant differences with different versions of GHC.
I'm going to take some time putting together some more numbers (and maybe try to stop doing them on a laptop).
For posterity:
defaultMain
  [ bench "foldMap" $ nf (foldMap Sum) m
  , bench "foldMapWithKey" $ nf (M.foldMapWithKey (const Sum)) m
  ]
So I performed my initial analysis using GHC 8.10.7, where the bin 1 trick was the fastest.
GHC 8.10.7
All
foldMapWithKeySimple: OK
24.4 μs ± 2.2 μs
foldMapWithKeyBin1: OK
18.6 μs ± 1.7 μs
foldMapWithKeyNoTip: OK
21.9 μs ± 1.2 μs
But, in the 9 series, it looks like it has reversed (and the naïve case has improved too; it's closer to the bin 1 check case).
GHC 9.10.1
All
foldMapWithKeySimple: OK
19.4 μs ± 848 ns
foldMapWithKeyBin1: OK
18.8 μs ± 1.6 μs
foldMapWithKeyNoTip: OK
13.1 μs ± 766 ns
This is using the benchmark suite with the following additions:
foldMapWithKey :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKey f = go
  where
    go Tip = mempty
    go (Bin 1 k v _ _) = f k v
    go (Bin _ k v l r) = go l `mappend` (f k v `mappend` go r)
{-# INLINE foldMapWithKey #-}

foldMapWithKeySimple :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeySimple f = go
  where
    go Tip = mempty
    go (Bin _ k v l r) = go l `mappend` (f k v `mappend` go r)
{-# INLINE foldMapWithKeySimple #-}

foldMapWithKeyNoTip :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeyNoTip _ Tip = mempty
foldMapWithKeyNoTip f (Bin _ k0 x0 l0 r0) = go k0 x0 l0 r0
  where
    go k x l r = case (l, r) of
      (Bin _ lk lx ll lr, Bin _ rk rx rl rr) -> go lk lx ll lr <> f k x <> go rk rx rl rr
      (Bin _ lk lx ll lr, Tip) -> go lk lx ll lr <> f k x
      (Tip, Bin _ rk rx rl rr) -> f k x <> go rk rx rl rr
      (Tip, Tip) -> f k x
{-# INLINE foldMapWithKeyNoTip #-}

defaultMain
  [ bench "foldMapWithKeySimple" $ nf (M.foldMapWithKeySimple (const Sum)) m
  , bench "foldMapWithKeyBin1" $ nf (M.foldMapWithKey (const Sum)) m
  , bench "foldMapWithKeyNoTip" $ nf (M.foldMapWithKeyNoTip (const Sum)) m
  ]
I'll redo the filter, map, mapMaybe benchmarks on a later GHC and report back.
These are my numbers with GHC 9.10.
The "simple" cases are those in master currently, which don't use the bin 1 optimisation.
For foldMap, I think we should probably go for using the "no tip" version. It's much faster with a modern GHC.
For map, filter, foldl', and mapMaybe, I think using the Bin 1 optimisation is worth it: it's a small amount of additional (and simple) code, and the improvements are pretty substantial.
What do you think?
foldMapWithKeySimple: OK
19.9 μs ± 1.8 μs
foldMapWithKeyBin1: OK
18.4 μs ± 1.4 μs
foldMapWithKeyNoTip: OK
13.6 μs ± 742 ns
map: OK
32.3 μs ± 2.9 μs
map simple: OK
41.6 μs ± 2.0 μs
filter: OK
47.4 μs ± 2.0 μs
filterSimple: OK
58.5 μs ± 5.4 μs
foldlWithKey': OK
11.5 μs ± 898 ns
foldlWithKey'simple: OK
15.8 μs ± 1.5 μs
mapMaybeWithKey: OK
99.5 μs ± 5.2 μs
mapMaybeWithKeySimple: OK
109 μs ± 7.4 μs
Which is the "no tip" version?
No Tip is Soumik's alternate traversal above.
foldMapWithKeyNoTip :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeyNoTip _ Tip = mempty
foldMapWithKeyNoTip f (Bin _ k0 x0 l0 r0) = go k0 x0 l0 r0
  where
    go k x l r = case (l, r) of
      (Bin _ lk lx ll lr, Bin _ rk rx rl rr) -> go lk lx ll lr <> f k x <> go rk rx rl rr
      (Bin _ lk lx ll lr, Tip) -> go lk lx ll lr <> f k x
      (Tip, Bin _ rk rx rl rr) -> f k x <> go rk rx rl rr
      (Tip, Tip) -> f k x
{-# INLINE foldMapWithKeyNoTip #-}
Prior to GHC 9.4, the Bin 1 technique is faster for foldMap.
It seems to me that GHC 9.4's improvements to pointer tagging (tag inference in particular) make a huge improvement in this case.
GHC 9.2.4
foldMapWithKeySimple: OK
24.6 μs ± 2.0 μs
foldMapWithKeyBin1: OK
17.7 μs ± 1.0 μs
foldMapWithKeyNoTip: OK
19.4 μs ± 607 ns
GHC 9.4.7
foldMapWithKeySimple: OK
19.7 μs ± 924 ns
foldMapWithKeyBin1: OK
18.4 μs ± 870 ns
foldMapWithKeyNoTip: OK
13.0 μs ± 715 ns
It seems plausible that rethinking more traversals along these lines could work well, but they are more complex, and my attempts with foldl' were not fruitful (2x slower than the naïve case, let alone the bin 1 version).
So I think just taking the 25% improvements with the Bin 1 trick for map and filter and getting full benefit of GHC's improvements on foldMap seems like a good compromise.
In terms of this PR, all functions which are touched are improvements (and it doesn't touch foldMap anyway).
I did some analysis of the maps used in the benchmark suite. Because they use the optimised ascending functions, they produce extremely balanced trees... only 1 node has a single child and 50% match the Bin 1 condition.
I'm going to test these metrics against shuffled inputs. But it might be all bullshit (including the initial use of it?).
I'm sorry guys, this has been a real roundabout.
I'm sorry guys, this has been a real roundabout.
More experimentation is good! Don't worry about it.
I did some analysis of the maps used in the benchmark suite. Because they use the optimised ascending functions, they produce extremely balanced trees... only 1 node has a single child and 50% match the Bin 1 condition.
I'm going to test these metrics against shuffled inputs.
Sounds good. I think we should update our Set/Map benchmarks to run on randomly structured trees instead of the very specific fromList [1..n] wherever appropriate. It would be more representative of an "average" case.
Creating the Set/Map from a shuffled list should do it, as you said. I can make a PR for it this weekend. If you want to beat me to it, go for it. I added a few random tests in #1056, some of that code will be useful.
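A deterministic sketch of building such a randomly structured map, using a cheap multiplicative hash as a stand-in for a real shuffle (pseudoShuffle and shuffledMap are made-up names, not benchmark-suite code):

```haskell
import qualified Data.Map.Strict as M
import Data.List (sortOn)

-- Order keys by a multiplicative hash so fromList sees an unsorted
-- sequence and builds the tree by ordinary inserts rather than the
-- optimised ascending-input path. A real benchmark would use a proper RNG.
pseudoShuffle :: [Int] -> [Int]
pseudoShuffle = sortOn (\k -> (k * 2654435761) `mod` 4294967296)

shuffledMap :: Int -> M.Map Int Int
shuffledMap n = M.fromList [(k, k) | k <- pseudoShuffle [1 .. n]]
```

The resulting map has the same contents as fromList [1..n], but its internal tree shape comes from rebalancing during unordered inserts, which is closer to an "average" case.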
Just a drive-by comment: if this general path ends up being a win overall, then it may be worth considering a redesign of the datatype to avoid or reduce the occurrence of Tip in the first place. (Of course, doing so would require a whole lot of benchmarking to make sure to avoid regressions elsewhere.) I did a lot of that sort of datatype redesign when developing my bytestring-trie, and it can yield significant wins so long as the algorithms can avoid a proliferation of special-case branching.
That's a good idea, I've opened #1073 to track it.