Faster Eq and Ord
The following operations are implemented by converting to lists and comparing.
- [x] Set == (improved in #1017)
- [x] Set compare (same as above)
- [x] Map == (same as above)
- [x] Map compare (same as above)
- [ ] IntSet compare
- [ ] IntMap compare
- [x] Seq == (#1035)
- [x] Seq compare (same as above)
This is not the best we can do for a couple of reasons:
- No specialization. Eq and Ord for lists don't specialize (but there are a few manually specified exceptions).
- No list fusion. We materialize both lists.
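For reference, the current approach amounts to the following (a sketch, not the exact library source):

import qualified Data.Set as S

-- Roughly what the current Ord instance for Set does today: materialize
-- both sets as ascending lists and compare those lexicographically.
setCompareViaList :: Ord a => S.Set a -> S.Set a -> Ordering
setCompareViaList s1 s2 = compare (S.toAscList s1) (S.toAscList s2)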
I tried a couple of alternatives and saw decent improvements:
Set Word
compare: OK
3.03 ms ± 189 μs, 11 MB allocated, 1.8 MB copied, 23 MB peak memory
setCmp: OK
1.64 ms ± 143 μs, 8.4 MB allocated, 583 KB copied, 23 MB peak memory, 0.54x
setCmp2: OK
947 μs ± 68 μs, 5.3 MB allocated, 795 B copied, 23 MB peak memory, 0.31x
-- Current version but fused with the first set
setCmp :: Ord a => Set a -> Set a -> Ordering
setCmp xs ys0 = S.foldr f z xs (S.toList ys0)
  where
    f x k = oneShot $ \ys -> case ys of
      [] -> GT
      y:ys' -> compare x y <> k ys'
    z ys = case ys of
      [] -> EQ
      _ -> LT
{-# INLINABLE setCmp #-}
-- Somewhat like a traverse with StateT [a] (Either Bool)
setCmp2 :: Ord a => Set a -> Set a -> Ordering
setCmp2 xs ys0 = case go xs (S.toList ys0) of
    WithO o ys' ->
      o <> case ys' of
        [] -> EQ
        _ -> LT
  where
    go Tip ys = WithO EQ ys
    go (Bin _ x l r) ys = go l ys >>- f >>- go r
      where
        f [] = WithO GT []
        f (y:ys') = WithO (compare x y) ys'
    ox@(WithO o x) >>- f = case o of
      EQ -> f x
      _ -> ox
{-# INLINABLE setCmp2 #-}

data WithO a = WithO !Ordering a
Benchmark code
import Test.Tasty.Bench
import qualified Data.Set as S
import Data.Set.Internal (Set(..))
import GHC.Exts (oneShot)
main :: IO ()
main = defaultMain
  [ env (pure (S.fromList xs_)) $ \xs -> bgroup "Set Word"
      [ bench "eq" $ whnf (\ys -> ys == ys) xs
      , bench "compare" $ whnf (\ys -> compare ys ys) xs
      , bcompare "$0 == \"All.Set Word.eq\"" $ bench "eq new" $ whnf (\ys -> setEq ys ys) xs
      , bcompare "$0 == \"All.Set Word.compare\"" $ bench "setCmp" $ whnf (\ys -> setCmp ys ys) xs
      , bcompare "$0 == \"All.Set Word.compare\"" $ bench "setCmp2" $ whnf (\ys -> setCmp2 ys ys) xs
      ]
  ]
  where
    xs_ = [1..100000] :: [Word]
setEq :: Eq a => Set a -> Set a -> Bool
setEq xs ys0 = S.foldr f z xs (S.toList ys0)
  where
    f x k = oneShot $ \ys -> case ys of
      [] -> False
      y:ys' -> x == y && k ys'
    z ys = case ys of
      [] -> True
      _ -> False

-- Current version but fused with the first set
setCmp :: Ord a => Set a -> Set a -> Ordering
setCmp xs ys0 = S.foldr f z xs (S.toList ys0)
  where
    f x k = oneShot $ \ys -> case ys of
      [] -> GT
      y:ys' -> compare x y <> k ys'
    z ys = case ys of
      [] -> EQ
      _ -> LT

-- Somewhat like a traverse with StateT [a] (Either Bool)
setCmp2 :: Ord a => Set a -> Set a -> Ordering
setCmp2 xs ys0 = case go xs (S.toList ys0) of
    WithO o ys' ->
      o <> case ys' of
        [] -> EQ
        _ -> LT
  where
    go Tip ys = WithO EQ ys
    go (Bin _ x l r) ys = go l ys >>- f >>- go r
      where
        f [] = WithO GT []
        f (y:ys') = WithO (compare x y) ys'
    ox@(WithO o x) >>- f = case o of
      EQ -> f x
      _ -> ox

data WithO a = WithO !Ordering a
For now, converting the second set to a list is not avoided.
- For IntSet and IntMap it could be avoided by utilizing the natural structure (#470, #787). That implementation will likely be quite different from the above style.
- For Set and Map it could be avoided by implementing an "iterator" on the tree with a stack, but I'm not sure if this will be much better. I will try this out later.
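For concreteness, the iterator idea for Set might look something like this (a hypothetical sketch; Iter, iterDown, and setCmpIter are made-up names, and Tip/Bin come from Data.Set.Internal):

-- The iterator is a left-spine stack: each frame holds an element and the
-- right subtree still to be visited.
data Iter a = Nil | Push !a !(Set a) !(Iter a)

-- Push the left spine of a tree onto the stack.
iterDown :: Set a -> Iter a -> Iter a
iterDown Tip stk = stk
iterDown (Bin _ x l r) stk = iterDown l (Push x r stk)

-- Compare two sets element by element without building lists.
setCmpIter :: Ord a => Set a -> Set a -> Ordering
setCmpIter xs ys = go (iterDown xs Nil) (iterDown ys Nil)
  where
    go Nil Nil = EQ
    go Nil _ = LT
    go _ Nil = GT
    go (Push x r1 s1) (Push y r2 s2) =
      compare x y <> go (iterDown r1 s1) (iterDown r2 s2)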
Using the natural structure would probably be a really fun little project.
Using the natural structure would probably be a really fun little project.
I believe @jwaldmann is looking into it.
For Set and Map it could be avoided by implementing an "iterator" on the tree with a stack, but I'm not sure if this will be much better. I will try this out later.
It did turn out to be a little faster. I also realized setCmp2 can be written as a foldMap. I'll tidy things up into a PR for Set and Map.
Missed Seq, will have to look at that too...
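A sketch of what the foldMap formulation could look like (CmpM and setCmpFoldMap are placeholder names, reusing WithO from above):

-- The monoid threads the remaining list of the second set and
-- short-circuits once the ordering is decided, just like (>>-) above.
newtype CmpM a = CmpM { runCmpM :: [a] -> WithO [a] }

instance Semigroup (CmpM a) where
  CmpM g <> CmpM h = CmpM $ \ys -> case g ys of
    r@(WithO o ys') -> case o of
      EQ -> h ys'
      _ -> r

instance Monoid (CmpM a) where
  mempty = CmpM (WithO EQ)

setCmpFoldMap :: Ord a => Set a -> Set a -> Ordering
setCmpFoldMap xs ys0 = case runCmpM (foldMap step xs) (S.toList ys0) of
    WithO o ys' -> o <> if null ys' then EQ else LT
  where
    step x = CmpM $ \ys -> case ys of
      [] -> WithO GT []
      y:ys' -> WithO (compare x y) ys'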
Hmm... Seq could be interesting, maybe. While a right fold would be the thing to beat, it might be worth doing an experiment or two with the zipping machinery in case that turns out surprisingly fast for some reason.
I tried a few options for Seq.
compare_toList :: Ord a => Seq a -> Seq a -> Ordering
compare_toList xs ys = cmpList (toList xs) (toList ys)
{-# INLINABLE compare_toList #-}

cmpList :: Ord a => [a] -> [a] -> Ordering
cmpList [] [] = EQ
cmpList [] (_:_) = LT
cmpList (_:_) [] = GT
cmpList (x:xs) (y:ys) = compare x y <> cmpList xs ys
{-# INLINABLE cmpList #-}

compare_leftFused :: Ord a => Seq a -> Seq a -> Ordering
compare_leftFused xs ys0 = foldr f z xs (toList ys0)
  where
    -- btw, oneShot doesn't help
    f x k ys = case ys of
      [] -> GT
      y:ys' -> compare x y <> k ys'
    z [] = EQ
    z (_:_) = LT
{-# INLINABLE compare_leftFused #-}

compare_foldMapOrdM :: Ord a => Seq a -> Seq a -> Ordering
compare_foldMapOrdM xs ys0 = case runOrdM (foldMap f xs) (toList ys0) of
    o :*: ys -> o <> if F.null ys then EQ else LT
  where
    f x = OrdM $ \ys -> case ys of
      [] -> GT :*: []
      y:ys' -> compare x y :*: ys'
{-# INLINABLE compare_foldMapOrdM #-}
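The OrdM monoid and the strict pair (:*:) aren't shown above; a reconstruction consistent with how they're used would be something like the following (SP is a placeholder name; containers' Utils.Containers.Internal.StrictPair has a similar type, and toList/F refer to Data.Foldable):

infixr 1 :*:
data SP a b = !a :*: b

newtype OrdM a = OrdM { runOrdM :: [a] -> SP Ordering [a] }

-- Run the left action, and only continue with the right one while the
-- accumulated ordering is still EQ; otherwise short-circuit.
instance Semigroup (OrdM a) where
  OrdM g <> OrdM h = OrdM $ \ys -> case g ys of
    r@(o :*: ys') -> case o of
      EQ -> h ys'
      _ -> r

instance Monoid (OrdM a) where
  mempty = OrdM (EQ :*:)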
Here are the benchmark results (GHC 9.6.3). The compared sequences are [1..m] and [1..n].
Show benchmarks
compare
100-100
compare: OK
1.53 μs ± 113 ns, 7.5 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
879 ns ± 86 ns, 7.2 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
907 ns ± 46 ns, 5.0 KB allocated, 0 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.02 μs ± 86 ns, 6.0 KB allocated, 0 B copied, 10 MB peak memory
10000-10000
compare: OK
163 μs ± 13 μs, 748 KB allocated, 260 B copied, 10 MB peak memory
compare_toList: OK
94.4 μs ± 5.5 μs, 719 KB allocated, 236 B copied, 10 MB peak memory
compare_leftFused: OK
93.6 μs ± 6.1 μs, 502 KB allocated, 160 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
109 μs ± 5.6 μs, 592 KB allocated, 134 B copied, 10 MB peak memory
100-10000
compare: OK
1.57 μs ± 111 ns, 7.8 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
887 ns ± 87 ns, 7.5 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
936 ns ± 92 ns, 5.2 KB allocated, 0 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.10 μs ± 90 ns, 6.3 KB allocated, 0 B copied, 10 MB peak memory
10000-100
compare: OK
1.66 μs ± 97 ns, 7.7 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
919 ns ± 82 ns, 7.5 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
933 ns ± 89 ns, 5.1 KB allocated, 0 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.06 μs ± 96 ns, 6.0 KB allocated, 0 B copied, 10 MB peak memory
It seems that compare_toList and compare_leftFused work best. The latter allocates less but is somehow slightly slower. So I'm not sure which of the two to pick.
The Core output is also too massive to read and reason about.
Other notes:
- An iterator (like we did for Set and Map) is not possible because of the non-regular recursion in FingerTree. Or at least, I can't think of a way to do it that would keep the performance benefits.
- The zip-like approach works outside-in. This wouldn't work for Ord because we need left-to-right comparison, but it could be done for Eq. I don't think it's worth it, though, because all the splitting would have a cost. For zip-like functions the trade-off may be worth it to get faster indexing without evaluating where not required, but that doesn't apply here.
I reran the benchmarks with -O (the benchmark suite is set to use -O2), and compare_toList is clearly faster than compare_leftFused even though it allocates more. This is convincing, so I'll make a PR for compare_toList.
Show benchmarks
compare
100-100
compare: OK
1.55 μs ± 112 ns, 7.5 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
890 ns ± 88 ns, 7.5 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
1.34 μs ± 102 ns, 6.4 KB allocated, 1 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.11 μs ± 99 ns, 6.1 KB allocated, 0 B copied, 10 MB peak memory
10000-10000
compare: OK
169 μs ± 15 μs, 748 KB allocated, 257 B copied, 10 MB peak memory
compare_toList: OK
104 μs ± 8.2 μs, 751 KB allocated, 257 B copied, 10 MB peak memory
compare_leftFused: OK
141 μs ± 12 μs, 643 KB allocated, 218 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
116 μs ± 5.3 μs, 608 KB allocated, 140 B copied, 10 MB peak memory
100-10000
compare: OK
1.57 μs ± 97 ns, 7.8 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
954 ns ± 48 ns, 7.8 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
1.44 μs ± 87 ns, 6.7 KB allocated, 1 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.20 μs ± 96 ns, 6.5 KB allocated, 0 B copied, 10 MB peak memory
10000-100
compare: OK
1.66 μs ± 92 ns, 7.7 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
983 ns ± 72 ns, 7.8 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
1.40 μs ± 114 ns, 6.6 KB allocated, 1 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.15 μs ± 85 ns, 6.1 KB allocated, 0 B copied, 10 MB peak memory
I've come to the realization that the non-regular structure of Seq is terrible for the performance of folds. All the nesting and unnesting of closures, required only to make it typecheck, increases allocations and disables optimizations like arity analysis. Currently the folds are unrolled one level (thanks to #510), but the more we unroll the better it performs, with diminishing returns.
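To illustrate the nesting: each Deep level must lift the element function to work on Node, allocating a fresh closure per level just to satisfy the types. A simplified sketch of the shape (assuming the Foldable instances for Digit and Node from Data.Sequence.Internal):

-- At each level, the element function f is wrapped into a new closure
-- (\nd z' -> foldr f z' nd) so that the recursive call typechecks at
-- FingerTree (Node b).
foldrFT :: (b -> c -> c) -> c -> FingerTree b -> c
foldrFT f z EmptyT = z
foldrFT f z (Single x) = f x z
foldrFT f z (Deep _ pr m sf) =
  foldr f (foldrFT (\nd z' -> foldr f z' nd) (foldr f z sf) m) pr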
If we can switch to a regular structure,
  data FingerTree a
    = EmptyT
-   | Single a
-   | Deep {-# UNPACK #-} !Int !(Digit a) (FingerTree (Node a)) !(Digit a)
+   | Single (Tree a)
+   | Deep {-# UNPACK #-} !Int !(Digit (Tree a)) (FingerTree a) !(Digit (Tree a))
+
+ data Tree a
+   = Leaf a
+   | Node2 !Int a a
+   | Node3 !Int a a a
with the invariants maintained manually, we won't have this problem.
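For contrast, a fold over the proposed Tree would be plain first-order recursion, with the same element function at every level (a sketch over the hypothetical type above):

-- No lifting needed: the element type stays `a` all the way down.
foldrTree :: (a -> b -> b) -> b -> Tree a -> b
foldrTree f z (Leaf a) = f a z
foldrTree f z (Node2 _ x y) = foldrTree f (foldrTree f z y) x
foldrTree f z (Node3 _ x y w) = foldrTree f (foldrTree f (foldrTree f z w) y) x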
But is it worth sacrificing type-safety for efficiency? Or is there a way to have both?
I don't want to be diving into this right now, so I will go with the compare_toList implementation as far as this issue is concerned.
What you gain in fold performance gets lost elsewhere, I believe. A technique I used somewhere (priority queues, I think) is to start by making a GADT to represent depth in the tree while also supplying type evidence. Then that GADT can be turned into a fake GADT to improve efficiency, using unsafeCoerce and nice pattern synonyms. This would be a bit different, because we don't use a higher order nested datatype; I'm not totally sure if the same technique would apply. I kind of suspect it would though. Something vaguely like
data Foo a where
  This :: Foo (Elem a)
  That :: !(Foo a) -> Foo (Node a)
What you gain in fold performance gets lost elsewhere, I believe.
Yes that's possible. I will still want to try it out at some point.
A technique I used somewhere (priority queues, I think) is to start by making a GADT to represent depth in the tree while also supplying type evidence. Then that GADT can be turned into a fake GADT to improve efficiency, using
unsafeCoerce and nice pattern synonyms.
I was thinking about using GADTs too, but what is a fake GADT?
It's when you play make-believe with the type checker. Here's the code I mean.
Let me expand on the GADT approach a bit. Using a real GADT, we could do this:
data Where a n where
  Here :: Where a (Elem a)
  There :: !(Where a n) -> Where a (Node n)

foldMapBlob :: Monoid m => Where a t -> (a -> m) -> t -> m
foldMapBlob Here f (Elem a) = f a
foldMapBlob (There w) f (Node2 _ x y) = foldMapBlob w f x <> foldMapBlob w f y
foldMapBlob (There w) f (Node3 _ x y z) = foldMapBlob w f x <> foldMapBlob w f y <> foldMapBlob w f z

foldMapFT :: Monoid m => Where a t -> (a -> m) -> FingerTree t -> m
foldMapFT !_ _ EmptyT = mempty
foldMapFT w f (Single t) = foldMapBlob w f t
foldMapFT w f (Deep _ pr m sf) =
  foldMap (foldMapBlob w f) pr
    <> foldMapFT (There w) f m
    <> foldMap (foldMapBlob w f) sf

foldMapSeq :: Monoid m => (a -> m) -> Seq a -> m
foldMapSeq f (Seq t) = foldMapFT Here f t
We could use a fake GADT specialized to Elem and Node, but that would have to live in Data.Sequence.Internal. Alternatively, we could factor it out into its own (internally dangerous, but externally safe) module, by generalizing to (the faked version of)
data Where el no a n where
  Here :: Where el no a (el a)
  There :: !(Where el no a n) -> Where el no a (no n)

foldMapBlob :: Monoid m => Where Elem Node a t -> (a -> m) -> t -> m
foldMapBlob Here f (Elem a) = f a
foldMapBlob (There w) f (Node2 _ x y) = foldMapBlob w f x <> foldMapBlob w f y
foldMapBlob (There w) f (Node3 _ x y z) = foldMapBlob w f x <> foldMapBlob w f y <> foldMapBlob w f z

foldMapFT :: Monoid m => Where Elem Node a t -> (a -> m) -> FingerTree t -> m
foldMapFT !_ _ EmptyT = mempty
foldMapFT w f (Single t) = foldMapBlob w f t
foldMapFT w f (Deep _ pr m sf) =
  foldMap (foldMapBlob w f) pr
    <> foldMapFT (There w) f m
    <> foldMap (foldMapBlob w f) sf

foldMapSeq :: Monoid m => (a -> m) -> Seq a -> m
foldMapSeq f (Seq t) = foldMapFT Here f t
The main efficiency advantage of the nested type approach is that the leaves (Elem) don't have any structure; they're just pointers to the elements themselves. The tractable alternative would be to use a GADT (I don't think it's at all a good idea to stop using the type system to enforce tree shape). If we do that, we can recover compactness by switching to specialized leaves. Roughly speaking,
data Leaf a = Leaf2 a a | Leaf3 a a a
This possibility has been discussed in the issue tracker before, somewhere. It's totally plausible, but
- Code size for some of the most critical operations will go up.
- It's an awful lot of human work to port all the code to such a representation.
(Please don't actually use Where, Here, and There. Those were just what came to mind in the moment.)
... All the nesting and unnesting of closures, required only to make it typecheck,
(half jokingly) then fix GHC, instead of uglifying the code ...
what are the real-life applications of Data.Sequence that suffer from this inefficiency?
My main use case for Data.Sequence is: to show the source code to students, as an example where typing enforces balance ...
I use Data.Sequence in real code from time to time but it's not critical for performance (in the cases that I profiled).
Perhaps this is because I do not use Eq and Ord of containers much at all. These instances are certainly nice to have, e.g., to quickly use a Set as a key for a Map, but I found this is too expensive (e.g., for the powerset construction of an automaton), and needs to be replaced with some form of hashing, so the key of the hot Map is Int.
This is also why I am a bit slow with #787 (it is still on my list but other stuff is more urgent right now, sorry)
Let me expand on the GADT approach a bit. Using a real GADT, we could do this:
Nice! Why don't we do this? I expect this would be more efficient than what we have, even without the unsafe trick.
The main efficiency advantage of the nested type approach is that the leaves (
Elem) don't have any structure; they're just pointers to the elements themselves.
Yes, I can see this being a disadvantage of changing the representation.
... I found this is too expensive (e.g., for the powerset construction of an automaton), and needs to be replaced with some form of hashing, so the key of the hot Map is Int.
Hashing is a fold after all, so you stand to benefit from improved fold functions ;)
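For instance, hashing a Set is just a strict fold over its elements (a sketch; hashSet is a made-up name, not a containers API):

import Data.Bits (xor)
import qualified Data.Set as S

-- Combine elements with an FNV-1a-style step; any fold speedup
-- directly speeds this up.
hashSet :: S.Set Int -> Int
hashSet = S.foldl' (\h x -> (h `xor` x) * 16777619) 2166136261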
We've never used GADTs in containers. It's part of the tradition of trying to keep it plausibly portable to potential future Haskell implementations that don't have such fancy type systems as GHC. Is it a tradition worth upholding? I'm not always the best at making such decisions.
I think the unsafe trick is totally fine, as long as its limitations are respected to avoid the risk of nasal demons. That mostly means being very careful not to overflow the counter. Because sequences are weird, it's probably safest to do an overflow check for increment even on 64-bit systems, though there are fold contexts where that's overkill.
(Sorry, just passing by)
The only other modern implementation of Haskell is MicroHs, and it claims to support GADTs already.
@Bodigrim , are people using MicroHs? If so, we should be sure to make containers work on it!
I think the unsafe trick is totally fine, as long as its limitations are respected to avoid the risk of nasal demons. That mostly means being very careful not to overflow the counter. Because sequences are weird, it's probably safest to do an overflow check for increment even on 64-bit systems, though there are fold contexts where that's overkill.
The counter only goes up to the depth of the tree, which is O(log n), so I don't think we need to worry about overflow.
We've never used GADTs in containers. It's part of the tradition of trying to keep it plausibly portable to potential future Haskell implementations that don't have such fancy type systems as GHC.
We could have a separate definition for GHC under some CPP. Or, if we don't want to maintain two definitions, instead of applying unsafeCoerce in the fake GADT we could use it to directly coerce the thing in question to a Node or Elem. Do we consider that unsafeCoerce is available in non-GHC compilers?
I think the unsafe trick is totally fine, as long as its limitations are respected to avoid the risk of nasal demons. That mostly means being very careful not to overflow the counter. Because sequences are weird, it's probably safest to do an overflow check for increment even on 64-bit systems, though there are fold contexts where that's overkill.
The counter only goes up to the depth of the tree, which is O(log n), so I don't think we need to worry about overflow.
That particular bit of reasoning has a false premise: that tree depth is certain never to exceed maxBound @Word. While we assume throughout that it does not exceed maxBound @Int, leaving it to the user to enforce that restriction, we can't be so carefree about actual memory safety. And it's very easy to violate: consider replicate maxBound () *> replicate maxBound (), a sequence with maxBound^2 elements that doesn't even occupy much memory. So I'm certain we have to be concerned about this for 32-bit systems, and for 64-bit ones, the only question is how long it would take for a very fast computer to overflow the counter.
We've never used GADTs in containers. It's part of the tradition of trying to keep it plausibly portable to potential future Haskell implementations that don't have such fancy type systems as GHC.
We could have a separate definition for GHC under some CPP. Or, if we don't want to maintain two definitions, instead of applying unsafeCoerce in the fake GADT we could use it to directly coerce the thing in question to a Node or Elem. Do we consider that unsafeCoerce is available in non-GHC compilers?
Directly coercing the thing in question leads to quite a large "trusted base"—random unsafe coercions scattered around already intricate code. Let's not go there. CPP would suck: quite a lot of code is involved once this trick is used everywhere that would benefit from it. That includes pretty much everything that reaches deep into trees: maps, folds, traversals, zips, indexing, and I imagine even the mind-warping inits, tails, and Applicative methods. I think we pretty much have to decide to commit to GADTs or not. ISTR @ekmett had opinions on that.
Hold on, I missed a logarithm, didn't I? Let me think about that again; you're probably right.
Hold on, I missed a logarithm, didn't I? Let me think about that again; you're probably right.
No, you're right, I can see this being a problem. We need O(2^maxBound) elements, which could be done by repeatedly appending a sequence to itself maxBound times. Feasible for 32-bit systems? So we will need a bounds check.
Directly coercing the thing in question leads to quite a large "trusted base"—random unsafe coercions scattered around already intricate code. Let's not go there.
Understandable... it's not an appealing option.
Repeated appending risks multiplication by doubling, which could definitely be an issue on 32-bit systems. But repeated *> is much worse; it risks exponentiation by squaring! We've thought about adding bounds checks to the module in the past, but AFAIK no one's taken the time to investigate the performance impact.
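A toy sketch of the two growth modes (double and square are made-up names):

import Data.Sequence (Seq)

-- length (double s) = 2 * length s: multiplication by doubling.
double :: Seq a -> Seq a
double s = s <> s

-- length (square s) = length s ^ 2: exponentiation by squaring, so
-- iterating this n times on a k-element sequence yields k^(2^n) elements
-- while the representation stays small thanks to sharing.
square :: Seq a -> Seq a
square s = s *> s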
@Bodigrim, are people using MicroHs? If so, we should be sure to make containers work on it!
I like the project a lot (Haskell on microcontrollers!), but I think it will take some time before compatibility with MicroHs becomes a practical question. It's in its early days.
Regarding the alternate representation,
This possibility has been discussed in the issue tracker before, somewhere.
Found it: #91. It looks type-safe but also needs GADTs.
I also realized something neat about these two representations. It comes down to the question: how do we know when we reach a leaf with an element?
- For #91 (or https://github.com/haskell/containers/issues/1016#issuecomment-2336465956), the leaves are identified by their constructor tag, which is very simple (setting aside the specialized leaves).
- For the current representation, that is not the case. Instead, when going down the tree we must do some work and keep track of the depth using some mechanism (closures, GADT, fake GADT) to know when to stop.
So, this is a good old space-vs-time trade-off.
Paying the space to simplify/speed up a lot of other things seems like a good deal to me personally. But it remains to be seen how well it would work in practice.
Space typically has a huge impact on time. I suspect you'll find that trading compactness for speed will get you neither. But you're welcome to try.
Closing this issue. The Seq instances may improve further if #1036 gets done, but that's a question for later.