Faster Eq and Ord
The following operations are implemented by converting to lists and comparing.
- [x] Set == (improved in #1017)
- [x] Set compare (same as above)
- [x] Map == (same as above)
- [x] Map compare (same as above)
- [ ] IntSet compare
- [ ] IntMap compare
- [x] Seq == (#1035)
- [x] Seq compare (same as above)
This is not the best we can do for a couple of reasons:
- No specialization. Eq and Ord for lists don't specialize (but there are a few manually specified exceptions).
- No list fusion. We materialize both lists.
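For reference, the current approach amounts to the following (a sketch, not the exact library source):

import qualified Data.Set as S

-- Roughly what the current Ord instance for Set does today: materialize
-- both sets as ascending lists and compare those lexicographically.
setCompareViaList :: Ord a => S.Set a -> S.Set a -> Ordering
setCompareViaList s1 s2 = compare (S.toAscList s1) (S.toAscList s2)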
I tried a couple of alternatives and saw decent improvements:
Set Word
compare: OK
3.03 ms ± 189 μs, 11 MB allocated, 1.8 MB copied, 23 MB peak memory
setCmp: OK
1.64 ms ± 143 μs, 8.4 MB allocated, 583 KB copied, 23 MB peak memory, 0.54x
setCmp2: OK
947 μs ± 68 μs, 5.3 MB allocated, 795 B copied, 23 MB peak memory, 0.31x
-- Current version but fused with the first set
setCmp :: Ord a => Set a -> Set a -> Ordering
setCmp xs ys0 = S.foldr f z xs (S.toList ys0)
  where
    f x k = oneShot $ \ys -> case ys of
      [] -> GT
      y:ys' -> compare x y <> k ys'
    z ys = case ys of
      [] -> EQ
      _ -> LT
{-# INLINABLE setCmp #-}
-- Somewhat like a traverse with StateT [a] (Either Bool)
setCmp2 :: Ord a => Set a -> Set a -> Ordering
setCmp2 xs ys0 = case go xs (S.toList ys0) of
    WithO o ys' ->
      o <> case ys' of
        [] -> EQ
        _ -> LT
  where
    go Tip ys = WithO EQ ys
    go (Bin _ x l r) ys = go l ys >>- f >>- go r
      where
        f [] = WithO GT []
        f (y:ys') = WithO (compare x y) ys'
    ox@(WithO o x) >>- f = case o of
      EQ -> f x
      _ -> ox
{-# INLINABLE setCmp2 #-}

data WithO a = WithO !Ordering a
Benchmark code
import Test.Tasty.Bench
import qualified Data.Set as S
import Data.Set.Internal (Set(..))
import GHC.Exts (oneShot)
main :: IO ()
main = defaultMain
  [ env (pure (S.fromList xs_)) $ \xs -> bgroup "Set Word"
      [ bench "eq" $ whnf (\ys -> ys == ys) xs
      , bench "compare" $ whnf (\ys -> compare ys ys) xs
      , bcompare "$0 == \"All.Set Word.eq\"" $ bench "eq new" $ whnf (\ys -> setEq ys ys) xs
      , bcompare "$0 == \"All.Set Word.compare\"" $ bench "setCmp" $ whnf (\ys -> setCmp ys ys) xs
      , bcompare "$0 == \"All.Set Word.compare\"" $ bench "setCmp2" $ whnf (\ys -> setCmp2 ys ys) xs
      ]
  ]
  where
    xs_ = [1..100000] :: [Word]
setEq :: Eq a => Set a -> Set a -> Bool
setEq xs ys0 = S.foldr f z xs (S.toList ys0)
  where
    f x k = oneShot $ \ys -> case ys of
      [] -> False
      y:ys' -> x == y && k ys'
    z ys = case ys of
      [] -> True
      _ -> False

-- Current version but fused with the first set
setCmp :: Ord a => Set a -> Set a -> Ordering
setCmp xs ys0 = S.foldr f z xs (S.toList ys0)
  where
    f x k = oneShot $ \ys -> case ys of
      [] -> GT
      y:ys' -> compare x y <> k ys'
    z ys = case ys of
      [] -> EQ
      _ -> LT

-- Somewhat like a traverse with StateT [a] (Either Bool)
setCmp2 :: Ord a => Set a -> Set a -> Ordering
setCmp2 xs ys0 = case go xs (S.toList ys0) of
    WithO o ys' ->
      o <> case ys' of
        [] -> EQ
        _ -> LT
  where
    go Tip ys = WithO EQ ys
    go (Bin _ x l r) ys = go l ys >>- f >>- go r
      where
        f [] = WithO GT []
        f (y:ys') = WithO (compare x y) ys'
    ox@(WithO o x) >>- f = case o of
      EQ -> f x
      _ -> ox

data WithO a = WithO !Ordering a
For now, converting the second set to a list is not avoided.
- For IntSet and IntMap it could be avoided by utilizing the natural structure (#470, #787). That implementation will likely be quite different from the above style.
- For Set and Map it could be avoided by implementing an "iterator" on the tree with a stack, but I'm not sure if this will be much better. I will try this out later.
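For concreteness, the iterator idea for Set might look something like this (a hypothetical sketch; Iter, iterDown, and setCmpIter are made-up names, and Tip/Bin come from Data.Set.Internal):

-- The iterator is a left-spine stack: each frame holds an element and the
-- right subtree still to be visited.
data Iter a = Nil | Push !a !(Set a) !(Iter a)

-- Push the left spine of a tree onto the stack.
iterDown :: Set a -> Iter a -> Iter a
iterDown Tip stk = stk
iterDown (Bin _ x l r) stk = iterDown l (Push x r stk)

-- Compare two sets element by element without building lists.
setCmpIter :: Ord a => Set a -> Set a -> Ordering
setCmpIter xs ys = go (iterDown xs Nil) (iterDown ys Nil)
  where
    go Nil Nil = EQ
    go Nil _ = LT
    go _ Nil = GT
    go (Push x r1 s1) (Push y r2 s2) =
      compare x y <> go (iterDown r1 s1) (iterDown r2 s2)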
Using the natural structure would probably be a really fun little project.
Using the natural structure would probably be a really fun little project.
I believe @jwaldmann is looking into it.
For Set and Map it could be avoided by implementing an "iterator" on the tree with a stack, but I'm not sure if this will be much better. I will try this out later.
It did turn out to be a little faster. I also realized setCmp2 can be written as a foldMap. I'll tidy things up into a PR for Set and Map.
Missed Seq, will have to look at that too...
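A sketch of what the foldMap formulation could look like (CmpM and setCmpFoldMap are placeholder names, reusing WithO from above):

-- The monoid threads the remaining list of the second set and
-- short-circuits once the ordering is decided, just like (>>-) above.
newtype CmpM a = CmpM { runCmpM :: [a] -> WithO [a] }

instance Semigroup (CmpM a) where
  CmpM g <> CmpM h = CmpM $ \ys -> case g ys of
    r@(WithO o ys') -> case o of
      EQ -> h ys'
      _ -> r

instance Monoid (CmpM a) where
  mempty = CmpM (WithO EQ)

setCmpFoldMap :: Ord a => Set a -> Set a -> Ordering
setCmpFoldMap xs ys0 = case runCmpM (foldMap step xs) (S.toList ys0) of
    WithO o ys' -> o <> if null ys' then EQ else LT
  where
    step x = CmpM $ \ys -> case ys of
      [] -> WithO GT []
      y:ys' -> WithO (compare x y) ys'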
Hmm... Seq could be interesting, maybe. While a right fold would be the thing to beat, it might be worth doing an experiment or two with the zipping machinery in case that turns out surprisingly fast for some reason.
I tried a few options for Seq.
compare_toList :: Ord a => Seq a -> Seq a -> Ordering
compare_toList xs ys = cmpList (toList xs) (toList ys)
{-# INLINABLE compare_toList #-}

cmpList :: Ord a => [a] -> [a] -> Ordering
cmpList [] [] = EQ
cmpList [] (_:_) = LT
cmpList (_:_) [] = GT
cmpList (x:xs) (y:ys) = compare x y <> cmpList xs ys
{-# INLINABLE cmpList #-}

compare_leftFused :: Ord a => Seq a -> Seq a -> Ordering
compare_leftFused xs ys0 = foldr f z xs (toList ys0)
  where
    -- btw, oneShot doesn't help
    f x k ys = case ys of
      [] -> GT
      y:ys' -> compare x y <> k ys'
    z [] = EQ
    z (_:_) = LT
{-# INLINABLE compare_leftFused #-}

compare_foldMapOrdM :: Ord a => Seq a -> Seq a -> Ordering
compare_foldMapOrdM xs ys0 = case runOrdM (foldMap f xs) (toList ys0) of
    o :*: ys -> o <> if F.null ys then EQ else LT
  where
    f x = OrdM $ \ys -> case ys of
      [] -> GT :*: []
      y:ys' -> compare x y :*: ys'
{-# INLINABLE compare_foldMapOrdM #-}
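The OrdM monoid and the strict pair (:*:) aren't shown above; a reconstruction consistent with how they're used would be something like the following (SP is a placeholder name; containers' Utils.Containers.Internal.StrictPair has a similar type, and toList/F refer to Data.Foldable):

infixr 1 :*:
data SP a b = !a :*: b

newtype OrdM a = OrdM { runOrdM :: [a] -> SP Ordering [a] }

-- Run the left action, and only continue with the right one while the
-- accumulated ordering is still EQ; otherwise short-circuit.
instance Semigroup (OrdM a) where
  OrdM g <> OrdM h = OrdM $ \ys -> case g ys of
    r@(o :*: ys') -> case o of
      EQ -> h ys'
      _ -> r

instance Monoid (OrdM a) where
  mempty = OrdM (EQ :*:)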
Here are the benchmark results (GHC 9.6.3). The compared sequences are [1..m] and [1..n].
Show benchmarks
compare
100-100
compare: OK
1.53 μs ± 113 ns, 7.5 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
879 ns ± 86 ns, 7.2 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
907 ns ± 46 ns, 5.0 KB allocated, 0 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.02 μs ± 86 ns, 6.0 KB allocated, 0 B copied, 10 MB peak memory
10000-10000
compare: OK
163 μs ± 13 μs, 748 KB allocated, 260 B copied, 10 MB peak memory
compare_toList: OK
94.4 μs ± 5.5 μs, 719 KB allocated, 236 B copied, 10 MB peak memory
compare_leftFused: OK
93.6 μs ± 6.1 μs, 502 KB allocated, 160 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
109 μs ± 5.6 μs, 592 KB allocated, 134 B copied, 10 MB peak memory
100-10000
compare: OK
1.57 μs ± 111 ns, 7.8 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
887 ns ± 87 ns, 7.5 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
936 ns ± 92 ns, 5.2 KB allocated, 0 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.10 μs ± 90 ns, 6.3 KB allocated, 0 B copied, 10 MB peak memory
10000-100
compare: OK
1.66 μs ± 97 ns, 7.7 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
919 ns ± 82 ns, 7.5 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
933 ns ± 89 ns, 5.1 KB allocated, 0 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.06 μs ± 96 ns, 6.0 KB allocated, 0 B copied, 10 MB peak memory
It seems that compare_toList and compare_leftFused work best. The latter allocates less but is somehow slightly slower. So I'm not sure which of the two to pick.
The Core output is also too massive to read and reason about.
Other notes:
- An iterator (like we did for Set and Map) is not possible because of the non-regular recursion in FingerTree. Or at least, I can't think of a way to do it that would keep the performance benefits.
- The zip-like approach works outside-in. This wouldn't work for Ord because we need left-to-right comparison, but it could be done for Eq. I don't think it's worth it, though, because all the splitting would have a cost. For zip-like functions the trade-off may be worth it to get faster indexing without evaluating where not required, but that doesn't apply here.
I reran the benchmarks with -O (the benchmark suite is set to use -O2), and compare_toList is clearly faster than compare_leftFused even though it allocates more. This is convincing, so I'll make a PR for compare_toList.
Show benchmarks
compare
100-100
compare: OK
1.55 μs ± 112 ns, 7.5 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
890 ns ± 88 ns, 7.5 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
1.34 μs ± 102 ns, 6.4 KB allocated, 1 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.11 μs ± 99 ns, 6.1 KB allocated, 0 B copied, 10 MB peak memory
10000-10000
compare: OK
169 μs ± 15 μs, 748 KB allocated, 257 B copied, 10 MB peak memory
compare_toList: OK
104 μs ± 8.2 μs, 751 KB allocated, 257 B copied, 10 MB peak memory
compare_leftFused: OK
141 μs ± 12 μs, 643 KB allocated, 218 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
116 μs ± 5.3 μs, 608 KB allocated, 140 B copied, 10 MB peak memory
100-10000
compare: OK
1.57 μs ± 97 ns, 7.8 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
954 ns ± 48 ns, 7.8 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
1.44 μs ± 87 ns, 6.7 KB allocated, 1 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.20 μs ± 96 ns, 6.5 KB allocated, 0 B copied, 10 MB peak memory
10000-100
compare: OK
1.66 μs ± 92 ns, 7.7 KB allocated, 1 B copied, 10 MB peak memory
compare_toList: OK
983 ns ± 72 ns, 7.8 KB allocated, 1 B copied, 10 MB peak memory
compare_leftFused: OK
1.40 μs ± 114 ns, 6.6 KB allocated, 1 B copied, 10 MB peak memory
compare_foldMapOrdM: OK
1.15 μs ± 85 ns, 6.1 KB allocated, 0 B copied, 10 MB peak memory
I've come to the realization that the non-regular structure of Seq is terrible for the performance of folds. All the nesting and unnesting of closures, required only to make it typecheck, increases allocations and disables optimizations like arity analysis. Currently the folds are unrolled one level (thanks to #510), but the more we unroll the better it performs, with diminishing returns.
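To illustrate the nesting: each Deep level must lift the element function to work on Node, allocating a fresh closure per level just to satisfy the types. A simplified sketch of the shape (assuming the Foldable instances for Digit and Node from Data.Sequence.Internal):

-- At each level, the element function f is wrapped into a new closure
-- (\nd z' -> foldr f z' nd) so that the recursive call typechecks at
-- FingerTree (Node b).
foldrFT :: (b -> c -> c) -> c -> FingerTree b -> c
foldrFT f z EmptyT = z
foldrFT f z (Single x) = f x z
foldrFT f z (Deep _ pr m sf) =
  foldr f (foldrFT (\nd z' -> foldr f z' nd) (foldr f z sf) m) pr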
If we can switch to a regular structure,
  data FingerTree a
    = EmptyT
-   | Single a
-   | Deep {-# UNPACK #-} !Int !(Digit a) (FingerTree (Node a)) !(Digit a)
+   | Single (Tree a)
+   | Deep {-# UNPACK #-} !Int !(Digit (Tree a)) (FingerTree a) !(Digit (Tree a))
+
+ data Tree a
+   = Leaf a
+   | Node2 !Int a a
+   | Node3 !Int a a a
with the invariants maintained manually, we won't have this problem.
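For contrast, a fold over the proposed Tree would be plain first-order recursion, with the same element function at every level (a sketch over the hypothetical type above):

-- No lifting needed: the element type stays `a` all the way down.
foldrTree :: (a -> b -> b) -> b -> Tree a -> b
foldrTree f z (Leaf a) = f a z
foldrTree f z (Node2 _ x y) = foldrTree f (foldrTree f z y) x
foldrTree f z (Node3 _ x y w) = foldrTree f (foldrTree f (foldrTree f z w) y) x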
But is it worth sacrificing type-safety for efficiency? Or is there a way to have both?
I don't want to be diving into this right now, so I will go with the compare_toList implementation as far as this issue is concerned.
What you gain in fold performance gets lost elsewhere, I believe. A technique I used somewhere (priority queues, I think) is to start by making a GADT to represent depth in the tree while also supplying type evidence. Then that GADT can be turned into a fake GADT to improve efficiency, using unsafeCoerce and nice pattern synonyms. This would be a bit different, because we don't use a higher order nested datatype; I'm not totally sure if the same technique would apply. I kind of suspect it would though. Something vaguely like
data Foo a where
  This :: Foo (Elem a)
  That :: !(Foo a) -> Foo (Node a)
What you gain in fold performance gets lost elsewhere, I believe.
Yes that's possible. I will still want to try it out at some point.
A technique I used somewhere (priority queues, I think) is to start by making a GADT to represent depth in the tree while also supplying type evidence. Then that GADT can be turned into a fake GADT to improve efficiency, using
unsafeCoerce and nice pattern synonyms.
I was thinking about using GADTs too, but what is a fake GADT?
It's when you play make-believe with the type checker. Here's the code I mean.
Let me expand on the GADT approach a bit. Using a real GADT, we could do this:
data Where a n where
  Here :: Where a (Elem a)
  There :: !(Where a n) -> Where a (Node n)

foldMapBlob :: Monoid m => Where a t -> (a -> m) -> t -> m
foldMapBlob Here f (Elem a) = f a
foldMapBlob (There w) f (Node2 _ x y) = foldMapBlob w f x <> foldMapBlob w f y
foldMapBlob (There w) f (Node3 _ x y z) = foldMapBlob w f x <> foldMapBlob w f y <> foldMapBlob w f z

foldMapFT :: Monoid m => Where a t -> (a -> m) -> FingerTree t -> m
foldMapFT !_ _ EmptyT = mempty
foldMapFT w f (Single t) = foldMapBlob w f t
foldMapFT w f (Deep _ pr m sf) =
  foldMap (foldMapBlob w f) pr
    <> foldMapFT (There w) f m
    <> foldMap (foldMapBlob w f) sf

foldMapSeq :: Monoid m => (a -> m) -> Seq a -> m
foldMapSeq f (Seq t) = foldMapFT Here f t
We could use a fake GADT specialized to Elem and Node, but that would have to live in Data.Sequence.Internal. Alternatively, we could factor it out into its own (internally dangerous, but externally safe) module, by generalizing to (the faked version of)
data Where el no a n where
  Here :: Where el no a (el a)
  There :: !(Where el no a n) -> Where el no a (no n)

foldMapBlob :: Monoid m => Where Elem Node a t -> (a -> m) -> t -> m
foldMapBlob Here f (Elem a) = f a
foldMapBlob (There w) f (Node2 _ x y) = foldMapBlob w f x <> foldMapBlob w f y
foldMapBlob (There w) f (Node3 _ x y z) = foldMapBlob w f x <> foldMapBlob w f y <> foldMapBlob w f z

foldMapFT :: Monoid m => Where Elem Node a t -> (a -> m) -> FingerTree t -> m
foldMapFT !_ _ EmptyT = mempty
foldMapFT w f (Single t) = foldMapBlob w f t
foldMapFT w f (Deep _ pr m sf) =
  foldMap (foldMapBlob w f) pr
    <> foldMapFT (There w) f m
    <> foldMap (foldMapBlob w f) sf

foldMapSeq :: Monoid m => (a -> m) -> Seq a -> m
foldMapSeq f (Seq t) = foldMapFT Here f t
The main efficiency advantage of the nested type approach is that the leaves (Elem) don't have any structure; they're just pointers to the elements themselves. The tractable alternative would be to use a GADT (I don't think it's at all a good idea to stop using the type system to enforce tree shape). If we do that, we can recover compactness by switching to specialized leaves. Roughly speaking,
data Leaf a = Leaf2 a a | Leaf3 a a a
This possibility has been discussed in the issue tracker before, somewhere. It's totally plausible, but
- Code size for some of the most critical operations will go up.
- It's an awful lot of human work to port all the code to such a representation.
(Please don't actually use Where, Here, and There. Those were just what came to mind in the moment.)
... All the nesting and unnesting of closures, required only to make it typecheck,
(half jokingly) then fix GHC, instead of uglifying the code ...
what are the real-life applications of Data.Sequence that suffer from this inefficiency?
My main use case for Data.Sequence is: to show the source code to students, as an example where typing enforces balance ...
I use Data.Sequence in real code from time to time but it's not critical for performance (in the cases that I profiled).
Perhaps this is because I do not use Eq and Ord of containers much at all. These instances are certainly nice to have, e.g., to quickly use a Set as a key for a Map, but I found this is too expensive (e.g., for the powerset construction of an automaton), and needs to be replaced with some form of hashing, so the key of the hot Map is Int.
This is also why I am a bit slow with #787 (it is still on my list but other stuff is more urgent right now, sorry)
Let me expand on the GADT approach a bit. Using a real GADT, we could do this:
Nice! Why don't we do this? I expect this would be more efficient than what we have, even without the unsafe trick.
The main efficiency advantage of the nested type approach is that the leaves (
Elem) don't have any structure; they're just pointers to the elements themselves.
Yes, I can see this being a disadvantage of changing the representation.
... I found this is too expensive (e.g., for the powerset construction of an automaton), and needs to be replaced with some form of hashing, so the key of the hot Map is Int.
Hashing is a fold after all, so you stand to benefit from improved fold functions ;)
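For instance, hashing a Set is just a strict fold over its elements (a sketch; hashSet is a made-up name, not a containers API):

import Data.Bits (xor)
import qualified Data.Set as S

-- Combine elements with an FNV-1a-style step; any fold speedup
-- directly speeds this up.
hashSet :: S.Set Int -> Int
hashSet = S.foldl' (\h x -> (h `xor` x) * 16777619) 2166136261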
We've never used GADTs in containers. It's part of the tradition of trying to keep it plausibly portable to potential future Haskell implementations that don't have such fancy type systems as GHC. Is it a tradition worth upholding? I'm not always the best at making such decisions.
I think the unsafe trick is totally fine, as long as its limitations are respected to avoid the risk of nasal demons. That mostly means being very careful not to overflow the counter. Because sequences are weird, it's probably safest to do an overflow check for increment even on 64-bit systems, though there are fold contexts where that's overkill.
(Sorry, just passing by)
The only other modern implementation of Haskell is MicroHs, and it claims to support GADTs already.
@Bodigrim , are people using MicroHs? If so, we should be sure to make containers work on it!
I think the unsafe trick is totally fine, as long as its limitations are respected to avoid the risk of nasal demons. That mostly means being very careful not to overflow the counter. Because sequences are weird, it's probably safest to do an overflow check for increment even on 64-bit systems, though there are fold contexts where that's overkill.
The counter only goes up to the depth of the tree, which is O(log n), so I don't think we need to worry about overflow.
We've never used GADTs in containers. It's part of the tradition of trying to keep it plausibly portable to potential future Haskell implementations that don't have such fancy type systems as GHC.
We could have a separate definition for GHC under some CPP. Or, if we don't want to maintain two definitions, instead of applying unsafeCoerce in the fake GADT we could use it to directly coerce the thing in question to a Node or Elem. Do we consider that unsafeCoerce is available in non-GHC compilers?
I think the unsafe trick is totally fine, as long as its limitations are respected to avoid the risk of nasal demons. That mostly means being very careful not to overflow the counter. Because sequences are weird, it's probably safest to do an overflow check for increment even on 64-bit systems, though there are fold contexts where that's overkill.
The counter only goes up to the depth of the tree, which is O(log n), so I don't think we need to worry about overflow.
That particular bit of reasoning has a false premise: that tree depth is certain never to exceed maxBound @Word. While we assume throughout that it does not exceed maxBound @Int, leaving it to the user to enforce that restriction, we can't be so carefree about actual memory safety. And it's very easy to violate: consider replicate maxBound () *> replicate maxBound (), a sequence with maxBound^2 elements that doesn't even occupy much memory. So I'm certain we have to be concerned about this for 32-bit systems, and for 64-bit ones, the only question is how long it would take for a very fast computer to overflow the counter.
We've never used GADTs in containers. It's part of the tradition of trying to keep it plausibly portable to potential future Haskell implementations that don't have such fancy type systems as GHC.
We could have a separate definition for GHC under some CPP. Or, if we don't want to maintain two definitions, instead of applying unsafeCoerce in the fake GADT we could use it to directly coerce the thing in question to a Node or Elem. Do we consider that unsafeCoerce is available in non-GHC compilers?
Directly coercing the thing in question leads to quite a large "trusted base"—random unsafe coercions scattered around already intricate code. Let's not go there. CPP would suck: quite a lot of code is involved once this trick is used everywhere that would benefit from it. That includes pretty much everything that reaches deep into trees: maps, folds, traversals, zips, indexing, and I imagine even the mind-warping inits, tails, and Applicative methods. I think we pretty much have to decide to commit to GADTs or not. ISTR @ekmett had opinions on that.
Hold on, I missed a logarithm, didn't I? Let me think about that again; you're probably right.
Hold on, I missed a logarithm, didn't I? Let me think about that again; you're probably right.
No, you're right, I can see this being a problem. We need O(2^maxBound) elements, which could be done by repeatedly appending a sequence to itself maxBound times. Feasible for 32-bit systems? So we will need a bounds check.
Directly coercing the thing in question leads to quite a large "trusted base"—random unsafe coercions scattered around already intricate code. Let's not go there.
Understandable... it's not an appealing option.
Repeated appending risks multiplication by doubling, which could definitely be an issue on 32-bit systems. But repeated *> is much worse; it risks exponentiation by squaring! We've thought about adding bounds checks to the module in the past, but AFAIK no one's taken the time to investigate the performance impact.
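A toy sketch of the two growth modes (double and square are made-up names):

import Data.Sequence (Seq)

-- length (double s) = 2 * length s: multiplication by doubling.
double :: Seq a -> Seq a
double s = s <> s

-- length (square s) = length s ^ 2: exponentiation by squaring, so
-- iterating this n times on a k-element sequence yields k^(2^n) elements
-- while the representation stays small thanks to sharing.
square :: Seq a -> Seq a
square s = s *> s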
@Bodigrim, are people using MicroHs? If so, we should be sure to make containers work on it!
I like the project a lot (Haskell on microcontrollers!), but I think it will take some time before compatibility with MicroHs becomes a practical question. It's in its early days.
Regarding the alternate representation,
This possibility has been discussed in the issue tracker before, somewhere.
Found it: #91. It looks type-safe but also needs GADTs.
I also realized something neat about these two representations. It comes down to the question: how do we know when we reach a leaf with an element?
- For #91 (or https://github.com/haskell/containers/issues/1016#issuecomment-2336465956), the leaves are identified by their constructor tag, which is very simple (setting aside the specialized leaves).
- For the current representation, that is not the case. Instead, when going down the tree we must do some work and keep track of the depth using some mechanism (closures, GADT, fake GADT) to know when to stop.
So, this is a good old space-vs-time trade-off.
Paying the space to simplify/speed up a lot of other things seems like a good deal to me personally. But it remains to be seen how well it would work in practice.
Space typically has a huge impact on time. I suspect you'll find that trading compactness for speed will get you neither. But you're welcome to try.
Closing this issue. The Seq instances may improve further if #1036 gets done, but that's a question for later.