Fusible IntSet.fromDistinctAscList definition
Sister issue to #949. I adapted the old algorithm from #658 into a fold to see if it improves things for IntSet like it did for Set.
fromDistinctAscList :: [Key] -> IntSet
fromDistinctAscList = finish . Foldable.foldl' next StateEmpty
where
next StateEmpty z = State (prefixOf z) (bitmapOf z) Nada
next (State kx bm stk) z
| kx == pz = State pz (bm .|. bmz) stk
| otherwise = State pz bmz (linkTop (branchMask z kx) kx (Tip kx bm) stk)
where
pz = prefixOf z
bmz = bitmapOf z
finish StateEmpty = Nil
finish (State kx bm stk0) = linkAll kx (Tip kx bm) stk0
{-# INLINE fromDistinctAscList #-}
linkTop :: Mask -> Prefix -> IntSet -> Stack -> Stack
linkTop !m !px !tx (Push py ty stk)
| shorter m mxy = linkTop m pxy (Bin pxy mxy ty tx) stk
where
mxy = branchMask px py
pxy = mask px mxy
linkTop _ px tx stk = Push px tx stk
{-# INLINABLE linkTop #-}
linkAll :: Prefix -> IntSet -> Stack -> IntSet
linkAll !_ tx Nada = tx
linkAll px tx (Push py ty stk)
| m < 0 = Bin p m tx ty
| otherwise = linkAll p (Bin p m ty tx) stk
where
m = branchMask px py
p = mask px m
{-# INLINABLE linkAll #-}
data FromDistinctMonoState = State {-# UNPACK #-} !Prefix {-# UNPACK #-} !BitMap !Stack | StateEmpty
data Stack = Push {-# UNPACK #-} !Prefix !IntSet !Stack | Nada
Benchmarking 4 scenarios
bench "fromDistinctAscList:dense" $ whnf IS.fromDistinctAscList elems -- elems = [1..n]
, bench "fromDistinctAscList:dense:fusion" $
whnf (\n -> IS.fromDistinctAscList [1..n]) bound
, bench "fromDistinctAscList:sparse" $ whnf IS.fromDistinctAscList elems_sparse -- elems_sparse = map (*64) [1..n]
, bench "fromDistinctAscList:sparse:fusion" $
whnf (\n -> IS.fromDistinctAscList (map (*64) [1..n])) bound
GHC 9.2.5, current
fromDistinctAscList:dense: OK (0.22s)
13.6 μs ± 701 ns, 4.0 KB allocated, 3 B copied, 7.0 MB peak memory
fromDistinctAscList:dense:fusion: OK (0.20s)
24.3 μs ± 1.5 μs, 291 KB allocated, 150 B copied, 7.0 MB peak memory
fromDistinctAscList:sparse: OK (0.20s)
52.2 μs ± 3.3 μs, 256 KB allocated, 16 KB copied, 7.0 MB peak memory
fromDistinctAscList:sparse:fusion: OK (0.12s)
68.3 μs ± 6.1 μs, 541 KB allocated, 17 KB copied, 7.0 MB peak memory
GHC 9.2.5, new
fromDistinctAscList:dense: OK (0.23s)
14.0 μs ± 1.1 μs, 6.0 KB allocated, 3 B copied, 7.0 MB peak memory, same as baseline
fromDistinctAscList:dense:fusion: OK (0.15s)
4.49 μs ± 349 ns, 6.0 KB allocated, 3 B copied, 7.0 MB peak memory, 81% less than baseline
fromDistinctAscList:sparse: OK (0.14s)
70.2 μs ± 5.5 μs, 383 KB allocated, 12 KB copied, 7.0 MB peak memory, 34% more than baseline
fromDistinctAscList:sparse:fusion: OK (0.13s)
60.9 μs ± 5.3 μs, 382 KB allocated, 12 KB copied, 7.0 MB peak memory, 10% less than baseline
GHC 9.6.2, current
fromDistinctAscList:dense: OK (0.13s)
17.0 μs ± 1.5 μs, 4.0 KB allocated, 4 B copied, 7.0 MB peak memory
fromDistinctAscList:dense:fusion: OK (0.21s)
24.7 μs ± 1.5 μs, 291 KB allocated, 150 B copied, 7.0 MB peak memory
fromDistinctAscList:sparse: OK (0.19s)
45.9 μs ± 2.9 μs, 256 KB allocated, 16 KB copied, 7.0 MB peak memory
fromDistinctAscList:sparse:fusion: OK (0.26s)
62.3 μs ± 2.7 μs, 542 KB allocated, 17 KB copied, 7.0 MB peak memory
GHC 9.6.2, new
fromDistinctAscList:dense: OK (0.29s)
16.9 μs ± 766 ns, 6.0 KB allocated, 3 B copied, 7.0 MB peak memory, same as baseline
fromDistinctAscList:dense:fusion: OK (0.14s)
4.19 μs ± 361 ns, 6.0 KB allocated, 3 B copied, 7.0 MB peak memory, 83% less than baseline
fromDistinctAscList:sparse: OK (0.22s)
51.8 μs ± 2.8 μs, 383 KB allocated, 12 KB copied, 7.0 MB peak memory, 12% more than baseline
fromDistinctAscList:sparse:fusion: OK (0.19s)
46.0 μs ± 3.0 μs, 383 KB allocated, 12 KB copied, 7.0 MB peak memory, 26% less than baseline
As expected
- It is more efficient in both time and memory when there is fusion
- It is less efficient in memory when there is no fusion
However, it is worse in time for fromDistinctAscList:sparse, which is disappointing.
I'll try to figure out if it can be improved. Happy to hear if you have any ideas for that.
If it does not improve, I think this is still not a bad idea. The improvement on fromDistinctAscList:dense:fusion looks great. We could use rewrite rules to try to fuse and fall back to the current implementation, as we considered in #949.
I tweaked the implementation to store masks on the stack instead of prefixes, saving some calculations since the mask is what is compared every time in linkTop.
With it, the new timings are
GHC 9.2.5, new
fromDistinctAscList:dense: OK (0.21s)
13.2 μs ± 657 ns, 6.0 KB allocated, 3 B copied, 7.0 MB peak memory, same as baseline
fromDistinctAscList:dense:fusion: OK (0.14s)
4.24 μs ± 396 ns, 6.0 KB allocated, 3 B copied, 7.0 MB peak memory, 82% less than baseline
fromDistinctAscList:sparse: OK (0.22s)
55.3 μs ± 3.1 μs, 383 KB allocated, 12 KB copied, 7.0 MB peak memory, 7% more than baseline
fromDistinctAscList:sparse:fusion: OK (0.17s)
45.8 μs ± 2.9 μs, 384 KB allocated, 12 KB copied, 7.0 MB peak memory, 32% less than baseline
GHC 9.6.2, new
fromDistinctAscList:dense: OK (0.14s)
16.4 μs ± 1.4 μs, 6.0 KB allocated, 4 B copied, 7.0 MB peak memory, same as baseline
fromDistinctAscList:dense:fusion: OK (0.26s)
4.07 μs ± 177 ns, 6.0 KB allocated, 3 B copied, 7.0 MB peak memory, 83% less than baseline
fromDistinctAscList:sparse: OK (0.17s)
40.4 μs ± 2.8 μs, 383 KB allocated, 12 KB copied, 7.0 MB peak memory, 11% less than baseline
fromDistinctAscList:sparse:fusion: OK (0.15s)
34.9 μs ± 3.1 μs, 383 KB allocated, 12 KB copied, 7.0 MB peak memory, 43% less than baseline
Now it always takes less time vs current on 9.6.2, even for fromDistinctAscList:sparse 🎉
Though on 9.2.5 it's still slightly slower.
The implementation
fromDistinctAscList :: [Key] -> IntSet
fromDistinctAscList = finish . Foldable.foldl' next StateEmpty
where
next st x = case st of
StateEmpty -> State px bx Nada
State py by stk
| px == py -> State px (by .|. bx) stk
| otherwise -> State px bx (linkTop (branchMask px py) py (Tip py by) stk)
where
px = prefixOf x
bx = bitmapOf x
finish StateEmpty = Nil
finish (State pr br stk) = linkAll pr (Tip pr br) stk
{-# INLINE fromDistinctAscList #-}
linkTop :: Mask -> Prefix -> IntSet -> Stack -> Stack
linkTop !m !pr !tr (Push mlr tl stk)
| shorter m mlr = linkTop m pr (Bin (mask pr mlr) mlr tl tr) stk
linkTop m _ tl stk = Push m tl stk
linkAll :: Prefix -> IntSet -> Stack -> IntSet
linkAll !_ !tr Nada = tr
linkAll pr tr (Push mlr tl stk)
| mlr < 0 = Bin 0 mlr tr tl
| otherwise = linkAll pr (Bin (mask pr mlr) mlr tl tr) stk
data FromDistinctMonoState = State {-# UNPACK #-} !Prefix {-# UNPACK #-} !BitMap !Stack | StateEmpty
data Stack = Push {-# UNPACK #-} !Mask !IntSet !Stack | Nada
Do you want to open a PR?
I do, but I would prefer to be done with the similar Set/Map stuff before getting to this properly.