containers Data.Set.fromDistinctAscListN

just making a note that such a function (with N = length of second argument) could be useful, since an implementation can use N to determine the shape of the tree in advance, so there's no need to re-balance during construction.

see https://github.com/haskell/containers/issues/890#issuecomment-1380917641

I made some experiments here https://gitlab.imn.htwk-leipzig.de/waldmann/fdaln and the timing data looks very strange...

Jan 15 '23 18:01 jwaldmann

Also descending versions?

Jan 15 '23 19:01 treeowl

Your implementations all look weird to me. This is a deserialization problem.

fromDistinctAscendingListN :: Int -> [a] -> (Set a, [a])
fromDistinctAscendingListN n xs
  | s :!* remains <- fdalN n xs
  = (s, remains)

data LSP a b = !a :!* b
fdalN :: Int -> [a] -> LSP (Set a) [a]
fdalN n xs | n <= 0 = Tip :!* xs
fdalN 1 (x:xs) = singleton x :!* xs
fdalN n xs
  | start :*! (x : xs') <- fdalN (n `quot` 2) xs
  , end :*! xs'' <- fdalN ((n-1) `quot` 2) xs'
  = Bin n x start end :!* xs''
fdalN _ _ = error "List too short"

Jan 15 '23 19:01 treeowl

Mine looks a lot like a direct version of your continuation-passing ones. I'd expect it to be easier on the optimizer.

Jan 15 '23 19:01 treeowl

I added your version. The ST version is still faster.

ghc-906.0 fromDistinctAscListN (ST, []) : (0.763300835s,Sum {getSum = 50000005000000})
ghc-906.0 fromDistinctAscListN (deserialize) : (0.993581475s,Sum {getSum = 50000005000000})

I am puzzled by the timings. (so perhaps the method of measurement is not OK - or something else is not OK)

(Data.Set problem?) fromAscList is faster that fromDistinctAscList - but fromAscList calls fromDistinctAscList!

ghc-904.4 Data.Set.fromAscList : (0.584517719s,Sum {getSum = 50000005000000})
ghc-904.4 Data.Set.fromDistinctAscList : (0.987086451s,Sum {getSum = 50000005000000})

generally, performance fluctuates wildly with compiler version:

fromAscList

ghc-810.7 Data.Set.fromAscList : (0.87449644s,Sum {getSum = 50000005000000})
ghc-900.2 Data.Set.fromAscList : (0.864592133s,Sum {getSum = 50000005000000})
ghc-902.5 Data.Set.fromAscList : (0.567283365s,Sum {getSum = 50000005000000})
ghc-904.4 Data.Set.fromAscList : (0.576127222s,Sum {getSum = 50000005000000})
ghc-906.0 Data.Set.fromAscList : (1.789686293s,Sum {getSum = 50000005000000})

this is just foldMap Sum over a list (no Set at all):

ghc-810.7 Data.List : (0.714177161s,Sum {getSum = 50000005000000})
ghc-900.2 Data.List : (0.712193009s,Sum {getSum = 50000005000000})
ghc-900.2 Data.List : (0.721844117s,Sum {getSum = 50000005000000})
ghc-902.5 Data.List : (0.93300108s,Sum {getSum = 50000005000000})
ghc-904.4 Data.List : (0.141567047s,Sum {getSum = 50000005000000})    <------
ghc-906.0 Data.List : (0.698874088s,Sum {getSum = 50000005000000})

my ST version: best for <= 9.2

ghc-810.7 fromDistinctAscListN (ST, []) : (0.525096397s,Sum {getSum = 50000005000000})
ghc-900.2 fromDistinctAscListN (ST, []) : (0.551625999s,Sum {getSum = 50000005000000})
ghc-902.5 fromDistinctAscListN (ST, []) : (0.503619634s,Sum {getSum = 50000005000000})
ghc-904.4 fromDistinctAscListN (ST, []) : (0.904040714s,Sum {getSum = 50000005000000})
ghc-906.0 fromDistinctAscListN (ST, []) : (0.761270325s,Sum {getSum = 50000005000000})

Jan 15 '23 22:01 jwaldmann

I agree none of this makes sense. There's no reasonable world in which adding the overhead of STRef and such should improve matters. Is GHC doing something very silly? Was your system reasonably quiet? Were you getting pretty consistent results?

Jan 15 '23 22:01 treeowl

quiet: yes. consistent: yes - my run.sh does everything twice, see https://gitlab.imn.htwk-leipzig.de/waldmann/fdaln/-/blob/master/1.LOG

Jan 15 '23 22:01 jwaldmann

I do think your test is a bit unusual. First off, why sum the sets at all? Why not just force them to WHNF? They're totally strict structures. Secondly, why not use a common benchmark system? containers uses tasty-bench because it doesn't have a lot of dependencies. How did you choose to use a list in memory rather than one that's generated? Does that make a difference to the relative performance?

Jan 15 '23 23:01 treeowl

OK, with tasty-bench, some sanity is restored. Still, wild fluctations w.r.t. compiler version, see https://gitlab.imn.htwk-leipzig.de/waldmann/fdaln/-/blob/master/2.LOG

Jan 16 '23 03:01 jwaldmann

I haven't checked the linked implementations but noticed this issue so I thought I should mention, the amount of rebalancing work in Set.fromDistinctAscList is just $O(\log n)$. Which is admittedly not zero, but very little compared to the total $O(n)$ time. So fromDistinctAscListN vs fromDistinctAscList will come down to constant factor, and the benefit if any will likely be very small.

(See also #949)

Jun 16 '23 05:06 meooow25

containers containers copied to clipboard

Data.Set.fromDistinctAscListN

containers
containers copied to clipboard