
Improve powerSet performance

Open treeowl opened this issue 3 years ago • 53 comments

Obviously, there's only so much we can do, but we can do some. The most obvious optimization is to use a version of insertMin that takes a singleton tree as an argument instead of an element. But I can't help wondering if we can do better. The problem is obviously $\Omega(2^n)$, and our solution is $O(2^n\log n)$. Can we close that gap by either improving our solution or proving a tight(er) lower bound?
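For context, here is a minimal reference version built on the public Data.Set API (a sketch only; the library's actual implementation works on the internal representation and needs no Ord constraint, and `powerSetRef` is a hypothetical name):

```haskell
import qualified Data.Set as Set

-- Naive reference: fold each element in, doubling the collection by
-- inserting that element into every subset built so far.
powerSetRef :: Ord a => Set.Set a -> Set.Set (Set.Set a)
powerSetRef = Set.foldr step (Set.singleton Set.empty)
  where
    step x acc = acc `Set.union` Set.map (Set.insert x) acc
```

This makes the O(2^n log n) cost visible: every one of the 2^n subsets is touched by insert/union work carrying a log factor.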

treeowl avatar Jan 08 '23 19:01 treeowl

@meooow25 , I started rambling about this in comments on your PR. Feel very welcome to think about it.

treeowl avatar Jan 08 '23 19:01 treeowl

We can certainly improve the space requirement for the result to $O(2^n)$, though I don't yet know whether we can do so without making the time blow up. In particular, each set in the result with $k \ge 3$ elements can be built with bin from two sets in the result of equal or nearly equal size, plus an element. So the trick is to get the right sets and the right element where we need them, when we need them.

treeowl avatar Jan 08 '23 22:01 treeowl

Continuing to think out loud: imagine we build all the singletons first, then the 2-sets, then the 3-sets, etc. When we're working on the $k$-sets, we can hold on to the collections of $\lfloor k/2 \rfloor$-sets and $\lceil k/2 \rceil$-sets. The question is, how do we know which one to combine with which, and with which root element? I've drawn out the first few stages of this process, but it's hard to draw enough of it to look for patterns because the diagram grows exponentially! In case it helps anyone, here's a diagram for powerSet [1..5]. Each starting column represents a set size, and sets are lexicographically ordered from top to bottom.

[]
  [1]
    [1,2]
      [1,2,3]
        [1,2,3,4]
          [1,2,3,4,5]
        [1,2,3,5]
      [1,2,4]
        [1,2,4,5]
      [1,2,5]
    [1,3]
      [1,3,4]
        [1,3,4,5]
      [1,3,5]
    [1,4]
      [1,4,5]
    [1,5]
  [2]
    [2,3]
      [2,3,4]
        [2,3,4,5]
      [2,3,5]
    [2,4]
      [2,4,5]
    [2,5]
  [3]
    [3,4]
      [3,4,5]
    [3,5]
  [4]
    [4,5]
  [5]

treeowl avatar Jan 08 '23 22:01 treeowl

For the next ... while ... I will assume that we're taking the power set of [1..n]. That can be generalized by tracking the position (in the original set) of the last element of a subset. Suppose we're building from columns j and k (which will either be identical or adjacent), so that each set we build will look like p ++ [e] ++ q, where p is from j and q is from k. We walk p through j from its beginning until we reach a point where the largest element of p is too big to be able to choose e and still have enough room to take q. For each p, we need to be able to find the matching qs, and we have to do it efficiently. I don't think we can afford to walk k from the start for each p. So let's walk k once to break it up into lists by first element, and put those lists in an array indexed by first element. Now we can quickly find where the qs start for any given p.

The hardest remaining part seems to be fitting the collections of elements of each size together into a single Set. There's that wobbling back and forth in size. We have to know when to zig and when to zag.

treeowl avatar Jan 09 '23 00:01 treeowl

@ekmett Do you have any ideas, either in the direction I'm pointing or another one?

treeowl avatar Jan 09 '23 05:01 treeowl

I don't fully understand your last comment, but I can follow the ones before it. The diagram is also interesting.

Here has been my line of thought:

The current algorithm isn't useful in getting to $O(2^n)$ because of insertMin. If that were $O(1)$, we would be good. So this works in optimal time for lists, for instance, but not sets.

To construct sets in $O(1)$, we need left and right subtrees that are balanced wrt each other and the root element. We can have an array :: Array Int (Set a) where the index is a bitmask denoting indices of the set of elements in the original set. Then we can build the sets for a mask by finding the middle set bit, taking that as root, then indexing into the array with the lower and upper parts of the mask for left and right subtrees.

If we find the middle bit by looping over the positions, we get $O(n 2^n)$ time but $O(2^n)$ memory, which is nice. If we instead binary search for the middle bit, given something like $O(1)$ popCount, we are back to $O(2^n \log n)$ time but still with $O(2^n)$ memory. I don't know of a way to find the middle bit in $O(1)$; if we could do that, we would be golden.
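The binary-search option could look something like this (a sketch; middleBitBS is a hypothetical name): find the smallest bit position p such that the number of set bits at or below p exceeds half the total popCount.

```haskell
import Data.Bits (bit, popCount, (.&.), finiteBitSize)

-- Binary search for the middle set bit of a nonzero mask, using popCount
-- on prefixes. The answer is the smallest position p such that the count
-- of set bits at or below p exceeds popCount msk `div` 2. O(log n) steps.
middleBitBS :: Int -> Int
middleBitBS msk = search 0 (finiteBitSize msk - 1)
  where
    target = popCount msk `div` 2
    search lo hi
      | lo == hi  = lo
      | popCount (msk .&. (bit (mid + 1) - 1)) > target = search lo mid
      | otherwise = search (mid + 1) hi
      where
        mid = (lo + hi) `div` 2
```

For example, 0b11011001 has set bits {0,3,4,6,7}, popCount 5, target 2, and the search settles on bit 4.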

meooow25 avatar Jan 09 '23 11:01 meooow25

It's actually not difficult.

middleBit 1 = 0
middleBit n
  | even n    = middleBit (n `div` 2) + 1
  | otherwise = middleBit (n - 1)

We can memoize for $O(1)$.

So that's at least one way to build the power set in $O(2^n)$. But I overlooked the fact that we need to convert from the mask order to the sorted order of sets. Trying to figure that out.

meooow25 avatar Jan 09 '23 15:01 meooow25

I don't understand what you mean about a middle bit.

A bit more explanation of where I was heading: For each suitable left child, we choose one or more suitable roots. For each such root, we choose one or more right children. For instance, suppose we're walking the collection of 2-sets and looking at the collection of 3-sets to form the collection of 6-sets (we could actually do it either way, depending on whether we want our trees to be heavier on the left or on the right). If we've walked up to [1,3] in the collection of 2-sets, then we need to consider, in turn, each root 4, 5, .... We can't use too big a root, because we need to have at least three greater elements in the right child, so we can stop considering roots once they get too big. Suppose we are at root 5, so we have a left subtree of [1,3] and a root of 5. We now need to use, as right children, all the 3-trees whose first element is at least 6. My array concept was intended to make it cheap to jump to those 3-trees, though I'm not sure if it's theoretically necessary.
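At the list level (ignoring trees, the array trick, and early cutoffs), the enumeration sketched above can be written out like this; combineParts and kSubsets are hypothetical names for illustration only:

```haskell
import qualified Data.List as L

-- All k-subsets of xs, in lexicographic order (helper for the sketch).
kSubsets :: Int -> [Int] -> [[Int]]
kSubsets k xs = L.sort [s | s <- L.subsequences xs, length s == k]

-- For each left part p, each admissible root e, and each right part q
-- lying entirely above e, emit p ++ [e] ++ q. If the two input lists are
-- in lexicographic order, so is the output.
combineParts :: [[Int]] -> [[Int]] -> [Int] -> [[Int]]
combineParts lefts rights universe =
  [ p ++ [e] ++ q
  | p <- lefts
  , e <- universe, all (< e) p
  , q <- rights, all (> e) q
  ]
```

Every (j+m+1)-subset decomposes uniquely into its first j elements, its middle element, and its last m elements, so combining the j-subsets and m-subsets this way produces each larger subset exactly once.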

treeowl avatar Jan 09 '23 16:01 treeowl

Ah, the pattern is obvious now that I see it. Silly me! We always go up by one in length after a subset that does not include the maximum value. We always go down by one in length otherwise. So once we have all the subsets, getting them in order is actually easy. We can make them a set using fromDistinctAscList (or a version thereof that knows the size in advance). Is there a better way to form the set in the end? Maybe, but let's get to something that works first and think about constant factors (and GC effects) later.
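That zig-zag rule can be checked mechanically against a lexicographic enumeration of the subsets (a sketch; lexSubsets is a hypothetical helper):

```haskell
-- Subsets of a list in lexicographic order, as in the diagram above:
-- the empty set, then all subsets starting with the first element,
-- then all subsets starting with the second, and so on.
lexSubsets :: [a] -> [[a]]
lexSubsets xs0 = [] : go xs0
  where
    go []     = []
    go (y:ys) = map (y:) (lexSubsets ys) ++ go ys
```

Walking consecutive subsets of [1..5] in this order, the length drops by one exactly after a subset containing the maximum element 5, and rises by one otherwise.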

treeowl avatar Jan 09 '23 20:01 treeowl

Thanks for the explanation, I can see how you want to construct the sets, but I don't understand how it all ties together. Perhaps some code/pseudocode can make things clearer.

I don't understand what you mean about a middle bit.

The middle set bit of 0b11011001 is bit 4, for example (0-indexed). To construct the set for this mask, we take element 4 as the root, the set we constructed for 0b1001 as the left subtree and that for 0b11000000 as the right subtree.
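Concretely, the middle set bit is the (popCount `div` 2)-th set bit counting from the low end; a direct, non-memoized way to compute it (a sketch; middleSetBit is a hypothetical name, and this loop is O(popCount), not the O(1) we'd want):

```haskell
import Data.Bits ((.&.), popCount, countTrailingZeros)

-- Position of the "middle" set bit of a nonzero mask: skip half the set
-- bits from the bottom, then read off the position of the next one.
middleSetBit :: Int -> Int
middleSetBit msk = go msk (popCount msk `div` 2)
  where
    go m 0 = countTrailingZeros m
    go m k = go (m .&. (m - 1)) (k - 1)  -- clear the lowest set bit
```

For 0b11011001 (= 217, set bits {0,3,4,6,7}) this skips bits 0 and 3 and returns 4, matching the example above.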

Here is my idea implemented. Not benchmarked or optimized.

{-# LANGUAGE ScopedTypeVariables #-}

import Data.Bits (bit, xor, (.&.), popCount, finiteBitSize, countLeadingZeros)
import qualified Data.Array.Unboxed as UA
import qualified Data.List as L
import Data.Set.Internal

-- O(2^n)
powerSet :: forall a. Set a -> Set (Set a)
powerSet xs0 = fromDistinctAscList (L.map (sets UA.!) sortedMasks)
  where
    n = size xs0
    xs = UA.listArray (0, n - 1) (toList xs0) :: UA.Array Int a

    -- TODO: Use unboxed array for better constant factor
    middleBit :: UA.Array Int Int
    middleBit = UA.listArray (1, bit n - 1) (L.map f [1 .. bit n - 1])
      where
        f 1 = 0
        f msk
          | even msk            = (middleBit UA.! (msk `quot` 2)) + 1
          | even (popCount msk) = middleBit UA.! (msk - 1)
          | otherwise           = let x = middleBit UA.! (msk - 1)
                                      -- the highest set bit below x
                                      msk' = msk .&. (bit x - 1)
                                  in finiteBitSize msk' - 1 - countLeadingZeros msk'

    sets :: UA.Array Int (Set a)
    sets = UA.listArray (0, bit n - 1) (L.map f [0 .. bit n - 1])
      where
        f 0 = Tip
        f msk = bin x (sets UA.! mskl) (sets UA.! mskr)
          where
            m = middleBit UA.! msk
            x = xs UA.! m
            mskl = msk .&. (bit m - 1)
            mskr = msk `xor` bit m `xor` mskl

    -- TODO: Use an unboxed array for better constant factor
    sortedMasks :: [Int]
    sortedMasks = 0 : L.foldl' step [] [n-1, n-2 .. 0]
      where
        step msks i = bit i : L.map (bit i +) msks ++ msks

meooow25 avatar Jan 10 '23 15:01 meooow25

It's fascinating to follow your discussion, but, forgive me, where's the use case?

Quite unscientifically, I checked the (approx. 36,000) *.hs files that I got from unpacking $HOME/.cabal/packages/**/*.tar.gz (just what I happen to have lying around) - the only mention of powerSet is its implementation (and tests) in containers.

Another quick experiment shows

ghci> foldMap (Sum . length) $ S.powerSet $ S.fromList [1..20::Int]
Sum {getSum = 10485760}
(0.99 secs, 483,502,672 bytes)

ghci> foldMap (Sum . length) $ L.subsequences [1..20::Int]
Sum {getSum = 10485760}
(0.29 secs, 278,620,464 bytes)

I read this as: with the current implementation, tree balancing doubles the space and triples the time, compared to no balancing (and no trees at all). And that's fine?

We really can't apply powerSet to anything substantially larger, because then it wouldn't fit in memory? (But it has to, since the outer tree, and all inner trees, are strict?)

Perhaps something Set a -> [Set a] (producing a lazy list) is more useful? The first candidate is map S.fromList . L.subsequences. (And their "maximally lazy" order is nice.)
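That candidate, spelled out (powerList is a hypothetical name):

```haskell
import qualified Data.List as L
import qualified Data.Set as S

-- Lazily enumerate all subsets. The inner sets are built one at a time as
-- the list is consumed, so memory stays proportional to what is retained.
powerList :: Ord a => S.Set a -> [S.Set a]
powerList = map S.fromList . L.subsequences . S.toList
```

Because subsequences is maximally lazy, taking a prefix of the result never forces the rest, even for inputs far too large for the full power set to fit in memory.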

Well, there's always the application as an exercise in a data structures course, so by all means, go ahead...

jwaldmann avatar Jan 10 '23 16:01 jwaldmann

I agree that there's not much utility to this, personally I've just been treating this as a puzzle 😁

We really can't apply powerSet to anything substantially larger, because then it wouldn't fit in memory?

Probably so in practice. The formal limit on the input set size can be considered to be word_size - 2, since the output size $2^n$ must not exceed maxBound :: Int.

meooow25 avatar Jan 10 '23 16:01 meooow25

@jwaldmann This is absolutely just a puzzle. IIRC, I added powerSet to Data.Set because @ekmett thought it should be there, and offered the simple implementation we currently use. Something Set -> [Set] would certainly be accepted if there's a use for it and a good implementation. Since it's not a "fundamental operation on mathematical sets", I don't think it can slip in under the door like powerSet.

Going back to powerSet: Dropping from $O(2^n \log n)$ space to $O(2^n)$ space (with optimal sharing) should buy usability for a few extra input elements. On the time side, the logarithmic factor probably doesn't matter, so if someone has an $O(2^n)$ space/ $O(2^n \log n)$ time algorithm that actually runs faster than whatever $O(2^n)$ time algorithm we find, we can use the "theoretically slower" one instead.

treeowl avatar Jan 10 '23 16:01 treeowl

@meooow25 Your implementation looks interesting, but I can't make head or tail of how it works. Could you explain? How does it achieve lexicographical order? ~~I also noticed a type error: unboxed arrays can't hold Sets.~~

treeowl avatar Jan 10 '23 16:01 treeowl

Wait, now I'm confused... How does that typecheck?

treeowl avatar Jan 10 '23 16:01 treeowl

@meooow25 Your implementation doesn't pass the test suite, unfortunately. I'm still very confused about how it passes the type checker.

treeowl avatar Jan 10 '23 16:01 treeowl

To clarify, it doesn't pass the test suite once the test has been changed as in #892. That's (at least partly) because the subsets you produce are totally unbalanced—just lists.

treeowl avatar Jan 10 '23 16:01 treeowl

How does that typecheck?

I didn't actually use unboxed arrays here (Array vs UArray; Data.Array.Unboxed re-exports the boxed Array type as well). Perhaps the comment was confusing; I meant it as TODO: Use unboxed array...

the subsets you produce are totally unbalanced—just lists

Oops, the middle bit logic was at fault. Updated, but it's a little more complex now.

How does it achieve lexicographical order?

That's the sortedMasks part, which is the same as the current algorithm, only it constructs a list of bitmasks instead of a set of sets. Is there any other part you'd like me to explain?
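In isolation, that part looks like this (a sketch with n passed explicitly; in the full code n is already in scope):

```haskell
import Data.Bits (bit)
import qualified Data.List as L

-- Bitmasks of all subsets of n elements, in the lexicographic order of the
-- subsets they denote (bit i set = element i present). Same fold shape as
-- the current powerSet algorithm, but on Ints instead of Sets.
sortedMasks :: Int -> [Int]
sortedMasks n = 0 : L.foldl' step [] [n-1, n-2 .. 0]
  where
    step msks i = bit i : L.map (bit i +) msks ++ msks
```

For n = 3 this yields the masks for [], [0], [0,1], [0,1,2], [0,2], [1], [1,2], [2] in that order.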

meooow25 avatar Jan 10 '23 17:01 meooow25

Yes, the whole thing? I don't understand the shape of it. Could you maybe open a PR with lots of comments? I'm a comment maximalist.

treeowl avatar Jan 10 '23 17:01 treeowl

I don't want to open a PR without actually testing that it is an improvement.

I'll try to explain with an example. Consider that the input set is {0,1,2}. Then we make an array sets to hold the power set, indexed by bitmasks 0b000 to 0b111. The bits in a mask indicate which elements are in the set:

sets = array (0b000,0b111)
  [(0b000,{}), (0b001,{0}), (0b010,{1}), (0b011,{0,1}), ... (0b111,{0,1,2})]

To build these sets in constant time for any mask, we find the middle set bit, take the corresponding element as root, and pick the left and right subtrees out of sets by indexing with the masks we want (as I described in the comment before). For example:

sets ! 0b111 = bin 1 (sets ! 0b001) (sets ! 0b100) = bin 1 {0} {2} = {0,1,2}
sets ! 0b101 = bin 2 (sets ! 0b001) (sets ! 0b000) = bin 2 {0} {}  = {0,2}

Now that we have all the sets, we need to convert them to lexicographical order. We use the current powerSet algorithm, which does build sets in the right order, to get the right order of bitmasks: [0b000, 0b001, 0b011, 0b111, 0b101, ... 0b100]. Then we can pick the corresponding sets out of sets to get [{}, {0}, {0,1}, {0,1,2}, {0,2}, ... {2}], and build them into the output set with fromDistinctAscList.

meooow25 avatar Jan 10 '23 17:01 meooow25

@meooow25 FWIW, I would prefer to get a PR, even before perf testing. That'll sic CI on it, and make it easier for me (or others) to play with the implementation. I'm going to try to write an implementation based on my own idea to see how that compares, but I won't delay merging yours to do so. I suspect a lot of performance issues will come down to how badly we trash the generational hypothesis, which is mostly about how much long-lived intermediate structure we build. As I said, I don't understand your way yet. Mine (to the extent I've elaborated it so far) does not seem to have a good answer for that.

treeowl avatar Jan 11 '23 20:01 treeowl

OK, puzzle time.

how much long-lived intermediate structure we build

As far as I understand, the only intermediate thing should be this "array of sets, indexed by bitmasks"? Just the array itself - its contents are permanent (trees that are reachable from the resulting `Set (Set a)`).

convert them to lexicographical order. ... We use the current powerSet algorithm,

You mean an equivalent function that works directly on bitmasks? There must be something in Hacker's Delight, HAKMEM, or https://graphics.stanford.edu/~seander/bithacks.html. Then use that for a backpermute or similar. That's two extra arrays, but that doesn't matter much (relative to the total amount of output data)?

jwaldmann avatar Jan 12 '23 15:01 jwaldmann

It's fascinating to follow your discussion, but, forgive me, where's the use case?

The primary use case I have (and what was the original motivation behind the instance) is that this algorithm is useful when constructing DFAs from NFAs. I have code for that, even if it's somewhat slow and off Hackage, buried in the ekmett/coda repo.

In my particular case I wound up using a Lazy version of the Data.Set library to achieve that, but it also led to the original code for computing sums and products of sets.

ekmett avatar Jan 12 '23 15:01 ekmett

Hi,

constructing DFAs from NFAs

Sure - but one would want to build only the accessible part of the automaton? Something like https://gitlab.imn.htwk-leipzig.de/autotool/all0/-/blob/master/fa/src/Autolib/NFA/Det.hs#L30 (note: hull) Since you mention "lazy", perhaps that has the same effect.

Is there a (Haskell) high-performance automata library somewhere? Mine certainly is not. I think that LTL model checkers have such code (for omega automata), e.g., https://spot.lre.epita.fr/

[EDIT] this looks nice https://hackage.haskell.org/package/automata-0.1.0.0/docs/src/Automata.Internal.html#toDfsa (and it also traverses only reachable states, by my reading)

But anyway - building the powerset is a nice puzzle problem.

jwaldmann avatar Jan 12 '23 16:01 jwaldmann

So I benchmarked my implementation, not very thoroughly, and it does worse than the current algorithm. Which is not too surprising.

As far as I understand, the only intermediate thing should be this "array of sets, indexed by bitmasks"?

I believe the middleBits array would also stick around as long as sets, since the sets in sets depend on it.

There must be something in Hacker's Delight, HAKMEM, or https://graphics.stanford.edu/~seander/bithacks.html

Thanks for the pointers, I just looked them up but they seem to have nothing about this. I did find a relevant post by searching: https://www.keithschwarz.com/binary-subsets/, but their algorithm is no better than ours.

I will hack on my algorithm's constant factor later and make a PR if it's promising. I'm also interested to see @treeowl's approach realized.

meooow25 avatar Jan 12 '23 17:01 meooow25

@meooow25 That's disappointing; I'd have expected an automatic performance boost from just reducing the result size. Two things to consider:

  1. Make sortedMasks a right fold instead of a left fold. I'm not sure that'll help, but it probably won't hurt.
  2. More significantly, don't use fromDistinctAscList. You're probably better off leaving out the empty set (to add later) and writing a custom function to build a full tree of a known size.

treeowl avatar Jan 12 '23 17:01 treeowl

One more question: aside from gauge benchmarks, did you try informal ones that calculate just one power set? That should let you use bigger inputs.

treeowl avatar Jan 12 '23 17:01 treeowl

a custom function to build a full tree of a known size.

this sounds useful in general? (fromDistinctAscListN, like https://hackage.haskell.org/package/vector-0.13.0.0/docs/Data-Vector.html#v:fromListN)

jwaldmann avatar Jan 12 '23 19:01 jwaldmann

a custom function to build a full tree of a known size.

this sounds useful in general? (fromDistinctAscListN, like https://hackage.haskell.org/package/vector-0.13.0.0/docs/Data-Vector.html#v:fromListN)

I agree. How should that work? Make a canonical skew binary representation and then merge the digits?
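One possible shape, sketched on a standalone tree type rather than Data.Set's internals (fromDistinctAscListN is a hypothetical name, not a containers API): recurse on the size, giving the smaller half to the left subtree, and thread the remaining list through.

```haskell
-- Build a size-balanced search tree from a strictly ascending list of
-- known length, with no rebalancing and no length scan.
data Tree a = Tip | Bin Int a (Tree a) (Tree a)  -- size, root, left, right

fromDistinctAscListN :: Int -> [a] -> Tree a
fromDistinctAscListN n xs = fst (go n xs)
  where
    go :: Int -> [a] -> (Tree a, [a])
    go 0 ys = (Tip, ys)
    go k ys =
      let lsz        = (k - 1) `div` 2       -- left subtree gets the smaller half
          (l, y:ys') = go lsz ys             -- build left; root is the next element
          (r, ys'')  = go (k - 1 - lsz) ys'  -- build right from what remains
      in (Bin k y l r, ys'')

-- In-order traversal, to check the elements come back out in order.
toAscList :: Tree a -> [a]
toAscList Tip           = []
toAscList (Bin _ x l r) = toAscList l ++ [x] ++ toAscList r
```

For N a power of two (well, 2^n - 1 once the empty set is split off), every split is exact and the result is a perfect tree.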

treeowl avatar Jan 12 '23 19:01 treeowl

Well, you said "writing a custom function", so you must have thought of something :-) but that might be easier for N = a power of two (what we need here) than for the general case. But then - we could just build a Braun tree? (Okasaki, JFP 1997, https://doi.org/10.1017/S0956796897002876). We couldn't be any more balanced than that?

jwaldmann avatar Jan 12 '23 20:01 jwaldmann