purescript-arrays Use Ord by default for all set-like operations

We already made nub use Ord by default in #91, but there are a few other unnecessarily quadratic functions in this library still which probably deserve the same treatment. I think these are union, intersect, and difference.

As well as being quadratic, union and difference have a strange-seeming behaviour where duplicates are only preserved in the first argument:

union [0, 0] [1, 1] == [0, 0, 1]

See #56 for more info. We ought to be careful about changing this behaviour but I think we should consider it, since I think it’s very likely to trip people up and we will be making breaking changes to these functions anyway if we do this.

Dec 18 '20 05:12 hdgarrood

If we go ahead with this, the same changes should propagate through to the lists library too.

Dec 18 '20 06:12 hdgarrood

duplicates are only preserved in the first argument

Yes, this seems really strange, even if it does match how things are done in Haskell. I'm curious if any projects rely on this asymmetric duplicate-preserving behavior.

If we match Set behavior and remove all duplicates, then we get simpler code (just append then nub), and we can do property-based testing against Set.

Another option is to thin-out these libraries a bit and recommend using Set instead. Potential downsides are inconvenience, loss of ordering, and worse performance (would be interesting to benchmark). And we'd still need the Ord-free versions, since those aren't Set-compatible.

Perhaps we should open a discourse thread to discuss these proposals, since this affects multiple libraries.

Dec 18 '20 19:12 milesfrain

The main reason I don't really want to just deprecate these functions and tell people to just use sets instead is that arrays have ordering, whereas sets don't, which means that the set-like Array/List operations can preserve ordering, and this might be useful in some cases.

Dec 19 '20 09:12 hdgarrood

This is the last issue in this repo that doesn't currently have a PR opened for it.

Dec 30 '20 18:12 JordanMartinez

I'm thinking we should table all these container (List, Array, etc.) updates for 0.15. I think it would be more efficient to do a systematic update of all our container libraries, where we start with a proposal of what we want our final result to be, rather than through a bunch of piecemeal PRs.

Dec 30 '20 20:12 milesfrain

A few questions I have.

First, how is union different from <|>?

Second, in the following examples of difference/\\, what should the output be?

[1, 2, 3, 4, 3, 2] \\ [1]
-- I assume [2, 3, 4, 3, 2]

[1, 2, 3, 4, 3, 2] \\ [2]
[1, 2, 3, 4, 3, 2] \\ [2, 2]
-- It depends on whether each 2 in the right array
-- removes only one or all 2 element(s) in the left array

Dec 31 '20 20:12 JordanMartinez

how is union different from <|>?

union does a set-like union of the two arrays. <|> is the same as append/<>. union keeps any duplicates that already exist in the first array and removes duplicates from the second array. Keeping these duplicates seems like the lesser-of-evils, since you're forced to break one of these rules when applying set operations on non-sets:

a union b == b union a  -- we break this rule
a union empty == a      -- we follow this rule
a union a == a          -- we follow this rule
a union (a <> a) == a   -- we follow this rule

I think difference behaves reasonably for arrays too, but it's still somewhat of an imperfect set-like operation.

Example input/output:

  logShow $ union [1,2,3,3] [2,2,4,4]
  logShow $ [1,2,3,3] <|> [2,2,4,4]
  logShow $ [1,2,3,3] <> [2,2,4,4]
  logShow $ [1, 2, 3, 4, 3, 2] \\ [1]
  logShow $ [1, 2, 3, 4, 3, 2] \\ [2]
  logShow $ [1, 2, 3, 4, 3, 2] \\ [2, 2]

[1,2,3,3,4]
[1,2,3,3,2,2,4,4]
[1,2,3,3,2,2,4,4]
[2,3,4,3,2]
[1,3,4,3,2]
[1,3,4,3]

https://try.ps.ai/?gist=aea5bdaf7669d2fd00c8d80cecc01b38

Dec 31 '20 22:12 milesfrain

The main reason I don't really want to just deprecate these functions and tell people to just use sets instead is that arrays have ordering, whereas sets don't, which means that the set-like Array/List operations can preserve ordering, and this might be useful in some cases.

I think an ordered set would be even better. http://hackage.haskell.org/package/ordered-containers-0.2.2/docs/Data-Set-Ordered.html But no objections to keeping the set-like operations for arrays.

Dec 31 '20 22:12 milesfrain

ordered set would be even better.

Isn't better subjective to the problem one is solving?

Dec 31 '20 22:12 JordanMartinez

Better if we assume these requirements:

maintain ordering
not having any weirdness with set-like operations

Of course, there can be other requirements, but I didn't see anything else mentioned.

Dec 31 '20 23:12 milesfrain

a union b == b union a  -- we break this rule
a union empty == a      -- we follow this rule
a union a == a          -- we follow this rule
a union (a <> a) == a   -- we follow this rule

logShow $ union [1,2,3,3] [2,2,4,4]

Thanks @milesfrain. These examples helped! Pretty sure #203 is ready to go now.

Dec 31 '20 23:12 JordanMartinez

Should other containers than sets behave like sets during operations like union, intersect etc?

I personally would assume that an-order preserving container like Array would behave like this:

 -- union is <>
[ 0, 0, 1, 1 ] `union` [ 2, 2, 1 ] == [ 0, 0, 1, 1 ] <> [ 2, 2, 1 ] == [ 0, 0, 1, 1, 2, 2, 1 ]

-- intersect - occurrences are matched
[ 0, 0, 1, 1, 2, 2 ] `intersect` [ 1, 2, 2 ] == [ 1, 2, 2 ]

-- difference - occurrences are matched like `intersect`
[ 0, 0, 1, 2, 2 ] `difference` [ 1, 1, 2 ] == [ 0, 0, 2 ]

All preserving order so union isn't commutative. But the idea is that since these containers do preserve order, and allow duplicates, unlike the set container, then it makes sense that union isn't commutative. And I'd say that this interpretation of the operations makes sense cause of this. But I'm no expert in math so... there's that too.

Jan 14 '21 11:01 razcore-rad

union is <>

If that is the case, then I don't think we should offer union at all.

But there is another option for union that builds on your matching proposal for intersect and difference:

union a b = difference a b <> intersect a b <> difference b a

a = [0, 0, 1, 1, 2, 2]
b = [1, 2, 2, 3]

difference a b == [0, 0, 1]
intersect a b == [1, 2, 2]
difference b a == [3]

union a b == [0, 0, 1, 1, 2, 2, 3]
union b a == [3, 1, 2, 2, 0, 0, 1]
sort(union a b) == sort(union b a)

We could include index tracking and recovery to preserve ordering, so:

union b a == [1, 2, 2, 3, 0, 0, 1]

Edit: I realize my above definition could be simplified to:

union a b = a <> difference b a

Which appears to be pretty close to the existing behavior of:

union a b = a <> nub (difference b a)

I think the nub-free version is better-behaved.

But my number-one preference is to eliminate these set-like functions from Array, which rely on nonintuitive matching, and add an ordered set collection.

Jan 14 '21 20:01 milesfrain

I think there's a strong argument to be made that these functions only make sense for sets so I'm not opposed to removing them for array. I was just offering what I thought too be a better intuition for theirs functionality for this data type. And union as you point out could be improved.

As for ordered set are you talking about this? https://pursuit.purescript.org/packages/purescript-ordered-set/0.4.0

Jan 15 '21 06:01 razcore-rad

As for ordered set are you talking about this? https://pursuit.purescript.org/packages/purescript-ordered-set/0.4.0

Oh, nice! Didn't find that on my initial survey.

Jan 15 '21 07:01 milesfrain

As far as I'm aware, in all of these discussions, we haven't yet had one person indicate that they're currently using union, intersection, or difference, so I am starting to come around to removing them too.

Jan 15 '21 17:01 hdgarrood

We use intersection and difference in a few places. I don't think we use union.

Jan 15 '21 17:01 natefaubion

Ah, fantastic. I'm curious: Would you miss them if we removed them? Would you notice if their behaviour around the handling of duplicates changed? Would it be awkward to refactor that code to use the ordered-set package instead?

Jan 15 '21 17:01 hdgarrood

(Just to be clear, this isn't sarcastic: I'm pleased that we can hear from someone who is actually using these functions)

Jan 15 '21 17:01 hdgarrood

I think we are assuming there are no duplicates, we just need the ordering and convenience of Arrays, likely. I don't think we should remove them without having a recommended replacement. We have not had any issues with the existing implementation.

Jan 15 '21 17:01 natefaubion

I think the discussion is more about the way they should behave, like if we were to take the concepts from scratch, what would be the "natural" behavior. Not that they can be used the way they are implemented as is, cause sure we can use them as they are, as long as they're consistent in some way we can find usages.

It strikes me that the current behavior is weird, that is removing duplicates from one array and not the other in some operations.

Don't you think that your particular case is better covered by purescript-ordered-set instead?

Jan 16 '21 08:01 razcore-rad

The problem with suggesting ordered-set as an alternative is that a) it’s a third party library and it’s hard to be sure how well maintained it’s going to be, and b) that package provides no O(1) conversion to Array, and converting from Array is necessarily O(n log n); if these functions work for arrays already no conversion is necessary. The obvious response to a) is that we should bring ordered-set into core but that’s not appealing to me either, as a huge part of the reason it’s taken us so long to get 0.14 released is that there is so much stuff in core already.

Since this is such a widely used module, and as long as we can’t identify one ideal way the Array functions should behave, I think my preference would now be to leave the behaviour alone, but change them to use Ord for better performance in the next round of breaking changes.

Jan 16 '21 12:01 hdgarrood

Following up with this, what needs to change?

Mar 25 '22 02:03 JordanMartinez

It would require someone rewriting these functions in terms of Ord. Without anyone volunteering to do that, we should defer this.

Mar 25 '22 03:03 natefaubion

Ah, I remember this now. I tried to resolve this issue in #203 but stopped because performance suffered greatly: https://github.com/purescript/purescript-arrays/pull/203#issuecomment-758013744

Mar 25 '22 15:03 JordanMartinez

purescript-arrays purescript-arrays copied to clipboard

Use Ord by default for all set-like operations

purescript-arrays
purescript-arrays copied to clipboard