purescript-arrays
purescript-arrays copied to clipboard
Use Ord by default for all set-like operations
We already made nub
use Ord by default in #91, but there are a few other unnecessarily quadratic functions in this library still which probably deserve the same treatment. I think these are union
, intersect
, and difference
.
As well as being quadratic, union
and difference
have a strange-seeming behaviour where duplicates are only preserved in the first argument:
union [0, 0] [1, 1] == [0, 0, 1]
See #56 for more info. We ought to be careful about changing this behaviour but I think we should consider it, since I think it’s very likely to trip people up and we will be making breaking changes to these functions anyway if we do this.
If we go ahead with this, the same changes should propagate through to the lists
library too.
duplicates are only preserved in the first argument
Yes, this seems really strange, even if it does match how things are done in Haskell. I'm curious if any projects rely on this asymmetric duplicate-preserving behavior.
If we match Set
behavior and remove all duplicates, then we get simpler code (just append
then nub
), and we can do property-based testing against Set
.
Another option is to thin-out these libraries a bit and recommend using Set
instead. Potential downsides are inconvenience, loss of ordering, and worse performance (would be interesting to benchmark). And we'd still need the Ord
-free versions, since those aren't Set
-compatible.
Perhaps we should open a discourse thread to discuss these proposals, since this affects multiple libraries.
The main reason I don't really want to just deprecate these functions and tell people to just use sets instead is that arrays have ordering, whereas sets don't, which means that the set-like Array/List operations can preserve ordering, and this might be useful in some cases.
This is the last issue in this repo that doesn't currently have a PR opened for it.
I'm thinking we should table all these container (List
, Array
, etc.) updates for 0.15.
I think it would be more efficient to do a systematic update of all our container libraries, where we start with a proposal of what we want our final result to be, rather than through a bunch of piecemeal PRs.
A few questions I have.
First, how is union
different from <|>
?
Second, in the following examples of difference
/\\
, what should the output be?
[1, 2, 3, 4, 3, 2] \\ [1]
-- I assume [2, 3, 4, 3, 2]
[1, 2, 3, 4, 3, 2] \\ [2]
[1, 2, 3, 4, 3, 2] \\ [2, 2]
-- It depends on whether each 2 in the right array
-- removes only one or all 2 element(s) in the left array
how is
union
different from<|>
?
union
does a set-like union of the two arrays. <|>
is the same as append
/<>
.
union
keeps any duplicates that already exist in the first array and removes duplicates from the second array. Keeping these duplicates seems like the lesser-of-evils, since you're forced to break one of these rules when applying set operations on non-sets:
a union b == b union a -- we break this rule
a union empty == a -- we follow this rule
a union a == a -- we follow this rule
a union (a <> a) == a -- we follow this rule
I think difference
behaves reasonably for arrays too, but it's still somewhat of an imperfect set-like operation.
Example input/output:
logShow $ union [1,2,3,3] [2,2,4,4]
logShow $ [1,2,3,3] <|> [2,2,4,4]
logShow $ [1,2,3,3] <> [2,2,4,4]
logShow $ [1, 2, 3, 4, 3, 2] \\ [1]
logShow $ [1, 2, 3, 4, 3, 2] \\ [2]
logShow $ [1, 2, 3, 4, 3, 2] \\ [2, 2]
[1,2,3,3,4]
[1,2,3,3,2,2,4,4]
[1,2,3,3,2,2,4,4]
[2,3,4,3,2]
[1,3,4,3,2]
[1,3,4,3]
https://try.ps.ai/?gist=aea5bdaf7669d2fd00c8d80cecc01b38
The main reason I don't really want to just deprecate these functions and tell people to just use sets instead is that arrays have ordering, whereas sets don't, which means that the set-like Array/List operations can preserve ordering, and this might be useful in some cases.
I think an ordered set would be even better. http://hackage.haskell.org/package/ordered-containers-0.2.2/docs/Data-Set-Ordered.html But no objections to keeping the set-like operations for arrays.
ordered set would be even better.
Isn't better
subjective to the problem one is solving?
Better if we assume these requirements:
- maintain ordering
- not having any weirdness with set-like operations
Of course, there can be other requirements, but I didn't see anything else mentioned.
a union b == b union a -- we break this rule a union empty == a -- we follow this rule a union a == a -- we follow this rule a union (a <> a) == a -- we follow this rule logShow $ union [1,2,3,3] [2,2,4,4]
Thanks @milesfrain. These examples helped! Pretty sure #203 is ready to go now.
Should other containers than sets behave like sets during operations like union
, intersect
etc?
I personally would assume that an-order preserving container like Array
would behave like this:
-- union is <>
[ 0, 0, 1, 1 ] `union` [ 2, 2, 1 ] == [ 0, 0, 1, 1 ] <> [ 2, 2, 1 ] == [ 0, 0, 1, 1, 2, 2, 1 ]
-- intersect - occurrences are matched
[ 0, 0, 1, 1, 2, 2 ] `intersect` [ 1, 2, 2 ] == [ 1, 2, 2 ]
-- difference - occurrences are matched like `intersect`
[ 0, 0, 1, 2, 2 ] `difference` [ 1, 1, 2 ] == [ 0, 0, 2 ]
All preserving order so union
isn't commutative. But the idea is that since these containers do preserve order, and allow duplicates, unlike the set container, then it makes sense that union
isn't commutative. And I'd say that this interpretation of the operations makes sense cause of this. But I'm no expert in math so... there's that too.
union is <>
If that is the case, then I don't think we should offer union
at all.
But there is another option for union
that builds on your matching proposal for intersect
and difference
:
union a b = difference a b <> intersect a b <> difference b a
a = [0, 0, 1, 1, 2, 2]
b = [1, 2, 2, 3]
difference a b == [0, 0, 1]
intersect a b == [1, 2, 2]
difference b a == [3]
union a b == [0, 0, 1, 1, 2, 2, 3]
union b a == [3, 1, 2, 2, 0, 0, 1]
sort(union a b) == sort(union b a)
We could include index tracking and recovery to preserve ordering, so:
union b a == [1, 2, 2, 3, 0, 0, 1]
Edit: I realize my above definition could be simplified to:
union a b = a <> difference b a
Which appears to be pretty close to the existing behavior of:
union a b = a <> nub (difference b a)
I think the nub-free version is better-behaved.
But my number-one preference is to eliminate these set-like functions from Array, which rely on nonintuitive matching, and add an ordered set collection.
I think there's a strong argument to be made that these functions only make sense for sets so I'm not opposed to removing them for array. I was just offering what I thought too be a better intuition for theirs functionality for this data type. And union
as you point out could be improved.
As for ordered set are you talking about this? https://pursuit.purescript.org/packages/purescript-ordered-set/0.4.0
As for ordered set are you talking about this? https://pursuit.purescript.org/packages/purescript-ordered-set/0.4.0
Oh, nice! Didn't find that on my initial survey.
As far as I'm aware, in all of these discussions, we haven't yet had one person indicate that they're currently using union, intersection, or difference, so I am starting to come around to removing them too.
We use intersection
and difference
in a few places. I don't think we use union
.
Ah, fantastic. I'm curious: Would you miss them if we removed them? Would you notice if their behaviour around the handling of duplicates changed? Would it be awkward to refactor that code to use the ordered-set package instead?
(Just to be clear, this isn't sarcastic: I'm pleased that we can hear from someone who is actually using these functions)
I think we are assuming there are no duplicates, we just need the ordering and convenience of Arrays, likely. I don't think we should remove them without having a recommended replacement. We have not had any issues with the existing implementation.
I think the discussion is more about the way they should behave, like if we were to take the concepts from scratch, what would be the "natural" behavior. Not that they can be used the way they are implemented as is, cause sure we can use them as they are, as long as they're consistent in some way we can find usages.
It strikes me that the current behavior is weird, that is removing duplicates from one array and not the other in some operations.
Don't you think that your particular case is better covered by purescript-ordered-set instead?
The problem with suggesting ordered-set as an alternative is that a) it’s a third party library and it’s hard to be sure how well maintained it’s going to be, and b) that package provides no O(1) conversion to Array, and converting from Array is necessarily O(n log n); if these functions work for arrays already no conversion is necessary. The obvious response to a) is that we should bring ordered-set into core but that’s not appealing to me either, as a huge part of the reason it’s taken us so long to get 0.14 released is that there is so much stuff in core already.
Since this is such a widely used module, and as long as we can’t identify one ideal way the Array functions should behave, I think my preference would now be to leave the behaviour alone, but change them to use Ord for better performance in the next round of breaking changes.
Following up with this, what needs to change?
It would require someone rewriting these functions in terms of Ord. Without anyone volunteering to do that, we should defer this.
Ah, I remember this now. I tried to resolve this issue in #203 but stopped because performance suffered greatly: https://github.com/purescript/purescript-arrays/pull/203#issuecomment-758013744