explorer
explorer copied to clipboard
`Series.equal` behaviour for arbitrarily sized Series
Hey folks,
I was doing some tests yesterday and realized that - different than I would expect - making comparisons using Series.equal for any RHS in which the length is not 1 or exactly the same length as LHS would raise an error.
Some context:
From documentation
#Same sized Series -> OK
iex > s1 = Explorer.Series.from_list([1, 2, 3])
iex > s2 = Explorer.Series.from_list([1, 2, 4])
iex > Explorer.Series.equal(s1, s2)
#Explorer.Series<
boolean[3]
[true, true, false]
>
#RHS with length 1 -> OK
iex > s = Explorer.Series.from_list([1, 2, 3])
iex > Explorer.Series.equal(s, 1)
#Explorer.Series<
boolean[3]
[true, false, false]
>
But if I try to compare with an arbitrarily sized Series, it won't work:
iex > s = Series.from_list([1,2,3])
iex > Series.equal(s, Series.from_list([2,3]))
** (ErlangError) Erlang error: :nif_panicked
(explorer 0.1.0-dev) Explorer.PolarsBackend.Native.s_eq(shape: (3,)
Series: '' [i64]
[
1
2
3
], shape: (2,)
Series: '' [i64]
[
2
3
])
(explorer 0.1.0-dev) lib/explorer/polars_backend/shared.ex:14: Explorer.PolarsBackend.Shared.apply_native/3
In R, the comparison occur on some of the matches, but raises a warning
> s <- c(1,2,3)
> s == c(1,2)
[1] TRUE TRUE FALSE
Warning message:
In s == c(1, 2) :
longer object length is not a multiple of shorter object length
> s == c(2,3)
[1] FALSE FALSE FALSE
Warning message:
In s == c(2, 3) :
longer object length is not a multiple of shorter object length
> s == c(1,3)
[1] TRUE FALSE FALSE
Warning message:
In s == c(1, 3) :
longer object length is not a multiple of shorter object length
I think that the behaviour in R is not clear, but I think we should keep the good practice of raising nice errors and tell the user whether the comparison MUST occur with equal sized Series or actually make the comparison and return the equivalent mask.
What do you think should happen? I would prefer to raise, but with a better error message.
What do you think should happen? I would prefer to raise, but with a better error message.
Sorry! I sent the message without the full content. It must be clearer now.
Hi @halian-vilela, nice to see you here!
I do not really think that we should add this type of behaviour. It makes the code unpredictable and forces the user to always check the Series' size before making comparisons. It's not one the best features that R has.
If you check the vctrs package (which is a modern tidyverse internal library that makes the work with vectors in R more consistent) you'll see that its vctrs::vec_equal function does not support comparison of vectors that are differently sized. I think that in this type of situation raising is better than just throwing an warning.
See:
x <- c(1, 2, 3)
y <- c(1, 2)
vctrs::vec_equal(y, x)
Error: Can't recycle `..1` (size 2) to match `..2` (size 3).
Run `rlang::last_error()` to see where the error occurred.
But if you try to compare vectors of equal sizes, it works perfectly:
a <- c(1, 2, 3)
b <- c(1, 2, 3)
vctrs::vec_equal(a, b)
[1] TRUE TRUE TRUE
EDIT: but yes, we need a better error message in this situation.
I agree, we need a better error message. That's the way to go.
Hi @kimjoaoun, happy to be here, thanks! 😄
Well, let me expand a little bit on what I observed about possible behaviours and how R deal with them.
We could divide these comparisons into 4 categories (I'm excluding, for brevity, obviously wrong arguments like missing arguments or different data types):
Case 1 (expected):
Equal Lengths
s <- c(1,2,3,10)
s1 <- c(1,2,3,10)
s == s1
[1] TRUE TRUE TRUE TRUE
Case 2 (expected):
n-length vs. 1-length
1-length will be recycled and compared element-wise
s <- c(1,2,3,10)
s1 <- 10
s == s1
[1] FALSE FALSE FALSE TRUE
Case 3 (not so expected but dealt relatively well):
n-length vs. m-length, where m < n and n is a multiple of m
m-lengthwill be recycled but it could yield unpredictable results
s <- c(1,2,3,10)
s1 <- c(1,10)
s == s1
[1] TRUE FALSE FALSE TRUE
One must keep in mind that
s1will be recycled toc(1,10,1,10)
Case 4 (worst scenario):
n-length vs. m-length, where m < n and n is not a multiple of m
m-lengthwill be recycled but there will be orphaned elements in LHS - a warning will be thrown.
s <- c(1,2,3,10)
s1 <- c(1,2,10)
s == s1
[1] TRUE TRUE FALSE FALSE
Warning message:
In s == s1 :
longer object length is not a multiple of shorter object length
As we've already said and @kimjoaoun demonstrated, we should not reproduce cases 3 and 4 and instead raise an error.
I would like to see an error message as simple as:
** (ArgumentError) cannot compare different sized series - make sure that RHS has length 1 or #{length(left)}
On the other hand, I have mixed feelings about the sentence "different sizes series", since RHS with length 1 makes it different from LHS. An alternative would be.
** (ArgumentError) series could not be compared - make sure that RHS has length 1 or #{length(left)}
Does that make sense? Should this kind of error also be applicable in all other element-wise comparison functions?
Hey guys, since that was labeled as a good first issue, I'm trying to implement the conditions to raise the correct errors, hope this is ok for you all.
So far so good, but there are some points I would like to argue before moving on:
-
Do you think it's clear enough to mention only RHS in the error? I mean, we will always have the LHS as the reference, so I think it's enough to ask the user to check the length of just RHS to allow the comparison.
-
I've just stumbled upon some inconsistencies on dealing with Series created from empty (all nil) lists.
When testing in the notebook provided by the library I get an error telling me that the Serie cannot be created from all nil lists:
Series.from_list([])
** (ArgumentError) cannot make a series from a list of all nils
(explorer 0.1.0-dev) lib/explorer/shared.ex:83: Explorer.Shared.check_types!/1
(explorer 0.1.0-dev) lib/explorer/series.ex:124: Explorer.Series.from_list/2
But, if I try that directly in the terminal (iex) in my local version forked from github I get a different result:
iex(61)> Series.from_list([])
#Explorer.Series<
float[0]
[]
I've checked the git blame of the file and just found out that it was changed yesterday, so I'm just making sure that this modification is 100% ok before moving on because I'll have to add another case to treat empty series.
If we can show the length for both left and right sides, then that’s even better. And yes, the new handling of empty lists is on purpose!
When testing in the notebook provided by the library I get an error telling me that the Serie cannot be created from all nil lists:
@halian-vilela this is probably due to cache. Please try to pass the force: true option to your Mix.install call.
Hi @philss ! Thanks for the info. I didn't try to reload the notebook yet because I'm using it as a source of how the original code was working. As I'm still messing around with the errors and how Series.equal/2 deals with lengths, it's nice to have a comparison and it was already there, loaded in the browser! 🤣
I'm planning to send a PR on this stuff today, I'm just polishing some last things and then we could discuss this further...
Hey folks, some update here.
I've menaged to treat Series.equal/2 length issue and add some behaviours for nicer error messages. But looking at all the other comparison functions, I realized that all of them will also need similar modifications to deal with error.
As this seems to be a major modification of all of them, I'd like to send a PR first so that you guys could review and give some feedbacks on my approach and then, once we all agree, I could move on and follow with the patching of the rest of them.
Would this work like this? If yes, I'll submit a PR dealing only with Series.equal/2 and if that's ok I'll make incremental commits to add the validation to all of the other functions.
Hi @nallwhy! Since you were investigating this area, you would like to send a PR to address this? What do you think?
@josevalim Sure!
I've menaged to treat Series.equal/2 length issue and add some behaviours for nicer error messages. But looking at all the other comparison functions, I realized that all of them will also need similar modifications to deal with error.
So is it about the common length rule(?) for ops between series right? not just about equal op?
Yes, it is a common rule related to broadcasting. If a single element, it applies to all. If more than one, the lengths should match otherwise it raises.
I see!