dplyr
dplyr copied to clipboard
Add a `ptype` argument to `between()` since we can't make it cast to the type of `x` at this time
The result of using between() in filtering is not equivalent to evaluating a conditional expression when working with ordered factors.
> (x <- data.frame(Visit = c("Discharge", "Month 3", "Month 6", "Month 12", "Month 20")))
Visit
1 Discharge
2 Month 3
3 Month 6
4 Month 12
5 Month 20
> class(x$Visit)
[1] "character"
Now let's make it an ordered factor with assumed levels (used later for other operations):
> (x$Visit <- factor(x$Visit, levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"), ordered = TRUE))
[1] Discharge Month 3 Month 6 Month 12 Month 20
Levels: Baseline < Discharge < Month 3 < Month 6 < Month 12 < Month 20 < Premature
When filtering using the conjunction of conditions it works
> x %>% filter(Visit >= "Discharge" & Visit <= "Month 20")
Visit
1 Discharge
2 Month 3
3 Month 6
4 Month 12
5 Month 20
When using the convenient between(), it doesn't
> x %>% filter(between(Visit, "Discharge", "Month 20"))
Visit
1 Discharge
2 Month 12
3 Month 20
2 items are missing in the output.
R version 4.2.0 (2022-04-22 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045)
packageVersion("dplyr") [1] ‘1.1.1’
I think this is by design. Look at the documentation:
x, left, and right are all cast to their common type before the comparison is made.
But if you want the behavior, it could be tedious, see the following:
library(dplyr)
#>
#> 载入程辑包:'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
x <- data.frame(
Visit = factor(
c("Discharge", "Month 3", "Month 6", "Month 12", "Month 20"),
levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"),
ordered = TRUE
)
)
x |>
filter(
between(
Visit,
factor("Discharge", levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"), ordered = TRUE),
factor("Month 20", levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"), ordered = TRUE)
)
)
#> Visit
#> 1 Discharge
#> 2 Month 3
#> 3 Month 6
#> 4 Month 12
#> 5 Month 20
Created on 2023-08-10 with reprex v2.0.2
The logic behind this is just type casting used by vctrs::vec_cast_common(). You could use data.table::between() instead:
library(dplyr)
#>
#> 载入程辑包:'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
x <- data.frame(
Visit = factor(
c("Discharge", "Month 3", "Month 6", "Month 12", "Month 20"),
levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"),
ordered = TRUE
)
)
x |>
filter(data.table::between(Visit, "Discharge", "Month 20"))
#> Visit
#> 1 Discharge
#> 2 Month 3
#> 3 Month 6
#> 4 Month 12
#> 5 Month 20
Created on 2023-08-10 with reprex v2.0.2
Thank you very much for the quick explanation, @psychelzh! This explains everything.
But, at the end of the day, although I know the source of this behaviour, this is even more frustrating.
Because when someone knows how the BETWEEN operator works in, say, SQL, where it's equivalent to the >= A AND <= B condition and expects such equivalence elsewhere (because why not?), the result I showed is confusing incredibly.
If only a warning or error, like "unequal types of both arguments/operands/variables/columns" were printed (the same way the binding functions or joins already do), it could save a lot of time wasted on investigating weird results (provided someone even noticed the result is incomplete! This case was trivial, but I can imagine a definitely more complex one with multiple conditions).
Please, @hadley, consider at least adding a warning, if between() cannot be adjusted for cases like this. In many (if not most) places tidyverse is very strict, to prevent confusing results, yet here, where I wouldn't expect any problem, it's far too liberal, giving unexpected results.
Without a warning or error, the explanation in the manual may not suffice - I guess some of people will be surprised even after reading it, because when we read about "common types" we usually think: "OK, number & number, character & character, factor & factor", not necessarily "factor_#0493" and "factor_#0493". And in case like that, the behaviour is really surprising at first sight.
I do agree with this! In fact, I think {vctrs} sometimes really gets too strict or restricted. More frustration also comes from https://github.com/tidyverse/tidyr/issues/1483. I think loose behavior like base R, such as casting number to character, is sometimes absolutely acceptable.
This ultimately comes down to the fact that between() takes the common type of x, left, and right rather than doing a directional cast of left and right to the type of x.
In vctrs, the common type of factor and character is character (the more general type), so all inputs end up getting cast to character before the comparison.
I tried making between() stricter by making it cast left and right to the type of x, but it ended up breaking too much in the revdeps so we rolled it back:
https://github.com/tidyverse/dplyr/issues/6478
https://github.com/tidyverse/dplyr/pull/6482
You could make your own wrapper around between() that helps with this:
library(dplyr, warn.conflicts = FALSE)
x <- tibble(Visit = c("Discharge", "Month 3", "Month 6", "Month 12", "Month 20"))
x$Visit <- factor(x$Visit, levels = c("Baseline", "Discharge", "Month 3", "Month 6", "Month 12", "Month 20", "Premature"), ordered = TRUE)
x
#> # A tibble: 5 × 1
#> Visit
#> <ord>
#> 1 Discharge
#> 2 Month 3
#> 3 Month 6
#> 4 Month 12
#> 5 Month 20
between_cast_to_x <- function(x, left, right) {
args <- vctrs::vec_cast_common(left = left, right = right, .to = x)
left <- args$left
right <- args$right
between(x, left, right)
}
between(x$Visit, "Discharge", "Month 12")
#> [1] TRUE FALSE FALSE TRUE FALSE
between_cast_to_x(x$Visit, "Discharge", "Month 12")
#> [1] TRUE TRUE TRUE TRUE FALSE
I think the only thing we can do is consider exposing a ptype argument to between() to make doing a directional cast a little simpler. We already do this in other places where we take the common type, like coalesce(), so I think adding that argument here makes sense if we don't feel like we can make between() cast to the type of x (its the next best solution).