vctrs
vctrs copied to clipboard
Implement vec_arg_min and vec_arg_max
And use in min() and max() methods
I now think it would be useful to have:
# error if size 0, requires size >0 to be able to return a single location back
vec_arg_min(x)
vec_arg_max(x)
# Returns a size 1 value that represents the maximum/minimum value for a particular ptype
# - generic
# - default would cast Inf/-Inf to the type of x
# - clock would override this to return the max/min possible date-time values
# - data frames would map(x, vec_ptype_maximum) over the columns to get a 1 row result
vec_ptype_maximum(x)
vec_ptype_minimum(x)
vec_min(x)
vec_max(x)
# the above two are implemented as:
vec_min <- function(x) {
if (vec_size(x)) {
vec_slice(x, vec_arg_min(x))
} else {
vec_ptype_minimum(x)
}
}
I'm currently at a place where I'm using min() and max() in a "generic" way and would need them to even be generic over data frames, so vec_min/max() would be useful.
I'm not sure what the vec_ptype_minimum() of a character vector is. min(character()) returns NA_character_ and is documented to do so, but even the docs admit that that is strange. Maybe that should be an error in our vec_ptype_minimum.character method.
I also like that Python's min() function has a default argument
https://docs.python.org/3/library/functions.html#min
So maybe its something like:
vec_min <- function(x, ..., na_rm = FALSE, empty = NULL) {
check_dots_empty0(...)
if (na_rm) {
x <- vec_slice(x, !vec_equal_na(x))
}
if (is.null(empty)) {
empty <- vec_ptype_maximum(x)
} else {
empty <- vec_cast(empty, x)
vec_assert(empty, size = 1L)
}
if (vec_is_empty(x)) {
empty
} else {
vec_slice(x, vec_arg_min(x))
}
}
I think the empty argument would really come in handy with group_by() + summarize(). My wife just found herself in a situation where she wanted to compute the min date per group, but many of her dates were missing. She wanted to remove the missing dates and still compute the min on any data that was left, but she wanted to retain an NA if the group was entirely composed of NAs (rather than getting an Inf date).
suppressPackageStartupMessages({
library(dplyr)
library(vctrs)
})
#> Warning: package 'dplyr' was built under R version 4.1.2
df <- tibble(g = c(1, 1, 2), date = new_date(c(0, NA, NA)))
df_min <- df %>%
group_by(g) %>%
summarise(date = min(date, na.rm = TRUE))
#> Warning in min.default(structure(NA_real_, class = "Date"), na.rm = TRUE): no
#> non-missing arguments to min; returning Inf
# the print method lies!
df_min
#> # A tibble: 2 × 2
#> g date
#> <dbl> <date>
#> 1 1 1970-01-01
#> 2 2 NA
# EW!
unclass(df_min$date)
#> [1] 0 Inf
is.na(df_min$date)
#> [1] FALSE FALSE
# this was actually what she wanted:
df_min <- df %>%
group_by(g) %>%
summarise(date = vec_min(date, na_rm = TRUE, empty = NA))
vec_min(x, empty = 1L) would probably be useful for https://github.com/tidyverse/dplyr/issues/6167#issuecomment-1024533145
Should vec_arg_min() be vec_min_loc()?
Probably vec_locate_min() but yea i dont like the python style argmin naming scheme
Another example where this would be helpful https://github.com/tidyverse/dplyr/issues/6703