vctrs
vctrs copied to clipboard
Rename `vec_duplicate_id()`, and new `vec_duplicate_loc()`
Closes #763
If approved, this is what the changes suggested in #763 would look like.
Okay @lionel- we now have:
vec_duplicate_any(x)
vec_duplicate_all(x)
vec_duplicate_flg(x, first = FALSE)
vec_duplicate_loc(x, first = FALSE)
With a nice relationship of:
which(vec_duplicate_flg(x)) == vec_duplicate_loc(x)
which(vec_duplicate_flg(x, first = TRUE)) == vec_duplicate_loc(x, first = TRUE)
And soft deprecation of:
-
vec_duplicate_id()
->vec_first_loc()
-
vec_duplicate_detect()
->vec_duplicate_flg(first = TRUE)
vec_duplicate_flg(first = TRUE)
is also a decent bit faster than vec_duplicate_detect()
because I store and reuse the hashes rather than looking them up twice.
library(nycflights13)
library(vctrs)
set.seed(123)
flights_with_repeats <- sample(nrow(flights), 1e6, TRUE)
# old
bench::mark(vec_duplicate_detect(flights_with_repeats), iterations = 100)
#> # A tibble: 1 x 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:> <bch:> <dbl> <bch:byt>
#> 1 vec_duplicate_detect(flights_with_repeats) 82.1ms 88.9ms 11.0 23.6MB
#> # … with 1 more variable: `gc/sec` <dbl>
# new
bench::mark(vec_duplicate_flg(flights_with_repeats, first = TRUE), iterations = 100)
#> # A tibble: 1 x 6
#> expression min median `itr/sec`
#> <bch:expr> <bch:> <bch:> <dbl>
#> 1 vec_duplicate_flg(flights_with_repeats, first = TRUE) 39.6ms 42.1ms 23.2
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
And if you don't want the first occurrence to be considered a duplicate, it is even faster:
bench::mark(vec_duplicate_flg(flights_with_repeats, first = FALSE), iterations = 100)
#> # A tibble: 1 x 6
#> expression min median `itr/sec`
#> <bch:expr> <bch:> <bch:> <dbl>
#> 1 vec_duplicate_flg(flights_with_repeats, first = FALSE) 25.6ms 30.6ms 32.4
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
FWIW I really don't like the flg
prefix
Postponing until 0.3.0 until we can figure out suffix naming for flg
It has come up elsewhere that we could also use the naming scheme:
-
vec_duplicate_detect
->vec_detect_duplicate()
-
vec_duplicate_id()
->vec_locate_first()
-
vec_duplicate_loc()
->vec_locate_duplicate()
(This nicely avoids vec_duplicate_flg()
)
I think this also lends itself well to the proposed change of vec_duplicate_any()
to vec_any_duplicate()
, and the addition of vec_all_duplicate()
and vec_all_equal()
(in the follow up PRs to this one).
Also should think about maybe:
-
vec_unique_loc()
->vec_locate_unique()
-
vec_unique_count()
->vec_count_unique()
-
vec_group_id()
->vec_identify_groups()
-
vec_group_loc()
->vec_locate_groups()
(maybevec_split_groups()
?) -
vec_group_rle()
->vec_rle()
? It doesn't seem to fit with groups any more.
I like this proposal.
vec_locate_first()
doesn't really suggest first duplicate to me. Do you mind making a googlesheet with a few of the proposals on it?
Tracking in https://docs.google.com/spreadsheets/d/1lSrVbq6CQICpX1IQcamyPEhkcw232Um9TLWEjV00Kfw/edit#gid=0
We will come back to these as we have an explicit need for them, or for the 1.0 release