vctrs icon indicating copy to clipboard operation
vctrs copied to clipboard

Rename `vec_duplicate_id()`, and new `vec_duplicate_loc()`

Open DavisVaughan opened this issue 4 years ago • 6 comments

Closes #763

If approved, this is what the changes suggested in #763 would look like.

DavisVaughan avatar Jan 16 '20 15:01 DavisVaughan

Okay @lionel- we now have:

vec_duplicate_any(x)
vec_duplicate_all(x)
vec_duplicate_flg(x, first = FALSE)
vec_duplicate_loc(x, first = FALSE)

With a nice relationship of:

which(vec_duplicate_flg(x)) == vec_duplicate_loc(x)
which(vec_duplicate_flg(x, first = TRUE)) == vec_duplicate_loc(x, first = TRUE)

And soft deprecation of:

  • vec_duplicate_id() -> vec_first_loc()
  • vec_duplicate_detect() -> vec_duplicate_flg(first = TRUE)

vec_duplicate_flg(first = TRUE) is also a decent bit faster than vec_duplicate_detect() because I store and reuse the hashes rather than looking them up twice.

library(nycflights13)
library(vctrs)

set.seed(123)

flights_with_repeats <- sample(nrow(flights), 1e6, TRUE)

# old

bench::mark(vec_duplicate_detect(flights_with_repeats), iterations = 100)
#> # A tibble: 1 x 6
#>   expression                                    min median `itr/sec` mem_alloc
#>   <bch:expr>                                 <bch:> <bch:>     <dbl> <bch:byt>
#> 1 vec_duplicate_detect(flights_with_repeats) 82.1ms 88.9ms      11.0    23.6MB
#> # … with 1 more variable: `gc/sec` <dbl>

# new

bench::mark(vec_duplicate_flg(flights_with_repeats, first = TRUE), iterations = 100)
#> # A tibble: 1 x 6
#>   expression                                               min median `itr/sec`
#>   <bch:expr>                                            <bch:> <bch:>     <dbl>
#> 1 vec_duplicate_flg(flights_with_repeats, first = TRUE) 39.6ms 42.1ms      23.2
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

And if you don't want the first occurrence to be considered a duplicate, it is even faster:

bench::mark(vec_duplicate_flg(flights_with_repeats, first = FALSE), iterations = 100)
#> # A tibble: 1 x 6
#>   expression                                                min median `itr/sec`
#>   <bch:expr>                                             <bch:> <bch:>     <dbl>
#> 1 vec_duplicate_flg(flights_with_repeats, first = FALSE) 25.6ms 30.6ms      32.4
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

DavisVaughan avatar Jan 17 '20 19:01 DavisVaughan

FWIW I really don't like the flg prefix

hadley avatar Jan 17 '20 20:01 hadley

Postponing until 0.3.0 until we can figure out suffix naming for flg

DavisVaughan avatar Jan 21 '20 15:01 DavisVaughan

It has come up elsewhere that we could also use the naming scheme:

  • vec_duplicate_detect -> vec_detect_duplicate()

  • vec_duplicate_id() -> vec_locate_first()

  • vec_duplicate_loc() -> vec_locate_duplicate()

(This nicely avoids vec_duplicate_flg())

I think this also lends itself well to the proposed change of vec_duplicate_any() to vec_any_duplicate(), and the addition of vec_all_duplicate() and vec_all_equal() (in the follow up PRs to this one).

Also should think about maybe:

  • vec_unique_loc() -> vec_locate_unique()

  • vec_unique_count() -> vec_count_unique()

  • vec_group_id() -> vec_identify_groups()

  • vec_group_loc() -> vec_locate_groups() (maybe vec_split_groups()?)

  • vec_group_rle() -> vec_rle()? It doesn't seem to fit with groups any more.

DavisVaughan avatar Mar 18 '20 15:03 DavisVaughan

I like this proposal.

lionel- avatar Mar 18 '20 16:03 lionel-

vec_locate_first() doesn't really suggest first duplicate to me. Do you mind making a googlesheet with a few of the proposals on it?

hadley avatar Mar 18 '20 17:03 hadley

Tracking in https://docs.google.com/spreadsheets/d/1lSrVbq6CQICpX1IQcamyPEhkcw232Um9TLWEjV00Kfw/edit#gid=0

We will come back to these as we have an explicit need for them, or for the 1.0 release

DavisVaughan avatar Sep 28 '22 14:09 DavisVaughan