
vec_duplicate_detect(x, ignore = none/first/last)

Open DavisVaughan opened this issue 3 years ago • 6 comments

Inspired by the keep argument of https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html

  • vec_duplicate_detect() currently detects both the first and all subsequent duplicate values.
  • duplicated() detects only the subsequent duplicate values.
  • duplicated(fromLast = TRUE) detects only the subsequent duplicate values when scanning from the back, so the last occurrence is not flagged.

It would be neat if we could support all 3 variations (or at least the first 2), since it is sometimes ambiguous what "duplicate" means here.

I could see an ignore = c("none", "first", "last") argument being useful here.
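
For concreteness, a rough sketch of how the three variants would map onto base R (the ignore argument itself is only the proposal above, nothing is implemented yet):

x <- c("a", "b", "a", "c", "a")

# ignore = "none": flag every element that has a duplicate anywhere
# (the current vec_duplicate_detect() behavior)
duplicated(x) | duplicated(x, fromLast = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

# ignore = "first": skip the first occurrence, flag the rest
duplicated(x)
#> [1] FALSE FALSE  TRUE FALSE  TRUE

# ignore = "last": skip the last occurrence, flag the rest
duplicated(x, fromLast = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE FALSE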

It looks like it might just require modifying this loop to also decrement p_val[hash] at each iteration: https://github.com/r-lib/vctrs/blob/006af2ef1ccfe9caa2e79f4b9d07f8380e1ffae6/src/dictionary.c#L764-L766
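
To illustrate the counting idea in plain R (a hypothetical detect_ignore_last() helper, not the actual C implementation): count the occurrences of each value up front, then walk forward decrementing; a remaining count greater than zero means a later occurrence exists.

detect_ignore_last <- function(x) {
  # Named integer vector of occurrence counts per value
  counts <- c(table(x))
  out <- logical(length(x))
  for (i in seq_along(x)) {
    key <- as.character(x[[i]])
    # Decrement at each iteration; if anything remains, a later
    # duplicate of this value exists
    counts[[key]] <- counts[[key]] - 1
    out[[i]] <- counts[[key]] > 0
  }
  out
}

detect_ignore_last(c("a", "b", "a", "c", "a"))
#> [1]  TRUE FALSE  TRUE FALSE FALSE

This reproduces duplicated(x, fromLast = TRUE) in a single forward pass, which is roughly what the ignore = "last" case would need at the C level.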

DavisVaughan avatar Sep 02 '20 15:09 DavisVaughan

Oh, I sort of added this already with the unmerged vec_duplicate_flg() (which was a bad alias for vec_duplicate_detect()): https://github.com/r-lib/vctrs/pull/764/files#diff-786a8b6fcb86825de6ab1d7cd2b2c6abR492

DavisVaughan avatar Sep 02 '20 15:09 DavisVaughan

It should be vec_detect_duplicate(), right?

lionel- avatar Sep 02 '20 15:09 lionel-

Eventually yes!

DavisVaughan avatar Sep 02 '20 15:09 DavisVaughan

I created an issue and some code long ago (I thought it was here, but apparently not) about better tools for duplicate management.

The piece that really feels missing to me is a function that returns labels reflecting the "duplicate groups". From that you can easily build all possible methods for dealing with the duplication. As a data analyst, the thing that frustrates me about only having TRUE/FALSE information re: being a duplicate is that for any given observation you don't know which other observation(s) it duplicates. That might be technically enough from a programming POV, but it is not nearly enough when you are exploring/cleaning data.

I was literally struggling with this over the weekend, trying to find duplicate talk submissions for rstudio::global(). Tackling that in the spreadsheet was more successful than in R, which offers no support for looking at "duplicate groups":

x  y  "duplicate group"
a  1  1
a  2  2
b  1  3
d  3  4
a  2  2
a  1  1
b  3  5
d  3  4

Sorry for barging in if this is a very focused PR, but I have a lot of thoughts about duplicates! It feels like these groups have to be formed at some low level, but there's no way to get at that information from R. It is thrown away and we just get TRUE/FALSE back.

jennybc avatar Sep 02 '20 16:09 jennybc

We do have a few other tools that can do this!

library(vctrs)

df <- data.frame(
  x = c("a", "a", "b", "d", "a", "a", "b", "d"),
  y = c(1, 2, 1, 3, 2, 1, 3, 3)
)

# Location of the first occurrence of each value
# Better name - `vec_locate_initial()`
vec_duplicate_id(df)
#> [1] 1 2 3 4 2 1 7 4

# Identify groups of values in order of appearance
# Better name - `vec_identify_groups()`
vec_group_id(df)
#> [1] 1 2 3 4 2 1 5 4
#> attr(,"n")
#> [1] 5

Created on 2020-09-02 by the reprex package (v0.3.0.9001)
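
Building on that, a small sketch (not part of the original reprex) showing how the group ids recover Jenny's whole "duplicate groups", i.e. each observation together with the observation(s) it duplicates:

id <- vec_group_id(df)

# Keep every row whose group occurs more than once
df[id %in% id[duplicated(id)], ]
#>   x y
#> 1 a 1
#> 2 a 2
#> 4 d 3
#> 5 a 2
#> 6 a 1
#> 8 d 3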

DavisVaughan avatar Sep 02 '20 16:09 DavisVaughan

Related to #1452, since we could use vec_duplicate_detect(x, ignore = "first") and vec_duplicate_detect(x, ignore = "last") there to match anyDuplicated(x) and anyDuplicated(x, fromLast = TRUE).
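
For reference, a base R illustration (not from the thread) of what anyDuplicated() reports: an index rather than a logical vector.

x <- c("a", "b", "b", "a")

# Index of the first duplicated entry, scanning forward (0 if none)
anyDuplicated(x)
#> [1] 3

# The logical pattern that ignore = "first" would reproduce
duplicated(x)
#> [1] FALSE FALSE  TRUE  TRUE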

DavisVaughan avatar Sep 21 '21 14:09 DavisVaughan