vctrs
vec_duplicate_detect(x, ignore = none/first/last)
Inspired by the keep argument of https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html
- vec_duplicate_detect() currently detects the first and all subsequent duplicate values.
- duplicated() detects only the subsequent duplicate values.
- duplicated(fromLast = TRUE) detects the subsequent duplicate values, starting from the back.
It would be neat if we could support all 3 variations (or at least the first 2), as it is sometimes a bit ambiguous what "duplicate" means here. I could see an ignore = c("none", "first", "last") argument being useful here.
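For reference, the three behaviors can be emulated today in base R on atomic vectors (a quick sketch with duplicated(); vec_duplicate_detect() would also handle data frames and other vctrs types):

```r
x <- c(1, 2, 1, 3, 1)

# ignore = "none": flag every member of a duplicated set (current behavior)
duplicated(x) | duplicated(x, fromLast = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

# ignore = "first": flag all but the first occurrence (base duplicated())
duplicated(x)
#> [1] FALSE FALSE  TRUE FALSE  TRUE

# ignore = "last": flag all but the last occurrence
duplicated(x, fromLast = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE FALSE
```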
It looks like it might just require modifying this loop in some way to also decrement p_val[hash] at each iteration:
https://github.com/r-lib/vctrs/blob/006af2ef1ccfe9caa2e79f4b9d07f8380e1ffae6/src/dictionary.c#L764-L766
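A rough R analogue of that decrement idea (the real loop hashes values in C; `remaining` here stands in for the p_val counts, and the function name is made up):

```r
# Count occurrences up front, then decrement at each iteration; an element is
# a duplicate under ignore = "last" while copies of its value remain ahead.
detect_ignore_last <- function(x) {
  remaining <- as.list(table(x))
  out <- logical(length(x))
  for (i in seq_along(x)) {
    key <- as.character(x[[i]])
    remaining[[key]] <- remaining[[key]] - 1
    out[[i]] <- remaining[[key]] > 0
  }
  out
}

detect_ignore_last(c(1, 2, 1, 3, 1))
#> [1]  TRUE FALSE  TRUE FALSE FALSE
```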
Oh, I sort of added this already with the unmerged vec_duplicate_flg() (which was a bad alias for vec_duplicate_detect()):
https://github.com/r-lib/vctrs/pull/764/files#diff-786a8b6fcb86825de6ab1d7cd2b2c6abR492
It should be vec_detect_duplicate(), right?
Eventually yes!
I created an issue and some code long ago (I thought it was here, but apparently not) about better tools for duplicate management.
The piece that really feels missing to me is a function that returns labels reflecting the "duplicate groups". From that you can easily build all possible methods for dealing with the duplication. As a data analyst, the thing that frustrates me with only having TRUE/FALSE
information re: the quality of being a duplicate is that for any given observation you don't know which other observation(s) it duplicates. That might be technically enough from a programming POV, but is not nearly enough when you are exploring / cleaning data. I was literally struggling with this over the weekend, trying to find duplicate talk submissions for rstudio::global(). Tackling that in the spreadsheet was more successful than with R, which offers no support for looking at "duplicate groups".
x   y   "duplicate group"
a   1   1
a   2   2
b   1   3
d   3   4
a   2   2
a   1   1
b   3   5
d   3   4
Sorry for barging in, if this is a very focused PR. But I have a lot of thoughts about duplicates! It feels like these groups have to be formed at some low level, but there's no way to get at that information from R. It is thrown away and we just get TRUE/FALSE back.
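For what it's worth, those group labels can be built in base R by matching each row's key against the unique keys (a sketch; a paste()-based row key is fragile for real data, which is exactly why a proper tool would help):

```r
df <- data.frame(
  x = c("a", "a", "b", "d", "a", "a", "b", "d"),
  y = c(1, 2, 1, 3, 2, 1, 3, 3)
)
key <- paste(df$x, df$y)
df$group <- match(key, unique(key))  # label groups in order of first appearance
df$group
#> [1] 1 2 3 4 2 1 5 4
```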
We do have a few other tools that can do this!
library(vctrs)
df <- data.frame(
  x = c("a", "a", "b", "d", "a", "a", "b", "d"),
  y = c(1, 2, 1, 3, 2, 1, 3, 3)
)
# Location of the first occurrence of that value
# Better name - `vec_locate_initial()`
vec_duplicate_id(df)
#> [1] 1 2 3 4 2 1 7 4
# Identify groups of values in order of appearance
# Better name - `vec_identify_groups()`
vec_group_id(df)
#> [1] 1 2 3 4 2 1 5 4
#> attr(,"n")
#> [1] 5
Created on 2020-09-02 by the reprex package (v0.3.0.9001)
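And once you have the group ids, pulling out whole "duplicate groups" is straightforward (a sketch continuing from the reprex above):

```r
id <- vec_group_id(df)
dup_ids <- unique(id[duplicated(id)])  # group ids that occur more than once
df[id %in% dup_ids, ]                  # every row belonging to a duplicated group
#>   x y
#> 1 a 1
#> 2 a 2
#> 4 d 3
#> 5 a 2
#> 6 a 1
#> 8 d 3
```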
Related to #1452, since we could use vec_duplicate_detect(x, ignore = "first") and ignore = "last" there to match anyDuplicated(x) and anyDuplicated(x, fromLast = TRUE), respectively.
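To illustrate the anyDuplicated() behaviors being matched (base R only):

```r
x <- c(1, 2, 1, 3, 1)
anyDuplicated(x)                   # index of the first duplicated element
#> [1] 3
anyDuplicated(x, fromLast = TRUE)  # same question, scanning from the back
```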