dtplyr
dtplyr copied to clipboard
Add keep argument to joins
Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on https://stackoverflow.com/ or https://community.rstudio.com/.
Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex.
Brief description of the problem
# insert reprex here
The basic idea of how to solve this is we can use the j
term of [.data.table
to select the variables we want. I think this will work for left/right/inner joins, but some extra steps are required for full joins which I don't think will work with dtplyr's lazy workflow.
data.table uses x[i, j, by]
syntax. Noting that in a join we're using the x
and i
arguments, we can select inside j
to implement keep
. In the examples below note the x.
and i.
prefixes to help refer to the data.tables we're selecting from. Also note I'm assuming suffix = c(".x", ".y")
for these examples.
left_join()
library(data.table)
df1 <- data.table(a = c("a", "b", "c"), b = 1:3)
df2 <- data.table(a = c("a", "b"), c = 1:2)
# left_join with keep = TRUE
df2[df1, .(a.x = i.a, b = i.b, a.y = x.a, c = x.c), on = .(a), allow.cartesian = TRUE]
#> a.x b a.y c
#> <char> <int> <char> <int>
#> 1: a 1 a 1
#> 2: b 2 b 2
#> 3: c 3 <NA> NA
In the case of left joins this will be useful without keep
because now we can drop the setcolorder()
and setnames()
calls and just directly use this selection:
# left_join with keep = FALSE
df2[df1, .(a = i.a, b = i.b, c = x.c), on = .(a), allow.cartesian = TRUE]
#> a b c
#> <char> <int> <int>
#> 1: a 1 1
#> 2: b 2 2
#> 3: c 3 NA
inner_join()
/right_join()
For inner/right joins we'll only need to use the selection syntax if keep = TRUE
.
I'll just use an example of a right join here, but we can use the exact same selection syntax for inner joins.
# right_join with keep = TRUE
df1[df2, .(a.x = x.a, b = x.b, a.y = i.a, c = i.c), on = .(a), allow.cartesian = TRUE]
#> a.x b a.y c
#> <char> <int> <char> <int>
#> 1: a 1 a 1
#> 2: b 2 b 2
full_join()
I don't think we can implement this in dtplyr
because of the lazy workflow. The only way I could figure out how to get it to work required an intermediate data frame that gets distinct join keys from df1 and df2. For this to work we would need to run collect()
on both inputs, which would be very expensive in the case of a long pipe chain.
library(data.table)
df1 <- data.table(a = c("a", "b", "c"), b = 1:3)
df2 <- data.table(a = c("a", "b"), c = 1:2)
unique_keys_df <- unique(rbindlist(list(
df1[, .(a)],
df2[, .(a)]
)))
# right_join with keep = TRUE
# Note: a__keep__ comes from df2.
## Join column is preserved from unique_keys_df for join in the next step
step_df <- df2[unique_keys_df, .(a__keep__ = x.a, c = x.c, a = i.a), on = .(a)]
# Another right_join onto step_df
# Don't need join cols from step_df,
## but need to rename join cols originally from df2.
## I appended these join cols with __keep__ to show which ones are needed
df1[step_df, .(a.x = x.a, b = x.b, a.y = a__keep__, c = i.c), on = .(a)]
#> a.x b a.y c
#> <char> <int> <char> <int>
#> 1: a 1 a 1
#> 2: b 2 b 2
#> 3: c 3 <NA> NA
I used a helper in tidytable called join_selection()
to create the selection - that can be found here. I think we can use it here, we would just need to fit it into @mgirlich's join workflow.