dplyr
dplyr copied to clipboard
Multi-join suffixes
Hi,
I want to merge several data.frames with some common columns and append a suffix to the column names to keep track from where does the data for each column come from.
I can do it easily with the suffix term in the first full_join, but when I do the second join, no suffixes are added. I can rename the third data.frame so it has suffixes, but I wanted to know if there is another way of doing it using the suffix term.
Here is an example code:
x = data.frame(col1 = c("a","b","c"), col2 = 1:3, col3 = 1:3)
y = data.frame(col1 = c("b","c","d"), col2 = 4:6, col3 = 1:3)
z = data.frame(col1 = c("c","d","a"), col2 = 7:9, col3 = 1:3)
> df = full_join(x, y, by = "col1", suffix = c("_x","_y")) %>%
full_join(z, by = "col1", suffix = c("","_z"))
> df
col1 col2_x col3_x col2_y col3_y col2 col3
1 a 1 1 NA NA 9 3
2 b 2 2 4 1 NA NA
3 c 3 3 5 2 7 1
4 d NA NA 6 3 8 2
I was expecting that col2 and col3 from data.frame z would have a "_z" suffix. I have tried using empty suffixes while merging two data.frames and it works.
I can work around by renaming the columns in z before doing the second full_join, but in my real data I have several common columns, and if I wanted to merge more data.frames it would complicate the code. This is my expected output.
> colnames(z) = paste0(colnames(z),"_z")
> df = full_join(x, y, by = "col1", suffix = c("_x","_y")) %>%
full_join(z, by = c("col1"="col1_z"))
> df
col1 col2_x col3_x col2_y col3_y col2_z col3_z
1 a 1 1 NA NA 9 3
2 b 2 2 4 1 NA NA
3 c 3 3 5 2 7 1
4 d NA NA 6 3 8 2
I have seen other similar problems in which adding an extra column to keep track of the source data.frame is used, but I was wondering why does not the suffix term work with multiple joins.
PS: If I keep the first suffix empty, I can add suffixes in the second join, but that will leave the col2 and col3 form x without suffix.
> df = full_join(x, y, by = "col1", suffix = c("","_y")) %>%
full_join(z, by = "col1", suffix = c("","_z"))
> df
col1 col2 col3 col2_y col3_y col2_z col3_z
1 a 1 1 NA NA 9 3
2 b 2 2 4 1 NA NA
3 c 3 3 5 2 7 1
4 d NA NA 6 3 8 2
I might be missing why this is not working the way I expected, but in principle it should be very simple to add a suffix to the "newly added" columns.
Thanks!
I'm not sure this is in scope for dplyr, but it's come up a couple of times, so I think it's worth considering. Maybe we just need a function that takes a list of data frames, and adds the suffixes for you.
Very very roughly:
add_suffixes <- function(...) {
dfs <- list2(...)
no_name <- is.na(names2(dfs)
names(dfs)[no_name] <- letters[no_name]
names <- unlist(map(dfs, tbl_vars))
dups <- ...
# loop over each df, renaming any duplicates with the suffix
# return list of data frames.
}
@eggrandio My solution when I have a similar issue is to add the first suffix (_x
) only in the last join:
> df <- full_join(x, y, by = "col1", suffix = c("","_y")) %>%
full_join(z, by = "col1", suffix = c("_x","_z"))
> df
col1 col2_x col3_x col2_y col3_y col2_z col3_z
1 a 1 1 NA NA 9 3
2 b 2 2 4 1 NA NA
3 c 3 3 5 2 7 1
4 d NA NA 6 3 8 2
A question related to this issue is how to specify the suffix
option for multiple joins when they are done through purrr
function reduce
?
There is good support for this in the dm package. Note how it also can print the dm_rename()
calls needed to manually rename the columns to disambiguate:
library(conflicted)
library(dm)
a <- data.frame(col1 = c("a", "b", "c"))
x_tbl <- data.frame(col1 = c("a", "b", "c"), col2 = 1:3, col3 = 1:3)
y_tbl <- data.frame(col1 = c("b", "c", "d"), col2 = 4:6, col3 = 1:3)
z_tbl <- data.frame(col1 = c("c", "d", "a"), col2 = 7:9, col3 = 1:3)
dm <-
dm(a, x_tbl, y_tbl, z_tbl) %>%
dm_add_fk(a, col1, x_tbl, col1) %>%
dm_add_fk(a, col1, y_tbl, col1) %>%
dm_add_fk(a, col1, z_tbl, col1)
dm %>% dm_draw()
dm %>%
dm_disambiguate_cols()
#> Renaming ambiguous columns: %>%
#> dm_rename(a, col1.a = col1) %>%
#> dm_rename(x_tbl, col1.x_tbl = col1) %>%
#> dm_rename(x_tbl, col2.x_tbl = col2) %>%
#> dm_rename(x_tbl, col3.x_tbl = col3) %>%
#> dm_rename(y_tbl, col1.y_tbl = col1) %>%
#> dm_rename(y_tbl, col2.y_tbl = col2) %>%
#> dm_rename(y_tbl, col3.y_tbl = col3) %>%
#> dm_rename(z_tbl, col1.z_tbl = col1) %>%
#> dm_rename(z_tbl, col2.z_tbl = col2) %>%
#> dm_rename(z_tbl, col3.z_tbl = col3)
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `a`, `x_tbl`, `y_tbl`, `z_tbl`
#> Columns: 10
#> Primary keys: 0
#> Foreign keys: 3
dm %>%
dm_flatten_to_tbl(a, .join = full_join)
#> Renaming ambiguous columns: %>%
#> dm_rename(a, col1.a = col1) %>%
#> dm_rename(x_tbl, col1.x_tbl = col1) %>%
#> dm_rename(x_tbl, col2.x_tbl = col2) %>%
#> dm_rename(x_tbl, col3.x_tbl = col3) %>%
#> dm_rename(y_tbl, col1.y_tbl = col1) %>%
#> dm_rename(y_tbl, col2.y_tbl = col2) %>%
#> dm_rename(y_tbl, col3.y_tbl = col3) %>%
#> dm_rename(z_tbl, col1.z_tbl = col1) %>%
#> dm_rename(z_tbl, col2.z_tbl = col2) %>%
#> dm_rename(z_tbl, col3.z_tbl = col3)
#> col1.a col2.x_tbl col3.x_tbl col2.y_tbl col3.y_tbl col2.z_tbl col3.z_tbl
#> 1 a 1 1 NA NA 9 3
#> 2 b 2 2 4 1 NA NA
#> 3 c 3 3 5 2 7 1
#> 4 d NA NA 6 3 8 2
dm %>%
dm_disambiguate_cols() %>%
dm_flatten_to_tbl(a, .join = full_join)
#> Renaming ambiguous columns: %>%
#> dm_rename(a, col1.a = col1) %>%
#> dm_rename(x_tbl, col1.x_tbl = col1) %>%
#> dm_rename(x_tbl, col2.x_tbl = col2) %>%
#> dm_rename(x_tbl, col3.x_tbl = col3) %>%
#> dm_rename(y_tbl, col1.y_tbl = col1) %>%
#> dm_rename(y_tbl, col2.y_tbl = col2) %>%
#> dm_rename(y_tbl, col3.y_tbl = col3) %>%
#> dm_rename(z_tbl, col1.z_tbl = col1) %>%
#> dm_rename(z_tbl, col2.z_tbl = col2) %>%
#> dm_rename(z_tbl, col3.z_tbl = col3)
#> col1.a col2.x_tbl col3.x_tbl col2.y_tbl col3.y_tbl col2.z_tbl col3.z_tbl
#> 1 a 1 1 NA NA 9 3
#> 2 b 2 2 4 1 NA NA
#> 3 c 3 3 5 2 7 1
#> 4 d NA NA 6 3 8 2
Created on 2022-08-19 by the reprex package (v2.0.1)
Given that dm has such a nice solution, I think we don't need to implement something in dplyr.