dplyr icon indicating copy to clipboard operation
dplyr copied to clipboard

Multi-join suffixes

Open eggrandio opened this issue 3 years ago • 3 comments

Hi,

I want to merge several data.frames with some common columns and append a suffix to the column names to keep track from where does the data for each column come from.

I can do it easily with the suffix term in the first full_join, but when I do the second join, no suffixes are added. I can rename the third data.frame so it has suffixes, but I wanted to know if there is another way of doing it using the suffix term.

Here is an example code:

x = data.frame(col1 = c("a","b","c"), col2 = 1:3, col3 = 1:3)
y = data.frame(col1 = c("b","c","d"), col2 = 4:6, col3 = 1:3)
z = data.frame(col1 = c("c","d","a"), col2 = 7:9, col3 = 1:3)

> df = full_join(x, y, by = "col1", suffix = c("_x","_y")) %>% 
  full_join(z, by = "col1", suffix = c("","_z")) 

> df
  col1 col2_x col3_x col2_y col3_y col2 col3
1    a      1      1     NA     NA    9    3
2    b      2      2      4      1   NA   NA
3    c      3      3      5      2    7    1
4    d     NA     NA      6      3    8    2

I was expecting that col2 and col3 from data.frame z would have a "_z" suffix. I have tried using empty suffixes while merging two data.frames and it works.

I can work around by renaming the columns in z before doing the second full_join, but in my real data I have several common columns, and if I wanted to merge more data.frames it would complicate the code. This is my expected output.

> colnames(z) = paste0(colnames(z),"_z")

> df = full_join(x, y, by = "col1", suffix = c("_x","_y")) %>% 
  full_join(z, by = c("col1"="col1_z"))

> df
  col1 col2_x col3_x col2_y col3_y col2_z col3_z
1    a      1      1     NA     NA      9      3
2    b      2      2      4      1     NA     NA
3    c      3      3      5      2      7      1
4    d     NA     NA      6      3      8      2

I have seen other similar problems in which adding an extra column to keep track of the source data.frame is used, but I was wondering why does not the suffix term work with multiple joins.

PS: If I keep the first suffix empty, I can add suffixes in the second join, but that will leave the col2 and col3 form x without suffix.

> df = full_join(x, y, by = "col1", suffix = c("","_y")) %>% 
  full_join(z, by = "col1", suffix = c("","_z"))

> df
  col1 col2 col3 col2_y col3_y col2_z col3_z
1    a    1    1     NA     NA      9      3
2    b    2    2      4      1     NA     NA
3    c    3    3      5      2      7      1
4    d   NA   NA      6      3      8      2

I might be missing why this is not working the way I expected, but in principle it should be very simple to add a suffix to the "newly added" columns.

Thanks!

eggrandio avatar Jan 20 '21 20:01 eggrandio

I'm not sure this is in scope for dplyr, but it's come up a couple of times, so I think it's worth considering. Maybe we just need a function that takes a list of data frames, and adds the suffixes for you.

Very very roughly:

add_suffixes <- function(...) {
  dfs <- list2(...)

  no_name <- is.na(names2(dfs)
  names(dfs)[no_name] <- letters[no_name]

  names <- unlist(map(dfs, tbl_vars))
  dups <- ...

  # loop over each df, renaming any duplicates with the suffix
  # return list of data frames.
}

hadley avatar Apr 19 '21 13:04 hadley

@eggrandio My solution when I have a similar issue is to add the first suffix (_x) only in the last join:

> df <- full_join(x, y, by = "col1", suffix = c("","_y")) %>% 
  full_join(z, by = "col1", suffix = c("_x","_z"))
> df
col1 col2_x col3_x col2_y col3_y col2_z col3_z
1    a      1      1     NA     NA      9      3
2    b      2      2      4      1     NA     NA
3    c      3      3      5      2      7      1
4    d     NA     NA      6      3      8      2

iago-pssjd avatar Apr 29 '21 15:04 iago-pssjd

A question related to this issue is how to specify the suffix option for multiple joins when they are done through purrr function reduce?

iago-pssjd avatar Oct 28 '21 13:10 iago-pssjd

There is good support for this in the dm package. Note how it also can print the dm_rename() calls needed to manually rename the columns to disambiguate:

library(conflicted)
library(dm)

a <- data.frame(col1 = c("a", "b", "c"))
x_tbl <- data.frame(col1 = c("a", "b", "c"), col2 = 1:3, col3 = 1:3)
y_tbl <- data.frame(col1 = c("b", "c", "d"), col2 = 4:6, col3 = 1:3)
z_tbl <- data.frame(col1 = c("c", "d", "a"), col2 = 7:9, col3 = 1:3)

dm <-
  dm(a, x_tbl, y_tbl, z_tbl) %>%
  dm_add_fk(a, col1, x_tbl, col1) %>%
  dm_add_fk(a, col1, y_tbl, col1) %>%
  dm_add_fk(a, col1, z_tbl, col1)

dm %>% dm_draw()

dm %>%
  dm_disambiguate_cols()
#> Renaming ambiguous columns: %>%
#>   dm_rename(a, col1.a = col1) %>%
#>   dm_rename(x_tbl, col1.x_tbl = col1) %>%
#>   dm_rename(x_tbl, col2.x_tbl = col2) %>%
#>   dm_rename(x_tbl, col3.x_tbl = col3) %>%
#>   dm_rename(y_tbl, col1.y_tbl = col1) %>%
#>   dm_rename(y_tbl, col2.y_tbl = col2) %>%
#>   dm_rename(y_tbl, col3.y_tbl = col3) %>%
#>   dm_rename(z_tbl, col1.z_tbl = col1) %>%
#>   dm_rename(z_tbl, col2.z_tbl = col2) %>%
#>   dm_rename(z_tbl, col3.z_tbl = col3)
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `a`, `x_tbl`, `y_tbl`, `z_tbl`
#> Columns: 10
#> Primary keys: 0
#> Foreign keys: 3

dm %>%
  dm_flatten_to_tbl(a, .join = full_join)
#> Renaming ambiguous columns: %>%
#>   dm_rename(a, col1.a = col1) %>%
#>   dm_rename(x_tbl, col1.x_tbl = col1) %>%
#>   dm_rename(x_tbl, col2.x_tbl = col2) %>%
#>   dm_rename(x_tbl, col3.x_tbl = col3) %>%
#>   dm_rename(y_tbl, col1.y_tbl = col1) %>%
#>   dm_rename(y_tbl, col2.y_tbl = col2) %>%
#>   dm_rename(y_tbl, col3.y_tbl = col3) %>%
#>   dm_rename(z_tbl, col1.z_tbl = col1) %>%
#>   dm_rename(z_tbl, col2.z_tbl = col2) %>%
#>   dm_rename(z_tbl, col3.z_tbl = col3)
#>   col1.a col2.x_tbl col3.x_tbl col2.y_tbl col3.y_tbl col2.z_tbl col3.z_tbl
#> 1      a          1          1         NA         NA          9          3
#> 2      b          2          2          4          1         NA         NA
#> 3      c          3          3          5          2          7          1
#> 4      d         NA         NA          6          3          8          2

dm %>%
  dm_disambiguate_cols() %>%
  dm_flatten_to_tbl(a, .join = full_join)
#> Renaming ambiguous columns: %>%
#>   dm_rename(a, col1.a = col1) %>%
#>   dm_rename(x_tbl, col1.x_tbl = col1) %>%
#>   dm_rename(x_tbl, col2.x_tbl = col2) %>%
#>   dm_rename(x_tbl, col3.x_tbl = col3) %>%
#>   dm_rename(y_tbl, col1.y_tbl = col1) %>%
#>   dm_rename(y_tbl, col2.y_tbl = col2) %>%
#>   dm_rename(y_tbl, col3.y_tbl = col3) %>%
#>   dm_rename(z_tbl, col1.z_tbl = col1) %>%
#>   dm_rename(z_tbl, col2.z_tbl = col2) %>%
#>   dm_rename(z_tbl, col3.z_tbl = col3)
#>   col1.a col2.x_tbl col3.x_tbl col2.y_tbl col3.y_tbl col2.z_tbl col3.z_tbl
#> 1      a          1          1         NA         NA          9          3
#> 2      b          2          2          4          1         NA         NA
#> 3      c          3          3          5          2          7          1
#> 4      d         NA         NA          6          3          8          2

Created on 2022-08-19 by the reprex package (v2.0.1)

krlmlr avatar Aug 18 '22 23:08 krlmlr

Given that dm has such a nice solution, I think we don't need to implement something in dplyr.

hadley avatar Aug 19 '22 11:08 hadley