dbplyr icon indicating copy to clipboard operation
dbplyr copied to clipboard

Feature request: implement `bind_rows()`

Open bairdj opened this issue 11 months ago • 2 comments

bind_rows() in dplyr allows efficiently combining a list of data frames into a single data frame by wrapping vec_rbind. This does not work for dbplyr tbl objects.

This is a common pattern so it would be good to implement it without needing to collect the data locally. The SQL equivalent is UNION, so a way to implement it would be to reduce the list using the existing union or union_all (I think dbplyr's implementation would match union_all) translation, e.g.

library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.2.3
library(purrr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Create random slices of mtcars
tables <-
  1:5 |>
  map(function(n) {
    tbl_name <- paste0("mtcars_", n)
    copy_to(con, slice_sample(mtcars, n = 3), tbl_name)

    tbl(con, tbl_name)
  })

reduce(tables, union) |>
  show_query()
#> <SQL>
#> SELECT *
#> FROM `mtcars_1`
#> UNION
#> SELECT *
#> FROM `mtcars_2`
#> UNION
#> SELECT *
#> FROM `mtcars_3`
#> UNION
#> SELECT *
#> FROM `mtcars_4`
#> UNION
#> SELECT *
#> FROM `mtcars_5`

Created on 2023-08-02 with reprex v2.0.2

This would be a straightforward way to implement it, and it produces a good translation that is similar to what someone would manually write. I think some checks would need to be done when using select * to ensure that column order matches, as some SQL engines check only that column types align, not column names.

It would be good also to support the .id argument to append a column with the list name for each item. This could be achieved by running mutate on each item prior to the reduce operation.

bairdj avatar Aug 02 '23 10:08 bairdj

Relevant: https://github.com/tidyverse/dplyr/issues/1045

A few notes:

  • bind_rows() is not a generic so to support lazy tables work needs to be done in {dplyr}
  • It would be union_all() for sure, not union()
  • {dbplyr} already solves the issue of column alignment with union_all(), at least it did when tested now with Oracle
tables <-
  1:5 |>
  map(function(n) {
    tbl_name <- paste0("mtcars_", n)
    df <- mtcars
    if (n == 2) df <- rev(mtcars)
    copy_to(con, slice_sample(df, n = 3), tbl_name)
    tbl(con, tbl_name)
  })
reduce(tables, union_all) |>
  show_query()
#> <SQL>
#> SELECT *
#> FROM `mtcars_1`
#> UNION ALL
#> SELECT
#>   `mpg`,
#>   `cyl`,
#>   `disp`,
#>   `hp`,
#>   `drat`,
#>   `wt`,
#>   `qsec`,
#>   `vs`,
#>   `am`,
#>   `gear`,
#>   `carb`
#> FROM `mtcars_2`
#> UNION ALL
#> SELECT *
#> FROM `mtcars_3`
#> UNION ALL
#> SELECT *
#> FROM `mtcars_4`
#> UNION ALL
#> SELECT *
#> FROM `mtcars_5`

Reading through the dplyr tickets I think the choice is deliberate, though I wouldn't mind a bind_rows() fix. An issue is that union_all() fails on local tables if the number of columns is different, but not on data bases. So reduce(tables, union_all) is a correct translation of bind_rows() for data bases but we can't use it on local tables reliably so if you want to write code that works on both lazy and local you'll need something like :

  if (lazy) {
    table <- Reduce(union_all, tables)
  } else {
    table <- bind_rows(tables)
  }

And that's not as pretty as it could be

moodymudskipper avatar Aug 04 '23 14:08 moodymudskipper

I agree it would be very nice and I missed it a couple of times myself. Therefore, I opened an issue in dplyr: https://github.com/tidyverse/dplyr/issues/6905. I guess the discussion should continue there for now.

mgirlich avatar Aug 07 '23 10:08 mgirlich