dbplyr
dbplyr copied to clipboard
Feature request: implement `bind_rows()`
bind_rows()
in dplyr allows efficiently combining a list of data frames into a single data frame by wrapping vec_rbind
. This does not work for dbplyr tbl
objects.
This is a common pattern so it would be good to implement it without needing to collect the data locally. The SQL equivalent is UNION
, so a way to implement it would be to reduce the list using the existing union
or union_all
(I think dbplyr's implementation would match union_all) translation, e.g.
library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.2.3
library(purrr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Create random slices of mtcars
tables <-
1:5 |>
map(function(n) {
tbl_name <- paste0("mtcars_", n)
copy_to(con, slice_sample(mtcars, n = 3), tbl_name)
tbl(con, tbl_name)
})
reduce(tables, union) |>
show_query()
#> <SQL>
#> SELECT *
#> FROM `mtcars_1`
#> UNION
#> SELECT *
#> FROM `mtcars_2`
#> UNION
#> SELECT *
#> FROM `mtcars_3`
#> UNION
#> SELECT *
#> FROM `mtcars_4`
#> UNION
#> SELECT *
#> FROM `mtcars_5`
Created on 2023-08-02 with reprex v2.0.2
This would be a straightforward way to implement it, and it produces a good translation that is similar to what someone would manually write. I think some checks would need to be done when using select *
to ensure that column order matches, as some SQL engines check only that column types align, not column names.
It would be good also to support the .id
argument to append a column with the list name for each item. This could be achieved by running mutate
on each item prior to the reduce operation.
Relevant: https://github.com/tidyverse/dplyr/issues/1045
A few notes:
-
bind_rows()
is not a generic so to support lazy tables work needs to be done in {dplyr} - It would be
union_all()
for sure, notunion()
- {dbplyr} already solves the issue of column alignment with
union_all()
, at least it did when tested now with Oracle
tables <-
1:5 |>
map(function(n) {
tbl_name <- paste0("mtcars_", n)
df <- mtcars
if (n == 2) df <- rev(mtcars)
copy_to(con, slice_sample(df, n = 3), tbl_name)
tbl(con, tbl_name)
})
reduce(tables, union_all) |>
show_query()
#> <SQL>
#> SELECT *
#> FROM `mtcars_1`
#> UNION ALL
#> SELECT
#> `mpg`,
#> `cyl`,
#> `disp`,
#> `hp`,
#> `drat`,
#> `wt`,
#> `qsec`,
#> `vs`,
#> `am`,
#> `gear`,
#> `carb`
#> FROM `mtcars_2`
#> UNION ALL
#> SELECT *
#> FROM `mtcars_3`
#> UNION ALL
#> SELECT *
#> FROM `mtcars_4`
#> UNION ALL
#> SELECT *
#> FROM `mtcars_5`
Reading through the dplyr tickets I think the choice is deliberate, though I wouldn't mind a bind_rows()
fix. An issue is that union_all()
fails on local tables if the number of columns is different, but not on data bases. So reduce(tables, union_all)
is a correct translation of bind_rows()
for data bases but we can't use it on local tables reliably so if you want to write code that works on both lazy and local you'll need something like :
if (lazy) {
table <- Reduce(union_all, tables)
} else {
table <- bind_rows(tables)
}
And that's not as pretty as it could be
I agree it would be very nice and I missed it a couple of times myself. Therefore, I opened an issue in dplyr: https://github.com/tidyverse/dplyr/issues/6905. I guess the discussion should continue there for now.