dplyr
dplyr copied to clipboard
`bind_rows()` for database tables
It would be nice if bind_rows()
would work for database tables.
library(dplyr, warn.conflicts = FALSE)
library(purrr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Create random slices of mtcars
tables <-
1:5 |>
map(function(n) {
tbl_name <- paste0("mtcars_", n)
copy_to(con, slice_sample(mtcars, n = 3), tbl_name)
tbl(con, tbl_name)
})
bind_rows(!!!tables)
#> Error in `bind_rows()`:
#> ! Argument 1 must be a data frame or a named atomic vector.
#> Backtrace:
#> ▆
#> 1. └─dplyr::bind_rows(!!!tables)
#> 2. └─rlang::abort(glue("Argument {i} must be a data frame or a named atomic vector."))
Created on 2023-08-07 with reprex v2.0.2
I know there were issues with the double dispatch and what happens if local tables and database tables are mixed. But I think it would be sufficient if either all inputs are local tables or database tables. This was also requested again in dbplyr: https://github.com/tidyverse/dbplyr/issues/1342.
I made this change in the source and it passes all dplyr
tests:
bind_rows <- function(..., .id = NULL) {
dots <- list2(...)
is_flattenable <- function(x) !is_named(x)
if (length(dots) == 1 && is_bare_list(dots[[1]])) {
dots <- dots[[1]]
}
dots <- list_flatten(dots, fn = is_flattenable)
dots <- discard(dots, is.null)
if (length(dots) == 0) {
return(tibble())
}
UseMethod("bind_rows", dots[[1]])
}
#' @export
bind_rows.tbl_SQLiteConnection <- function(...) {
message("hello there")
}
#' @export
bind_rows.LazyFrame <- function(...) {
polars::pl$concat(..., how = "vertical")
}
#' @export
bind_rows.default <- function(..., .id = NULL) {
# Current code of `bind_rows()`
}
Then, internally, each method could determine how to handle the ...
(e.g error if not all items have the same class).
Here, I check that bind_rows()
works fine with dplyr
, some database connections (I don't know how it would bind rows so I just put a message), and polars
:
library(dplyr, warn.conflicts = FALSE)
library(purrr)
library(polars)
### classic dplyr
bind_rows(tibble(mtcars), tibble(mtcars)) |>
head()
#> # A tibble: 6 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
### databases
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
tables <-
1:5 |>
map(function(n) {
tbl_name <- paste0("mtcars_", n)
copy_to(con, slice_sample(mtcars, n = 3), tbl_name)
tbl(con, tbl_name)
})
bind_rows(!!!tables)
#> hello there
### Polars LazyFrames
bind_rows(pl$LazyFrame(mtcars), pl$LazyFrame(mtcars))
#> [1] "polars LazyFrame naive plan: (run ldf$describe_optimized_plan() to see the optimized plan)"
#> UNION
#> PLAN 0:
#> DF ["mpg", "cyl", "disp", "hp"]; PROJECT */11 COLUMNS; SELECTION: "None"
#> PLAN 1:
#> DF ["mpg", "cyl", "disp", "hp"]; PROJECT */11 COLUMNS; SELECTION: "None"
#> END UNION
Would this implementation be enough or am I missing something?
@DavisVaughan it would be great to know whether my suggested implementation makes sense. If so, I can prepare a PR for this