dplyr icon indicating copy to clipboard operation
dplyr copied to clipboard

`bind_rows()` for database tables

Open mgirlich opened this issue 1 year ago • 3 comments

It would be nice if bind_rows() would work for database tables.

library(dplyr, warn.conflicts = FALSE)
library(purrr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Create random slices of mtcars
tables <-
  1:5 |>
  map(function(n) {
    tbl_name <- paste0("mtcars_", n)
    copy_to(con, slice_sample(mtcars, n = 3), tbl_name)

    tbl(con, tbl_name)
  })
bind_rows(!!!tables)
#> Error in `bind_rows()`:
#> ! Argument 1 must be a data frame or a named atomic vector.
#> Backtrace:
#>     ▆
#>  1. └─dplyr::bind_rows(!!!tables)
#>  2.   └─rlang::abort(glue("Argument {i} must be a data frame or a named atomic vector."))

Created on 2023-08-07 with reprex v2.0.2

I know there were issues with the double dispatch and what happens if local tables and database tables are mixed. But I think it would be sufficient if either all inputs are local tables or database tables. This was also requested again in dbplyr: https://github.com/tidyverse/dbplyr/issues/1342.

mgirlich avatar Aug 07 '23 10:08 mgirlich

I made this change in the source and it passes all dplyr tests:

bind_rows <- function(..., .id = NULL) {
  dots <- list2(...)
  is_flattenable <- function(x) !is_named(x)
  if (length(dots) == 1 && is_bare_list(dots[[1]])) {
    dots <- dots[[1]]
  }
  dots <- list_flatten(dots, fn = is_flattenable)
  dots <- discard(dots, is.null)
  if (length(dots) == 0) {
    return(tibble())
  }
  UseMethod("bind_rows", dots[[1]])
}

#' @export
bind_rows.tbl_SQLiteConnection <- function(...) {
  message("hello there")
}

#' @export
bind_rows.LazyFrame <- function(...) {
  polars::pl$concat(..., how = "vertical")
}

#' @export
bind_rows.default <- function(..., .id = NULL) {
  # Current code of `bind_rows()`
}

Then, internally, each method could determine how to handle the ... (e.g error if not all items have the same class).

Here, I check that bind_rows() works fine with dplyr, some database connections (I don't know how it would bind rows so I just put a message), and polars:

library(dplyr, warn.conflicts = FALSE)
library(purrr)
library(polars)

### classic dplyr
bind_rows(tibble(mtcars), tibble(mtcars)) |> 
  head()
#> # A tibble: 6 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

### databases
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

tables <-
  1:5 |>
  map(function(n) {
    tbl_name <- paste0("mtcars_", n)
    copy_to(con, slice_sample(mtcars, n = 3), tbl_name)
    
    tbl(con, tbl_name)
  })
bind_rows(!!!tables)
#> hello there

### Polars LazyFrames
bind_rows(pl$LazyFrame(mtcars), pl$LazyFrame(mtcars))
#> [1] "polars LazyFrame naive plan: (run ldf$describe_optimized_plan() to see the optimized plan)"
#> UNION
#>   PLAN 0:
#>     DF ["mpg", "cyl", "disp", "hp"]; PROJECT */11 COLUMNS; SELECTION: "None"
#>   PLAN 1:
#>     DF ["mpg", "cyl", "disp", "hp"]; PROJECT */11 COLUMNS; SELECTION: "None"
#> END UNION

Would this implementation be enough or am I missing something?

etiennebacher avatar Nov 06 '23 11:11 etiennebacher

@DavisVaughan it would be great to know whether my suggested implementation makes sense. If so, I can prepare a PR for this

etiennebacher avatar Jun 14 '24 15:06 etiennebacher