duckplyr
                                
                                 duckplyr copied to clipboard
                                
                                    duckplyr copied to clipboard
                            
                            
                            
                        duckplyr with remote tables in a duckdb connection
Hello,
Trying out duckplyr with a remote table results in the following error. Is duckplyr compatible with remote tables in a duckdb connection? Not sure if the idea is to use dbplyr when working with remote tables, and duckplyr when working with data in memory. Feedback much appreciated.
Thank you. -Ed
Error:
con = DBI::dbConnect(duckdb::duckdb(), "foo.db") #db on disk
DBI::dbWriteTable(con, name = "iris", value = iris)
dplyr::tbl(con, "iris") |> duckplyr::as_duckplyr_df()
Error in `duckplyr::as_duckplyr_df()`:
! Must pass a plain data frame or a tibble to `as_duckplyr_df()`.
Traceback:
1. duckplyr::as_duckplyr_df(dplyr::tbl(con, "iris"))
2. abort("Must pass a plain data frame or a tibble to `as_duckplyr_df()`.")
3. signal_abort(cnd, .file)
Thanks, good question. I see two components:
- Use in a new project
- Use in an existing project
New project
No need to bother with connections. Start with data frames, use as_duckplyr_df(), from that point on the queries will be processed by duckdb where possible.
Internally, duckplyr uses a DBI connection to duckdb, but this is not meant to be accessed by the user. There is currently no way to specify the location of the database file for this internal connection. Do you think we need an option for this to avoid keeping everything in memory?
Existing project
Because of the internal DBI connection, it is difficult to mix dbplyr code and duckplyr code. I wonder how to make this more seamless. Ideally, as_duckplyr_df() would work for dbplyr tbl objects.
Sketch (with a dummy relational object and unexported functions):
con <- DBI::dbConnect(duckdb::duckdb(), "foo.db") # db on disk
DBI::dbWriteTable(con, name = "iris", value = iris)
tbl <-
  dplyr::tbl(con, "iris") |>
  dplyr::filter(Petal.Length <= 1.2)
tbl |> dplyr::show_query()
#> <SQL>
#> SELECT iris.*
#> FROM iris
#> WHERE ("Petal.Length" <= 1.2)
# Dummy rel object
rel <- duckdb:::rel_from_df(con, data.frame(a = integer()))
duckdb:::rel_sql(rel, dbplyr::sql_render(tbl))
#> # A tibble: 4 × 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          4.3         3            1.1         0.1 setosa 
#> 2          5.8         4            1.2         0.2 setosa 
#> 3          4.6         3.6          1           0.2 setosa 
#> 4          5           3.2          1.2         0.2 setosa
Created on 2023-11-20 with reprex v2.0.2
Hello @krlmlr,
Do you think we need an option for this to avoid keeping everything in memory?
Yes I think this would be helpful.
Perhaps an optional con param in as_duckplyr_df() could do the trick? In the event that a user wants to use a duckplyr_df on disk, a con of type duckdb_connection could be passed. Also, to allow multiple duckplyr_df in the same con, it may also make sense to have a name param to name each in-memory object for their remote table counterpart. I believe name, con follow the conventions in DBI.
We allow multiple duckplyr_df objects in the same con already.
The opposite may be a bit trickier. @Tmonster: are there any obstacles combining multiple relational objects that were created from different connections, e.g., with joins?
I was thinking about an option to configure the default connection too, but passing the connection object may be the easiest.
I'm a bit confused. For my clarity, remote means a duckdb table in a different duckdb database file right? Or connection I guess? Since duckplyr maintains it's own connection to a duckdb database?
I'll have to look into it, but combining relational objects from two different duckdb connections might be difficult.
I think it might be easier to integrate the attach/detach functionality that duckdb has.
If a user has other existing duckdb database files and they want to use duckplyr functionality without calling as.data.frame() , then we could expose some kind of duckplyr_attach_db() function.
Would this work?
If we can't mix and match relational objects from different connections, we should check that they are the same for joins and other operations.
We'll also take a look into connections and database storage modes.
For joins we already check if they are the same. See join_relation.cpp in src/main/relation/join_relation.cpp in the duckdb main code
I came here with the same question as OP. as_duckplyr_tibble is now depreciated and as_duckdb_tibble does not seem to accept connections, anyway that I can see. Can we use on disk duckdb databases with duckplyr?
@vorpalvorpal Solve available here: https://github.com/tidyverse/duckplyr/issues/139#issuecomment-2544693687