
difference between `duckplyr` and `dbplyr`?

Open ywhcuhk opened this issue 1 year ago • 12 comments

I find it quite convenient to use duckdb as a backend for dplyr (through dbplyr). All you need to do is specify a duckdb connection and read the data through a duckdb function; then you can manipulate the data with dplyr verbs. For example:

library(duckdb)
library(dplyr)
con <- dbConnect(duckdb())

df <- tbl(con, "read_parquet('yourdata.parquet')")

So what can duckplyr do that dbplyr can't?

ywhcuhk avatar Apr 22 '24 19:04 ywhcuhk

Thanks, good question.

I've started to add the following content to README.Rmd but am not sure where it really belongs. A new FAQ section perhaps?

The goal of duckplyr is to provide a fully compatible drop-in replacement for dplyr that uses DuckDB as a backend for fast operation. DuckDB is an in-process OLAP database management system. Unlike dbplyr with duckdb, this package aims to support all R operators, functions, and data types, in a fully compatible way. If dplyr and duckplyr give different results on the same input data, this is considered a bug!

The operators, functions, and data types that currently cannot be translated to duckdb are executed in R. The set of supported operations is continuously expanding.
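
To illustrate the drop-in idea, here is a minimal sketch (assuming duckplyr is installed; `as_duckplyr_df()` is the conversion helper, and the subsequent dplyr verbs are translated to DuckDB where possible):

```r
library(duckplyr)

# Convert a plain data frame into a duckplyr data frame; the dplyr
# verbs that follow are executed by DuckDB where a translation exists,
# and fall back to plain dplyr otherwise.
out <- mtcars |>
  duckplyr::as_duckplyr_df() |>
  dplyr::filter(cyl == 6) |>
  dplyr::summarise(mean_mpg = mean(mpg))

out
```

Either way, the result is required to be identical to what plain dplyr returns on the same input.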

krlmlr avatar Apr 27 '24 19:04 krlmlr

This high-level blog post is a good intro too: https://posit.co/blog/duckplyr-dplyr-powered-by-duckdb/ .

krlmlr avatar Apr 29 '24 17:04 krlmlr

Thanks a lot for the clarification. Just to check, when you say:

Unlike dbplyr with duckdb, this package aims to support all R operators, functions, and data types, in a fully compatible way.

Does this mean that when duckplyr runs into a function that is not available in duckdb, the data will be collected and then processed in RAM? One major advantage of using duckdb is its "out-of-memory" capability. If the data gets collected during the query, one could run into memory issues.

ywhcuhk avatar Apr 30 '24 13:04 ywhcuhk

True. You can set the DUCKPLYR_FORCE environment variable to "TRUE" to avoid this, but many operations will be unavailable then.
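
For reference, a sketch of how to opt in (the environment variable must be set before the pipeline runs):

```r
# Force DuckDB-only execution: operations that cannot be translated
# will then error instead of silently falling back to R data frames.
Sys.setenv(DUCKPLYR_FORCE = "TRUE")

# Unset to restore the default fallback behaviour:
# Sys.unsetenv("DUCKPLYR_FORCE")
```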

krlmlr avatar Apr 30 '24 16:04 krlmlr

One major advantage of using duckdb is its "out-of-memory" capability. If the data gets collected during the query, one could run into memory issues.

I came here to seek this specific point of clarification. My understanding (from intuition and from reading the docs) is that duckplyr is strictly for the in-memory DuckDB database and does not support out-of-core operations.

Is that safe to say? So, if one is working with larger-than-memory data, they should consider using duckdb("path/to/db-dir"). Is that correct?

JosiahParry avatar May 06 '24 23:05 JosiahParry

Here's how we're connecting to duckdb:

duckplyr:::create_default_duckdb_connection
#> function() {
#>   drv <- duckdb::duckdb()
#>   con <- DBI::dbConnect(drv)
#> 
#>   DBI::dbExecute(con, "set memory_limit='1GB'")
#>   DBI::dbExecute(con, paste0("pragma temp_directory='", tempdir(), "'"))
#> 
#>   duckdb$rapi_load_rfuns(drv@database_ref)
#> 
#>   for (i in seq_along(duckplyr_macros)) {
#>     sql <- paste0('CREATE MACRO "', names(duckplyr_macros)[[i]], '"', duckplyr_macros[[i]])
#>     DBI::dbExecute(con, sql)
#>   }
#> 
#>   con
#> }
#> <bytecode: 0x1135cd480>
#> <environment: namespace:duckplyr>

Created on 2024-05-07 with reprex v2.1.0

Memory is limited to 1 GB, and we set a temporary directory so DuckDB can spill to disk. We also support processing from and to files with duckplyr::*df_from_*() and duckplyr::df_to_parquet() .

What do you mean by "support out-of-core operations"?

krlmlr avatar May 07 '24 03:05 krlmlr

Thanks @krlmlr. I'm referring to this part of the DuckDB documentation

[screenshot of the DuckDB documentation]

My understanding is that by only being able to use the default driver :memory:, the entire database is stored in RAM and does not allow for larger than memory workloads.

As I understand it, one of the main motivating points of DuckDB is this capability.

JosiahParry avatar May 07 '24 12:05 JosiahParry

This is interesting; I am also playing with both packages. I have noticed some performance differences between them: dbplyr was a bit faster than duckplyr when querying Parquet files on S3.

Did I miss something when exploring duckplyr?

PMassicotte avatar May 07 '24 13:05 PMassicotte

Thanks, Philippe, interesting. With current duckplyr, I'm seeing that the filter is not pushed down to Parquet. Could that play a role? What does the plan look like for dbplyr?

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(DBI)

con <- duckplyr:::get_default_duckdb_connection()

dbSendQuery(con, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result e60e0 connection=e5160 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(
  con,
  "SET s3_region='auto';SET s3_endpoint='';"
)
#> <duckdb_result e63e0 connection=e5160 statement='SET s3_region='auto';SET s3_endpoint='';'>

out <- duckplyr::duckplyr_df_from_file(
  "s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet",
  "read_parquet",
  options = list(hive_partitioning = TRUE),
  class = class(tibble())
) |>
  filter(total_amount > 0L) |>
  filter(!is.na(passenger_count)) |>
  mutate(tip_pct = 100 * tip_amount / total_amount) |>
  summarise(
    avg_tip_pct = median(tip_pct),
    n = n(),
    .by = passenger_count
  ) |>
  arrange(desc(passenger_count))

out |>
  explain()
#> ┌───────────────────────────┐
#> │          ORDER_BY         │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │          ORDERS:          │
#> │           #3 ASC          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │        avg_tip_pct        │
#> │             n             │
#> │     -(passenger_count)    │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │        avg_tip_pct        │
#> │             n             │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       HASH_GROUP_BY       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             #0            │
#> │         median(#1)        │
#> │        count_star()       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │         tip_amount        │
#> │        total_amount       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │           FILTER          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │(r_base::>(total_amount, 0)│
#> │ AND (NOT ((passenger_count│
#> │   IS NULL) OR isnan(CAST  │
#> │(passenger_count AS DO...  │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 17623488       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       READ_PARQUET        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        total_amount       │
#> │      passenger_count      │
#> │         tip_amount        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 88117440       │
#> └───────────────────────────┘

Created on 2024-05-07 with reprex v2.1.0

krlmlr avatar May 07 '24 18:05 krlmlr

Josiah: duckplyr operates directly on data frames; it never creates persistent tables in duckdb's table store, so the location of the database doesn't play much of a role. The DBI equivalents are perhaps duckdb::duckdb_register() (create a view of the data frame in the database) vs. DBI::dbWriteTable() (write a table to the table store). You can write the results of a duckplyr operation directly to Parquet with df_to_parquet() without materializing them as a data frame. However, as soon as duckplyr encounters an operation it can't translate, it will fall back to data frames. Set the DUCKPLYR_FORCE environment variable to avoid this.
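
That DBI contrast (view over an R data frame vs. persistent table) can be sketched as follows, assuming the duckdb package is installed:

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())

# A view over the R data frame: no copy, the data stays in R's memory.
duckdb::duckdb_register(con, "mtcars_view", mtcars)

# A real table in DuckDB's table store: the data is copied into the database.
dbWriteTable(con, "mtcars_table", mtcars)

# Both are queryable through plain SQL.
n_view  <- dbGetQuery(con, "SELECT count(*) AS n FROM mtcars_view")$n
n_table <- dbGetQuery(con, "SELECT count(*) AS n FROM mtcars_table")$n

dbDisconnect(con, shutdown = TRUE)
```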

krlmlr avatar May 07 '24 18:05 krlmlr

Thank you @krlmlr for looking. This is what I have as plan:

library(duckdb)
#> Loading required package: DBI
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

con <- dbConnect(duckdb())

dbSendQuery(con, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result a8340 connection=70be0 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(con, "SET s3_region='auto';SET s3_endpoint='';")
#> <duckdb_result 4dbc0 connection=70be0 statement='SET s3_region='auto';SET s3_endpoint='';'>

df <- tbl(
  con,
  "read_parquet('s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet')"
)

out <- df |>
  filter(total_amount > 0L) |>
  filter(!is.na(passenger_count)) |>
  mutate(tip_pct = 100 * tip_amount / total_amount) |>
  summarise(
    avg_tip_pct = median(tip_pct),
    n = n(),
    .by = passenger_count
  ) |>
  arrange(desc(passenger_count))

explain(out)
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> <SQL>
#> SELECT
#>   passenger_count,
#>   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY tip_pct) AS avg_tip_pct,
#>   COUNT(*) AS n
#> FROM (
#>   SELECT q01.*, (100.0 * tip_amount) / total_amount AS tip_pct
#>   FROM (FROM read_parquet('s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet')) q01
#>   WHERE (total_amount > 0) AND (NOT((passenger_count IS NULL)))
#> ) q01
#> GROUP BY passenger_count
#> ORDER BY passenger_count DESC
#> 
#> <PLAN>
#> physical_plan
#> ┌───────────────────────────┐
#> │          ORDER_BY         │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │          ORDERS:          │
#> │  q01.passenger_count DESC │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       HASH_GROUP_BY       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             #0            │
#> │     quantile_cont(#1)     │
#> │        count_star()       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │           FILTER          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │  (NOT (passenger_count IS │
#> │           NULL))          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 3524697        │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       READ_PARQUET        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │         tip_amount        │
#> │        total_amount       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │ Filters: total_amount>0.0 │
#> │ AND total_amount IS N...  │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 17623488       │
#> └───────────────────────────┘

dbDisconnect(con)

Created on 2024-05-07 with reprex v2.1.0

I think you are correct about filter() not being pushed down to the parquet files.

PMassicotte avatar May 07 '24 19:05 PMassicotte

Thanks, Philippe. Would you like to open a new issue with a reprex about the filter pushdown?

krlmlr avatar May 20 '24 18:05 krlmlr

When documenting, need to mention that we never generate SQL: #132.

krlmlr avatar Jun 30 '24 16:06 krlmlr