dbplyr
dbplyr copied to clipboard
postition of head in pipeline no longer leading to different sql
@hadley @mgirlich I notice that with the current version of dbplyr if I put head() in my pipeline it will return the same sql whether it was before or after a filter. I imagine this relates to the sql optimisation that was introduced in version 2.3.0. I think in this case though the older behaviour was more desirable. I'm not sure this was an intentional change or would be considered a bug?
Current behaviour
library(dbplyr)
#> Warning: package 'dbplyr' was built under R version 4.2.3
packageVersion("dbplyr")
#> [1] '2.5.0'
lf_postgres <- lazy_frame(a = TRUE, b = 1, c = 2, d = "zzzyzzz",
con = simulate_postgres())
lf_postgres |>
head(1) |>
dplyr::filter(d == "zzzyzzz") |>
dplyr::show_query()
#> <SQL>
#> SELECT `df`.*
#> FROM `df`
#> WHERE (`d` = 'zzzyzzz')
#> LIMIT 1
lf_postgres |>
dplyr::filter(d == "zzzyzzz") |>
head(1) |>
dplyr::show_query()
#> <SQL>
#> SELECT `df`.*
#> FROM `df`
#> WHERE (`d` = 'zzzyzzz')
#> LIMIT 1
Created on 2024-04-04 with reprex v2.0.2
Old behaviour
# remotes::install_version("dbplyr", version = "2.2.0", repos = "http://cran.us.r-project.org")
library(dbplyr)
packageVersion("dbplyr")
#> [1] '2.2.0'
lf_postgres <- lazy_frame(a = TRUE, b = 1, c = 2, d = "zzzyzzz",
con = simulate_postgres())
lf_postgres |>
head(1) |>
dplyr::filter(d == "zzzyzzz") |>
dplyr::show_query()
#> <SQL>
#> SELECT *
#> FROM (
#> SELECT *
#> FROM `df`
#> LIMIT 1
#> ) `q01`
#> WHERE (`d` = 'zzzyzzz')
lf_postgres |>
dplyr::filter(d == "zzzyzzz") |>
head(1) |>
dplyr::show_query()
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`d` = 'zzzyzzz')
#> LIMIT 1
Created on 2024-04-04 with reprex v2.0.2
Hmmmm, this is probably technically a bug but it's not as straightforward as it might appear because LIMIT
IIRC is generally not respected inside of subqueries.
Interesting. I suppose throwing a warning if head is included in a complex query might be appropriate?
The desired behaviour (well at least what I was thinking was desired) can be achieved using a filter with row_number. This made me thing that maybe I could use slice_head but then this is not supported - I'm not sure if it would be of interest to support it using something similar to how row_number is working below?
options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(dbplyr)
library(duckdb)
#> Loading required package: DBI
con <- DBI::dbConnect(duckdb(), path = ":memory:")
test_data <- data.frame(
person = c(1L, 2L),
date_1 = as.Date("2001-01-01")
)
db_test_data <- copy_to(con, test_data, overwrite = TRUE)
# head with unexpected behaviour
db_test_data |>
head(1) |>
dplyr::filter(person == 2L)
#> # Source: SQL [1 x 2]
#> # Database: DuckDB v0.10.0 [eburn@Windows 10 x64:R 4.4.0/:memory:]
#> person date_1
#> <int> <date>
#> 1 2 2001-01-01
db_test_data |>
dplyr::filter(person == 2L) |>
head(1)
#> # Source: SQL [1 x 2]
#> # Database: DuckDB v0.10.0 [eburn@Windows 10 x64:R 4.4.0/:memory:]
#> person date_1
#> <int> <date>
#> 1 2 2001-01-01
# filter with row_number provides expected behaviour
db_test_data |>
dplyr::filter(row_number() <= 1) |>
dplyr::filter(person == 2L)
#> # Source: SQL [0 x 2]
#> # Database: DuckDB v0.10.0 [eburn@Windows 10 x64:R 4.4.0/:memory:]
#> # ℹ 2 variables: person <int>, date_1 <date>
db_test_data |>
dplyr::filter(row_number() <= 1) |>
dplyr::filter(person == 2L) |>
dplyr::show_query()
#> <SQL>
#> SELECT person, date_1
#> FROM (
#> SELECT test_data.*, ROW_NUMBER() OVER () AS col01
#> FROM test_data
#> ) q01
#> WHERE (col01 <= 1.0) AND (person = 2)
db_test_data |>
dplyr::filter(person == 2L) |>
dplyr::filter(row_number() <= 1)
#> # Source: SQL [1 x 2]
#> # Database: DuckDB v0.10.0 [eburn@Windows 10 x64:R 4.4.0/:memory:]
#> person date_1
#> <int> <date>
#> 1 2 2001-01-01
db_test_data |>
dplyr::filter(person == 2L) |>
dplyr::filter(row_number() <= 1) |>
dplyr::show_query()
#> <SQL>
#> SELECT person, date_1
#> FROM (
#> SELECT test_data.*, ROW_NUMBER() OVER () AS col01
#> FROM test_data
#> WHERE (person = 2)
#> ) q01
#> WHERE (col01 <= 1.0)
# slice_head not supported, but could be?
db_test_data |>
dplyr::filter(row_number() <= 1) |>
dplyr::slice_head()
#> Error in `dplyr::slice_head()`:
#> ! `slice_head()` is not supported on database backends.
#> ℹ Please use `slice_min()` instead.
Created on 2024-04-05 with reprex v2.1.0
I'm pretty sure the reason that doesn't work is support for ROW_NUMBER()
without an explicit column isn't widespread? My memory is not great here, but I'm pretty sure I would've used that idea if worked broadly.