duckdb-r
duckdb-r copied to clipboard
Implement DISTINCT ON for dbplyr backend
Currently, if distinct
is used with .keep_all = TRUE
, ROW_NUMBER
is used. Which is normal dbplyr
behavior for generics. But in my daily uses, I found DISTINCT ON
is much faster than filtering a window function of a subquery.
library(duckdb)
#> Loading required package: DBI
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
conn1 <- DBI::dbConnect(duckdb())
duckdb_register(conn1, "iris", iris)
tbl_iris <- tbl(conn1, "iris")
# Current behavior
tbl_iris |>
distinct(
Petal.Width,
Species,
.keep_all = TRUE
) |>
show_query()
#> <SQL>
#> SELECT "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", Species
#> FROM (
#> SELECT
#> iris.*,
#> ROW_NUMBER() OVER (PARTITION BY "Petal.Width", Species ORDER BY "Sepal.Length") AS col01
#> FROM iris
#> ) q01
#> WHERE (col01 = 1)
dbDisconnect(conn1)
rm(conn1, tbl_iris)
Created on 2024-09-18 with reprex v2.1.1