duckdb-r icon indicating copy to clipboard operation
duckdb-r copied to clipboard

Implement DISTINCT ON for dbplyr backend

Open cy-james-lee opened this issue 5 months ago • 0 comments

Currently, if distinct is used with .keep_all = TRUE, ROW_NUMBER is used. Which is normal dbplyr behavior for generics. But in my daily uses, I found DISTINCT ON is much faster than filtering a window function of a subquery.

library(duckdb)
#> Loading required package: DBI
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
conn1 <- DBI::dbConnect(duckdb())

duckdb_register(conn1, "iris", iris)
tbl_iris <- tbl(conn1, "iris")

# Current behavior
tbl_iris |> 
  distinct(
    Petal.Width, 
    Species,
    .keep_all = TRUE
  ) |> 
  show_query()
#> <SQL>
#> SELECT "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", Species
#> FROM (
#>   SELECT
#>     iris.*,
#>     ROW_NUMBER() OVER (PARTITION BY "Petal.Width", Species ORDER BY "Sepal.Length") AS col01
#>   FROM iris
#> ) q01
#> WHERE (col01 = 1)

dbDisconnect(conn1)
rm(conn1, tbl_iris)

Created on 2024-09-18 with reprex v2.1.1

cy-james-lee avatar Sep 18 '24 17:09 cy-james-lee