duckdb-r
duckdb-r copied to clipboard
Duckdb-R n_distinct translation doesn't allow management of NA values
Hello,
Duckdb-r has a dbplyr-like translation for the n_distinct() function that leverages the row() function in duckdb, opening-up more flexibilty. See here for the discussion : https://github.com/duckdb/duckdb-r/issues/110
And here for the function : R/backend-dbplyr__duckdb_connection.R
But this has consequences related to NA/NULL values in the table that imo make it less desirable than the previous translation.
Reprex below :
Try to count the number of distinct values in var2 for each var1 id It should be 2 for var1 = 1 and 1 for var1 = 2
library(dplyr)
library(dbplyr)
library(duckdb)
library(DBI)
my_df <-
data.frame(
var1 = c(1, 1, 1, 2, 2),
var2 = c("value1", "value2", "value1", "value1", NA_character_)
)
drv <- duckdb()
con <- dbConnect(drv)
dbWriteTable(con, "my_df", my_df)
# Should be 2 for var1 = 1 and 1 for var1 = 2
tbl(con, "my_df") |>
summarise(
test = n_distinct(
var2
),
.by = "var1"
)
Result :
tbl(con, "my_df") |>
+ summarise(
+ test = n_distinct(
+ var2
+ ),
+ .by = "var1"
+ )
# Source: SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
var1 test
<dbl> <dbl>
1 2 2
2 1 2
Handle NAs
No change if we try to handles NAs
# Try to handles NAs
tbl(con, "my_df") |>
summarise(
test = n_distinct(
var2[!is.na(var2)]
),
.by = "var1"
)
Results :
tbl(con, "my_df") |>
+ summarise(
+ test = n_distinct(
+ var2[!is.na(var2)]
+ ),
+ .by = "var1"
+ )
# Source: SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
var1 test
<dbl> <dbl>
1 2 2
2 1 2
# Raw SQL Without row() : correct result, even without NA handling
tbl(con, "my_df") |>
summarise(
test = sql(
"COUNT(DISTINCT var2)"
),
.by = "var1"
)
Result :
tbl(con, "my_df") |>
+ summarise(
+ test = sql(
+ "COUNT(DISTINCT var2)"
+ ),
+ .by = "var1"
+ )
# Source: SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
var1 test
<dbl> <dbl>
1 2 1
2 1 2
That's it. I don't know how to have the best of both world, where structs let us handle more complex cases but creates these weird situations with NAs that count as 1