duckdb-r
duckdb-r copied to clipboard
Duckdb-R n_distinct translation doesn't allow management of NA values
Hello,
Duckdb-r has a dbplyr-like translation for the n_distinct() function that leverages the row() function in duckdb, opening-up more flexibilty. See here for the discussion : https://github.com/duckdb/duckdb-r/issues/110
And here for the function : R/backend-dbplyr__duckdb_connection.R
But this has consequences related to NA/NULL values in the table that imo make it less desirable than the previous translation.
Reprex below :
Try to count the number of distinct values in var2 for each var1 id It should be 2 for var1 = 1 and 1 for var1 = 2
library(dplyr)
library(dbplyr)
library(duckdb)
library(DBI)
my_df <-
data.frame(
var1 = c(1, 1, 1, 2, 2),
var2 = c("value1", "value2", "value1", "value1", NA_character_)
)
drv <- duckdb()
con <- dbConnect(drv)
dbWriteTable(con, "my_df", my_df)
# Should be 2 for var1 = 1 and 1 for var1 = 2
tbl(con, "my_df") |>
summarise(
test = n_distinct(
var2
),
.by = "var1"
)
Result :
tbl(con, "my_df") |>
+ summarise(
+ test = n_distinct(
+ var2
+ ),
+ .by = "var1"
+ )
# Source: SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
var1 test
<dbl> <dbl>
1 2 2
2 1 2
Handle NAs
No change if we try to handles NAs
# Try to handles NAs
tbl(con, "my_df") |>
summarise(
test = n_distinct(
var2[!is.na(var2)]
),
.by = "var1"
)
Results :
tbl(con, "my_df") |>
+ summarise(
+ test = n_distinct(
+ var2[!is.na(var2)]
+ ),
+ .by = "var1"
+ )
# Source: SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
var1 test
<dbl> <dbl>
1 2 2
2 1 2
# Raw SQL Without row() : correct result, even without NA handling
tbl(con, "my_df") |>
summarise(
test = sql(
"COUNT(DISTINCT var2)"
),
.by = "var1"
)
Result :
tbl(con, "my_df") |>
+ summarise(
+ test = sql(
+ "COUNT(DISTINCT var2)"
+ ),
+ .by = "var1"
+ )
# Source: SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
var1 test
<dbl> <dbl>
1 2 1
2 1 2
That's it. I don't know how to have the best of both world, where structs let us handle more complex cases but creates these weird situations with NAs that count as 1
Hi @rplsmn,
before reading your reprexes more carefully, a couple of questions:
- Does the behavior differ from the "dplyr" version of
n_distinct()? If yes, can you provide a simple reprex? (My goal was create a function with identical semantics) n_distinct()does have the argument:na.rmwhich defaults toFALSE. Is the behavior you are seekingna.rm = TRUE?
Only na.rm = FALSE is currently implemented for the duckdb backend, and I guess the proper way to solve this is to also implement the na.rm = TRUE separately.
Hello @lschneiderbauer and thank you for answering me.
-
The behavior doesn't differ from dplyr n_distinct, but that is also what changed with the new implementation. Before the new implementation, n_distinct in databases was implicitly doing
"na.rm = TRUE". I understand how the new version is better for consistency and predictable behavior. -
I am indeed seeking
"na.rm = TRUE", if only so that I can avoid rewriting most of the work I have that relies on the previous implementation of n_distinct (COUNT(DISTINCT ...). I am not skilled enough in SQL to propose an implementation, as I have tried an failed repeatedly before opening this issue :(
Cheers
I think the easiest way to implement this is to make use of FILTER to remove NULL valued rows. However, the docs say it does not work in a window context (yet?) and so that would probably only work within summarize() while it would not work within mutate().
IIRC that was also the reason I did choose to not implement it at all.
Another option would be to fall back to the "old" version altogether when na.rm = TRUE, this would then work only with one column, as it did before.
@lschneiderbauer: Would you like to implement the proposed solution -- fall back to the single column version with na.rm = TRUE ?
@rplsmn Would you be so kind and check/test if #216 contains a version of n_distinct() to your satisfaction?
Hello,
Sorry for the delay in answering, I was travelling back from posit conf ! I looked at #216 and it works, that would be awesome thanks
This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.