duckdb-r icon indicating copy to clipboard operation
duckdb-r copied to clipboard

Duckdb-R n_distinct translation doesn't allow management of NA values

Open rplsmn opened this issue 1 year ago • 5 comments

Hello,

Duckdb-r has a dbplyr-like translation for the n_distinct() function that leverages the row() function in duckdb, opening-up more flexibilty. See here for the discussion : https://github.com/duckdb/duckdb-r/issues/110

And here for the function : R/backend-dbplyr__duckdb_connection.R

But this has consequences related to NA/NULL values in the table that imo make it less desirable than the previous translation.

Reprex below :

Try to count the number of distinct values in var2 for each var1 id It should be 2 for var1 = 1 and 1 for var1 = 2

library(dplyr)
library(dbplyr)
library(duckdb)
library(DBI)

my_df <- 
  data.frame(
    var1 = c(1, 1, 1, 2, 2),
    var2 = c("value1", "value2", "value1", "value1", NA_character_)
  )

drv <- duckdb()
con <- dbConnect(drv)

dbWriteTable(con, "my_df", my_df)

# Should be 2 for var1 = 1 and 1 for var1 = 2

tbl(con, "my_df") |> 
  summarise(
    test = n_distinct(
      var2
    ),
    .by = "var1"
  )

Result :

tbl(con, "my_df") |> 
+   summarise(
+     test = n_distinct(
+       var2
+     ),
+     .by = "var1"
+   )
# Source:   SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
   var1  test
  <dbl> <dbl>
1     2     2
2     1     2

Handle NAs

No change if we try to handles NAs

# Try to handles NAs

tbl(con, "my_df") |> 
  summarise(
    test = n_distinct(
      var2[!is.na(var2)]
    ),
    .by = "var1"
  )

Results :

tbl(con, "my_df") |> 
+   summarise(
+     test = n_distinct(
+       var2[!is.na(var2)]
+     ),
+     .by = "var1"
+   )
# Source:   SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
   var1  test
  <dbl> <dbl>
1     2     2
2     1     2

# Raw SQL Without row() : correct result, even without NA handling

tbl(con, "my_df") |> 
  summarise(
    test = sql(
      "COUNT(DISTINCT var2)"
    ),
    .by = "var1"
  ) 

Result :

tbl(con, "my_df") |> 
+   summarise(
+     test = sql(
+       "COUNT(DISTINCT var2)"
+     ),
+     .by = "var1"
+   ) 
# Source:   SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
   var1  test
  <dbl> <dbl>
1     2     1
2     1     2

That's it. I don't know how to have the best of both world, where structs let us handle more complex cases but creates these weird situations with NAs that count as 1

rplsmn avatar Jul 26 '24 12:07 rplsmn

Hi @rplsmn,

before reading your reprexes more carefully, a couple of questions:

  • Does the behavior differ from the "dplyr" version of n_distinct() ? If yes, can you provide a simple reprex? (My goal was create a function with identical semantics)
  • n_distinct() does have the argument: na.rm which defaults to FALSE. Is the behavior you are seeking na.rm = TRUE ?

Only na.rm = FALSE is currently implemented for the duckdb backend, and I guess the proper way to solve this is to also implement the na.rm = TRUE separately.

lschneiderbauer avatar Aug 12 '24 06:08 lschneiderbauer

Hello @lschneiderbauer and thank you for answering me.

  1. The behavior doesn't differ from dplyr n_distinct, but that is also what changed with the new implementation. Before the new implementation, n_distinct in databases was implicitly doing "na.rm = TRUE". I understand how the new version is better for consistency and predictable behavior.

  2. I am indeed seeking "na.rm = TRUE", if only so that I can avoid rewriting most of the work I have that relies on the previous implementation of n_distinct (COUNT(DISTINCT ...). I am not skilled enough in SQL to propose an implementation, as I have tried an failed repeatedly before opening this issue :(

Cheers

rplsmn avatar Aug 13 '24 02:08 rplsmn

I think the easiest way to implement this is to make use of FILTER to remove NULL valued rows. However, the docs say it does not work in a window context (yet?) and so that would probably only work within summarize() while it would not work within mutate(). IIRC that was also the reason I did choose to not implement it at all.

Another option would be to fall back to the "old" version altogether when na.rm = TRUE, this would then work only with one column, as it did before.

lschneiderbauer avatar Aug 13 '24 04:08 lschneiderbauer

@lschneiderbauer: Would you like to implement the proposed solution -- fall back to the single column version with na.rm = TRUE ?

krlmlr avatar Aug 17 '24 08:08 krlmlr

@rplsmn Would you be so kind and check/test if #216 contains a version of n_distinct() to your satisfaction?

lschneiderbauer avatar Aug 17 '24 18:08 lschneiderbauer

Hello,

Sorry for the delay in answering, I was travelling back from posit conf ! I looked at #216 and it works, that would be awesome thanks

rplsmn avatar Aug 20 '24 08:08 rplsmn

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

github-actions[bot] avatar Sep 15 '25 02:09 github-actions[bot]