duckdb-r Duckdb-R n_distinct translation doesn't allow management of NA values

Duckdb-R n_distinct translation doesn't allow management of NA values

Open rplsmn opened this issue 6 months ago • 5 comments

Hello,

Duckdb-r has a dbplyr-like translation for the n_distinct() function that leverages the row() function in duckdb, opening-up more flexibilty. See here for the discussion : https://github.com/duckdb/duckdb-r/issues/110

And here for the function : R/backend-dbplyr__duckdb_connection.R

But this has consequences related to NA/NULL values in the table that imo make it less desirable than the previous translation.

Reprex below :

Try to count the number of distinct values in var2 for each var1 id It should be 2 for var1 = 1 and 1 for var1 = 2

library(dplyr)
library(dbplyr)
library(duckdb)
library(DBI)

my_df <- 
  data.frame(
    var1 = c(1, 1, 1, 2, 2),
    var2 = c("value1", "value2", "value1", "value1", NA_character_)
  )

drv <- duckdb()
con <- dbConnect(drv)

dbWriteTable(con, "my_df", my_df)

# Should be 2 for var1 = 1 and 1 for var1 = 2

tbl(con, "my_df") |> 
  summarise(
    test = n_distinct(
      var2
    ),
    .by = "var1"
  )

Result :

tbl(con, "my_df") |> 
+   summarise(
+     test = n_distinct(
+       var2
+     ),
+     .by = "var1"
+   )
# Source:   SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
   var1  test
  <dbl> <dbl>
1     2     2
2     1     2

Handle NAs

No change if we try to handles NAs

# Try to handles NAs

tbl(con, "my_df") |> 
  summarise(
    test = n_distinct(
      var2[!is.na(var2)]
    ),
    .by = "var1"
  )

Results :

tbl(con, "my_df") |> 
+   summarise(
+     test = n_distinct(
+       var2[!is.na(var2)]
+     ),
+     .by = "var1"
+   )
# Source:   SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
   var1  test
  <dbl> <dbl>
1     2     2
2     1     2

# Raw SQL Without row() : correct result, even without NA handling

tbl(con, "my_df") |> 
  summarise(
    test = sql(
      "COUNT(DISTINCT var2)"
    ),
    .by = "var1"
  )

Result :

tbl(con, "my_df") |> 
+   summarise(
+     test = sql(
+       "COUNT(DISTINCT var2)"
+     ),
+     .by = "var1"
+   ) 
# Source:   SQL [2 x 2]
# Database: DuckDB v1.0.0 [unknown@Linux 5.15.0-116-generic:R 4.2.2/:memory:]
   var1  test
  <dbl> <dbl>
1     2     1
2     1     2

That's it. I don't know how to have the best of both world, where structs let us handle more complex cases but creates these weird situations with NAs that count as 1

Jul 26 '24 12:07 rplsmn

duckdb-r duckdb-r copied to clipboard

Duckdb-R n_distinct translation doesn't allow management of NA values

Handle NAs

duckdb-r
duckdb-r copied to clipboard