
funique() returns two 0s

Open mayer79 opened this issue 1 year ago • 11 comments

I am currently struggling with a problem that could be related to https://github.com/SebKrantz/collapse/issues/452

Zeros appear twice:

library(collapse)

# all columns are double
df <- OpenML::getOMLDataSet(data.id = 45106L)$data

unique(df$town)   # [1] 1 0
funique(df$town)  # [1] 1 0 0

levels(factor(df$town))        # [1] "0" "1"
levels(collapse::qF(df$town))  # [1] "0" "0" "1"

The same happens in group-by operations on town, e.g., when comparing base::rowsum() and fmean(): the latter gets two categories "0".

mayer79 avatar Oct 26 '24 11:10 mayer79

The workaround by @NicChr of adding 0 to a double vector also works in this case. But I wonder how robust it is.

mayer79 avatar Oct 26 '24 12:10 mayer79

I'm not sure how "robust" it is but I did get the idea from this thread which suggests adding 0 to eliminate negative zeros: https://stackoverflow.com/questions/13767744/detecting-and-adjusting-for-negative-zero

It seems to work in the examples I showed, and since 0 is the additive identity of the reals, the output should remain the same for most objects that base::round() accepts.

One long-term solution could be to export these functions in the collapse package:

#' @export
round <- function(x, digits = 0, ...){
  base::round(x, digits, ...) + 0
}
#' @export
trunc <- function(x, ...){
  base::trunc(x, ...) + 0
}

NicChr avatar Oct 26 '24 14:10 NicChr

The solution of adding 0 seems to work for C-based languages, at least if they stick to ISO/IEC 60559 (IEEE 754).

mayer79 avatar Oct 26 '24 15:10 mayer79

Thanks. I think though this is something that needs to be solved in R. In C, adding 0 to every double, or checking each value for 0, would have noticeable performance implications. So the performance-friendly way of handling this is to call funique() on your large vector, then add 0 in R, and call funique() again. Or do it all in one go with unique().

SebKrantz avatar Oct 26 '24 17:10 SebKrantz

I added a branch "zero_dups" with this feature implemented in C: remotes::install_github("SebKrantz/collapse", ref = "zero_dups"). Seems to work. You can help me benchmark it. I think it's roughly 10% slower. Not sure that's worth it. Note that it only works for funique() and group() with default settings. funique(x, sort = TRUE) is implemented in Rcpp using the sugar sort_unique algorithm, which suffers from the same issue and which I would have to rewrite as well.

SebKrantz avatar Oct 26 '24 17:10 SebKrantz

Wow!

In my R code, I am doing:

if (is.double(x)) 
  x <- x + 0

xu <- collapse::funique(x)
[...]
M <- collapse::fmean(..., g = x)
S <- collapse::fsd(..., g = x)
[...]

Not sure if you want to slow down the speed by 10% just because of this...

mayer79 avatar Oct 26 '24 18:10 mayer79

> I added a branch "zero_dups" with this feature implemented in C remotes::install_github("SebKrantz/collapse", ref = "zero_dups"). Seems to work. You can help me benchmark it. I think it's like a 10% slower execution. Not sure that's worth it. Note that it only works for funique() and group() with default settings. funique(x, sort = TRUE) is implemented in Rcpp (using the sugar sort_unique algorithm, which suffers from the same issue, and which means I'd have to rewrite that as well).

I'm also seeing about a 10% reduction in speed. I also tried the ternary operator, e.g. tpv.d = px[i] != 0.0 ? px[i] : 0.0, and saw an even bigger drop-off in speed (~20-30%) for vectors with lots of zeros.

In terms of an R solution, you can alternatively use anyv() to check for zeros after running funique():

funique2 <- function(x, ...){
  out <- funique(x, ...)
  if (is.double(out) && anyv(out, 0)){
    unique(out)
  } else {
    out
  }
}

library(collapse)
#> collapse 2.0.17, see ?`collapse-package` or ?`collapse-documentation`
#> 
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#> 
#>     D
library(bench)
x <- round(rnorm(10^7)) # Many zeros
y <- rnorm(10^7, mean = 100) # No zeros
z <- floor(rnorm(10^7, mean = 10^4, sd = 10^4)) # Some zeros


mark(funique(x), funique2(x), check = F) # Many zeros, not many unique values
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 funique(x)      55ms   55.1ms      18.1    38.2MB     22.7
#> 2 funique2(x)   54.4ms   54.7ms      17.7    38.1MB     14.1
mark(funique(y), funique2(y), check = F) # No zeros, many unique values
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 funique(y)     308ms    308ms      3.24    38.1MB     3.24
#> 2 funique2(y)    321ms    321ms      3.11    38.1MB     3.11
mark(funique(z), funique2(z), check = F) # Some zeros, medium amount of unique values
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 funique(z)     179ms    183ms      5.47    38.7MB     2.73
#> 2 funique2(z)    170ms    177ms      5.64    40.5MB     2.82

Created on 2024-10-27 with reprex v2.1.1

NicChr avatar Oct 27 '24 08:10 NicChr

Perhaps as an addition (for other functions like qF() etc.), you can add a zero by reference with out %+=% 0, which is much more efficient than out <- out + 0. Let's keep this issue open; I still need to check if there is a faster way to change the sign bit for zeros.

SebKrantz avatar Oct 28 '24 16:10 SebKrantz

Actually I just tested: using an integer 0 is pretty fast! @NicChr can you confirm that adding an integer 0 (0 instead of 0.0) gives much less performance cost?

SebKrantz avatar Oct 28 '24 16:10 SebKrantz

> Actually I just tested, using an integer 0 is pretty fast! @NicChr can you confirm that adding an integer 0 (0 instead of 0.0) gives much less performance cost?

Assuming you mean replacing '+ 0.0' with '+ 0' in the 'kit_dup.c' file...

I looked at master, zero_dups, and a third branch zero_dups2 (which replaced '+ 0.0' with '+ 0'), and compared median benchmark times after installing each branch using remotes::install_local(), so the comparison includes compiler optimisations.

The data and code I used

library(collapse)
library(bench)

set.seed(42)
x <- round(rnorm(10^7)) # Many zeros
y <- rnorm(10^7, mean = 100) # No zeros
z <- floor(rnorm(10^7, mean = 10^4, sd = 10^4)) # Some zeros

mark(funique(x), iterations = 15, memory = FALSE)
mark(funique(y), iterations = 15, memory = FALSE)
mark(funique(z), iterations = 15, memory = FALSE)

The results

x -- master: 22.9 ms / zero_dups: 25.4 ms / zero_dups2: 25.5 ms
y -- master: 287 ms / zero_dups: 299 ms / zero_dups2: 295 ms
z -- master: 145 ms / zero_dups: 157 ms / zero_dups2: 153 ms

There's a slight performance cost on my side whether you add a double or an integer 0.

NicChr avatar Oct 29 '24 09:10 NicChr

Ok, thanks! I will still check a bit further. But I think if it is of this order (a 2.8% performance decline for the integer-0 addition in the no-zeros case, the most common case), we can go for it.

SebKrantz avatar Oct 29 '24 11:10 SebKrantz