funique() returns two zeros
I am currently struggling with a problem that could be related to https://github.com/SebKrantz/collapse/issues/452
Zeros appear twice:
library(collapse)
# all columns are double
df <- OpenML::getOMLDataSet(data.id = 45106L)$data
unique(df$town) # 1 0
funique(df$town) # 1 0 0
levels(factor(df$town)) # [1] "0" "1"
levels(collapse::qF(df$town)) # [1] "0" "0" "1"
The same happens in group-by operations on town, e.g., when comparing base::rowsum() and fmean(): the latter produces two categories "0".
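For a self-contained illustration without the OpenML download (my assumption, not verified against data.id 45106, is that the column contains a negative zero produced by rounding):
neg0 <- round(-0.3)     # IEEE 754 rounding of a small negative number yields -0
neg0 == 0               # TRUE: comparisons treat -0 and 0 as equal
1 / neg0                # -Inf: but the sign bit is set
funique(c(1, 0, neg0))  # bitwise hashing then keeps two distinct zeros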
The workaround by @NicChr of adding 0 to a double vector also works in this case, but I wonder how robust it is.
I'm not sure how "robust" it is, but I got the idea from this thread, which suggests adding 0 to eliminate negative zeros: https://stackoverflow.com/questions/13767744/detecting-and-adjusting-for-negative-zero
It seems to work in the examples I showed, and since 0 is the additive identity of the reals, the output should remain the same for most objects that base::round() accepts.
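As a quick sketch of why the addition helps: under IEEE 754 arithmetic, -0 + 0 evaluates to +0, so adding zero clears the sign bit of a signed zero and leaves every other value unchanged:
x <- c(-0, 0, 1.5, -2)
sprintf("%f", x)      # "-0.000000" "0.000000" "1.500000" "-2.000000"
sprintf("%f", x + 0)  # "0.000000"  "0.000000" "1.500000" "-2.000000"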
One long-term solution could be to export these functions in the collapse package:
#' @export
round <- function(x, digits = 0, ...) {
  # Adding 0 normalises any -0 produced when rounding small negative numbers
  base::round(x, digits, ...) + 0
}
#' @export
trunc <- function(x, ...) {
  # Same normalisation for truncation, which can also yield -0
  base::trunc(x, ...) + 0
}
The solution of adding 0 seems to work for C-based languages, at least if they stick to ISO/IEC 60559.
Thanks. I think, though, that this is something that needs to be solved in R. In C, adding 0 to every double or checking for 0 would have noticeable performance implications. So the performance-friendly way of handling this is to call funique() on your large vector, add 0 to the (small) result in R, and call funique() again. Or do it all in one go with unique().
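In R, that two-pass approach could look something like this (just a sketch; dedup_dbl is an illustrative name, not an existing function):
dedup_dbl <- function(x) {
  u <- funique(x)                # fast first pass; may keep both 0 and -0
  if (is.double(u)) u <- u + 0   # normalise signed zeros on the small result
  funique(u)                     # cheap second pass merges the two zeros
}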
I added a branch "zero_dups" with this feature implemented in C: remotes::install_github("SebKrantz/collapse", ref = "zero_dups"). Seems to work. You can help me benchmark it. I think it's about a 10% slower execution. Not sure that's worth it. Note that it only works for funique() and group() with default settings. funique(x, sort = TRUE) is implemented in Rcpp (using the sugar sort_unique algorithm, which suffers from the same issue and which means I'd have to rewrite that as well).
Wow!
In my R code, I am doing:
if (is.double(x))
  x <- x + 0
xu <- collapse::funique(x)
[...]
M <- collapse::fmean(..., g = x)
S <- collapse::fsd(..., g = x)
[...]
Not sure if you want to slow things down by 10% just because of this...
I'm also seeing about a 10% reduction in speed.
I also tried the ternary operator, e.g. tpv.d = px[i] != 0.0 ? px[i] : 0.0, and saw even more of a drop-off in speed (~20-30%) for vectors with lots of zeros.
In terms of an R solution, you can alternatively use anyv() to check for zeros after running funique():
funique2 <- function(x, ...) {
  out <- funique(x, ...)
  # Only fall back to base unique() if the result is double and contains a zero,
  # since 0 and -0 may then both be present
  if (is.double(out) && anyv(out, 0)) {
    unique(out)
  } else {
    out
  }
}
library(collapse)
#> collapse 2.0.17, see ?`collapse-package` or ?`collapse-documentation`
#>
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#>
#> D
library(bench)
x <- round(rnorm(10^7)) # Many zeros
y <- rnorm(10^7, mean = 100 ) # No zeros
z <- floor(rnorm(10^7, mean = 10^4, sd = 10^4)) # Some zeros
mark(funique(x), funique2(x), check = F) # Many zeros, not many unique values
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 funique(x) 55ms 55.1ms 18.1 38.2MB 22.7
#> 2 funique2(x) 54.4ms 54.7ms 17.7 38.1MB 14.1
mark(funique(y), funique2(y), check = F) # No zeros, many unique values
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 funique(y) 308ms 308ms 3.24 38.1MB 3.24
#> 2 funique2(y) 321ms 321ms 3.11 38.1MB 3.11
mark(funique(z), funique2(z), check = F) # Some zeros, medium amount of unique values
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 funique(z) 179ms 183ms 5.47 38.7MB 2.73
#> 2 funique2(z) 170ms 177ms 5.64 40.5MB 2.82
Created on 2024-10-27 with reprex v2.1.1
Perhaps as an addition (for other functions like qF() etc.), you can add a zero by reference with out %+=% 0, which is much more efficient than out <- out + 0. Let's keep this issue open; I still need to check if there is a faster way to change the sign bit for zeros.
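A rough sketch of that by-reference variant (funique3 is just an illustrative name):
funique3 <- function(x, ...) {
  out <- funique(x, ...)
  if (is.double(out) && anyv(out, 0)) {
    out %+=% 0           # clears the sign bit in place, no extra copy
    out <- funique(out)  # merge the two zeros that are now identical
  }
  out
}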
Actually I just tested, using an integer 0 is pretty fast! @NicChr can you confirm that adding an integer 0 (0 instead of 0.0) gives much less performance cost?
Assuming you mean replacing '+ 0.0' with '+ 0' in the 'kit_dup.c' file...
I looked at master, zero_dups, and a third branch, zero_dups2, which replaces '+ 0.0' with '+ 0'. I compared median benchmark times after installing each branch with remotes::install_local(), so the comparison includes compiler optimisations.
The data and code I used
library(collapse)
library(bench)
set.seed(42)
x <- round(rnorm(10^7)) # Many zeros
y <- rnorm(10^7, mean = 100 ) # No zeros
z <- floor(rnorm(10^7, mean = 10^4, sd = 10^4)) # Some zeros
mark(funique(x), iterations = 15, memory = FALSE)
mark(funique(y), iterations = 15, memory = FALSE)
mark(funique(z), iterations = 15, memory = FALSE)
The results
x -- master - 22.9 ms / zero_dups - 25.4 ms / zero_dups2 - 25.5 ms
y -- master - 287 ms / zero_dups - 299 ms / zero_dups2 - 295 ms
z -- master - 145 ms / zero_dups - 157 ms / zero_dups2 - 153 ms
There's a slight performance cost on my side whether you add a double 0 or an integer 0.
Ok, thanks! I will check a bit further still. But if the cost is of this order (a 2.8% performance decline with the integer 0 for the no-zeros case, which is the most common), I think we can go for it.