effectsize
Integer overflow issue using non-parametric cles
Hello! Thank you for all the work you have done on this package. It is terrific! I have a question. I am assessing the common language effect size with a sample of around 137,000 observations (n1 = 28000, n2 = 108000). When I run the `cles()` function with the argument `parametric = FALSE`, all my p_superiority values are NA and I receive the warning message `In n1 * n2 : NAs produced by integer overflow`. I read about this issue and found something interesting:
- Integers in R are limited to about ±2 × 10^9 (https://stackoverflow.com/questions/8804779/what-is-integer-overflow-in-r-and-how-can-it-happen). You can see the exact limit by typing `.Machine$integer.max` in the console.
- Regarding the body of `cles()` and `p_superiority()` with `parametric = FALSE`: I understand they use the `rank_biserial()` function, which has an `n1 * n2` term in its body.
- Inside `rank_biserial()`, `n1` and `n2` are created with the `length()` function, so they are integers.
- For a relatively large sample with unequal group sizes such as mine (n1 = 28000, n2 = 108000), n1 * n2 = 3,024,000,000, which is larger than `.Machine$integer.max` (2,147,483,647). Thus, when `n1` and `n2` are integers, the `n1 * n2` line leads to an integer overflow and the subsequent NAs in `rank_biserial()` and the non-parametric CLES.
- I thought this could be solved if those lines created `n1` and `n2` as `n1 <- as.numeric(length(x))` and `n2 <- as.numeric(length(y))` in the `rank_biserial()` function. I tried to apply this fix in my local environment using the `edit()` or `fix()` functions; however, when I run the edited function I get the error `Error in .get_data_2_samples(x, y, data, verbose, ...) : could not find function ".get_data_2_samples"`. So I am not totally sure I am right.
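The overflow described in the points above can be reproduced in two lines of base R, with no data involved. This is only a sketch of the mechanism, using the group sizes from the issue:

```r
# length() returns integers, so n1 and n2 would be integers inside rank_biserial():
n1 <- 28000L
n2 <- 108000L

# integer * integer stays integer; 3,024,000,000 exceeds .Machine$integer.max,
# so the product overflows to NA with the warning "NAs produced by integer overflow":
n1 * n2

# Coercing either operand to double before multiplying avoids the overflow:
as.numeric(n1) * n2
```

The same coercion applied inside `rank_biserial()` is exactly the fix proposed above.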
Here is some code that reproduces the error:
```r
library(easystats)

z0 <- data.frame(
  x = abs(rnorm(137000, mean = c(10, 12), sd = c(2, 6))),
  y = factor(rep(letters[1:2], length.out = 137000))
)

cles(x = "x", y = "y", data = z0, parametric = FALSE)
# or
rank_biserial(x = "x", y = "y", data = z0)
```
What do you think about it?
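A side note on the `could not find function ".get_data_2_samples"` error mentioned above: `edit()`/`fix()` save the modified function into the global environment, so the copy loses access to the package's internal (non-exported) helpers. The mechanism can be sketched with a toy "namespace" (the helper name here is made up for illustration):

```r
# A stand-in for a package namespace containing an internal helper:
ns <- new.env()
ns$.internal_helper <- function() 42

# A function defined "inside the package": its environment is the namespace,
# so the internal helper is found at call time.
f <- function() .internal_helper()
environment(f) <- ns
f()  # 42

# An edit()-style copy living in the global environment cannot see the helper:
g <- f
environment(g) <- globalenv()
# g() would now fail: could not find function ".internal_helper"

# Pointing the copy's environment back at the namespace restores access:
environment(g) <- ns
g()  # 42
```

For experimenting with a local patch of `rank_biserial()`, the analogous (experimentation-only) step would be `environment(my_patched_fn) <- asNamespace("effectsize")`.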
I can confirm that changing those ns to as.double(length(x)) makes the code run without errors or warnings.
However, I am not really a comp-sci guy, so I'm not sure what the risks are in such a case. Will floating point jeopardize the integrity of the calculations when the Ns are small? @easystats/maintainers
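On the floating-point question: this is a general IEEE-754 property rather than anything specific to effectsize. Double-precision floats represent every integer exactly up to 2^53 (about 9 × 10^15), so any realistic `n1 * n2` is computed exactly; precision only degrades beyond that bound. A quick check in base R:

```r
# Doubles are exact for integers up to 2^53:
(2^53 - 1) + 1 == 2^53   # TRUE: still exact at the boundary
(2^53) + 1 == 2^53       # TRUE: 2^53 + 1 is not representable and rounds back

# The product from this issue is many orders of magnitude below that limit:
28000 * 108000 < 2^53    # TRUE
```

So for small (or even very large) Ns, switching the counts to doubles should not affect the results.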
Fixed
```r
library(effectsize)

n <- floor(sqrt(.Machine$integer.max) + 1)
x <- rnorm(n)
y <- rnorm(n) + 0.2

rank_biserial(x, y)
#> r (rank biserial) |         95% CI
#> ----------------------------------
#> -0.11             | [-0.12, -0.10]
```
Created on 2022-11-08 with reprex v2.0.2