effectsize

Integer overflow issue using non-parametric cles

Open andresfpatinomd opened this issue 3 years ago • 1 comments

Hello! Thank you for all the work you have done with this package. It is terrific! I have a question. I am assessing the common language effect size with a sample of around 137,000 observations (n1 = 28000, n2 = 108000). When I run the cles() function with the argument parametric = FALSE, all my p_superiority values are NA and I receive the following warning message: In n1 * n2 : NAs produced by integer overflow. I read about this issue and found the following:

  • Integers are limited to about +/-2*10^9 (https://stackoverflow.com/questions/8804779/what-is-integer-overflow-in-r-and-how-can-it-happen). You can see the exact limit by typing .Machine$integer.max in the console.
  • As far as I understand, cles() and p_superiority() with parametric = FALSE use the rank_biserial() function, whose body contains an n1 * n2 term.
  • Inside the rank_biserial() function, n1 and n2 are created with the length() function, so they are integers.
  • For a relatively large sample with unequal group sizes such as mine (n1 = 28000, n2 = 108000), n1 * n2 = 3024000000, which is higher than .Machine$integer.max (2147483647). So when n1 and n2 are integers, the line n1 * n2 leads to an integer overflow and subsequent NAs in rank_biserial() and the non-parametric CLES.
  • I thought it could be solved by creating n1 and n2 as n1 <- as.numeric(length(x)) and n2 <- as.numeric(length(y)) in the rank_biserial() function. I tried to fix it in my local environment using the edit() or fix() functions; however, when I run the edited function I get the error: Error in .get_data_2_samples(x, y, data, verbose, ...) : could not find function ".get_data_2_samples". So I am not totally sure if I am right.
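The overflow described in the points above can be checked in plain R, without effectsize. This is only a minimal sketch using the group sizes quoted in this issue; the variable names (int_max, prod_int, prod_dbl) are made up for illustration and are not taken from the package source.

```r
# Largest value a (32-bit signed) R integer can hold
int_max <- .Machine$integer.max   # 2147483647

n1 <- 28000L    # integers, like the values returned by length()
n2 <- 108000L

# integer * integer stays integer; the true product exceeds int_max, so R
# returns NA and warns "NAs produced by integer overflow"
# (the warning is suppressed here only to keep the sketch quiet)
prod_int <- suppressWarnings(n1 * n2)

# Converting one operand to double first keeps the product finite
prod_dbl <- as.numeric(n1) * n2

is.na(prod_int)        # TRUE
prod_dbl == 3024000000 # TRUE
prod_dbl > int_max     # TRUE
```

Only one operand needs to be a double, because R promotes the other one before multiplying.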

Here is some code which reproduces the error:

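library(easystats)

# 137,000 observations alternating between two groups (n1 = n2 = 68500),
# so n1 * n2 is larger than .Machine$integer.max
z0 <- data.frame(
  x = abs(rnorm(137000, mean = c(10, 12), sd = c(2, 6))),
  y = factor(rep(letters[1:2], length.out = 137000))
)

cles(x = "x", y = "y", data = z0, parametric = FALSE)
# or
rank_biserial(x = "x", y = "y", data = z0)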

What do you think about it?

andresfpatinomd, Sep 10 '22 19:09

I can confirm that changing those ns to as.double(length(x)) makes the code run without errors or warnings.

However, I am not really a comp-sci guy, so I'm not sure what the risks are in such a case. Will floating-point arithmetic jeopardize the integrity of the calculations when the Ns are small? @easystats/maintainers
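As a rough check on the floating-point question: R doubles represent every whole number exactly up to 2^53 (about 9 * 10^15), so a product of two sample sizes that stays below that limit should come out the same whether it is computed as integer or as double. The sketch below uses made-up small Ns plus the group sizes from this issue; it does not call any effectsize code.

```r
# Doubles hold every whole number exactly up to this limit
exact_limit <- 2^53

# Small Ns: the integer and double products agree exactly
n1 <- 12L
n2 <- 7L
small_int <- n1 * n2
small_dbl <- as.double(n1) * as.double(n2)
small_int == small_dbl          # TRUE

# The group sizes from this issue are far below the exact range
big_dbl <- as.double(28000L) * as.double(108000L)
big_dbl < exact_limit           # TRUE

# Whole numbers stop being exact only beyond 2^53
(exact_limit + 1) == exact_limit  # TRUE: the +1 is lost
```

So precision would only become a concern for products above 2^53, which is far larger than the sample sizes discussed here.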

mattansb, Sep 12 '22 18:09

Fixed

library(effectsize)


n <- floor(sqrt(.Machine$integer.max) + 1)
x <- rnorm(n)
y <- rnorm(n) + 0.2

rank_biserial(x, y)
#> r (rank biserial) |         95% CI
#> ----------------------------------
#> -0.11             | [-0.12, -0.10]

Created on 2022-11-08 with reprex v2.0.2
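For readers wondering about the choice of n in the reprex above: it appears to be the smallest equal group size for which n * n no longer fits in an integer. The sketch below only checks that arithmetic in base R; n_int and overflowed are illustrative names, not part of effectsize.

```r
n <- floor(sqrt(.Machine$integer.max) + 1)   # 46341

# One observation fewer per group would still fit in an integer
(n - 1)^2 <= .Machine$integer.max            # TRUE

# With n per group, the product is past the integer limit
n^2 > .Machine$integer.max                   # TRUE

# Computed with integers, the same product overflows to NA
n_int <- as.integer(n)
overflowed <- suppressWarnings(n_int * n_int)
is.na(overflowed)                            # TRUE
```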

mattansb, Nov 08 '22 14:11