matrixStats icon indicating copy to clipboard operation
matrixStats copied to clipboard

sum, colSums, cumsum values differ from base-functions

Open arrayn opened this issue 7 years ago • 5 comments

sum, colSums, cumsum: Base-functions values match with themselves. MatrixStats-functions values match with themselves. However, base and matrixStats functions yield different values from each other. Example code:

## sum, colSums, cumsum values differ from base-functions
##
## matrixStats_0.52.2
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## create a random real-matrix with one column
r2a = matrix(rnorm(10000, sd=0.01), ncol=1)
##
print('base functions match with themselves (zero):')
sum(r2a)               - colSums(r2a)
sum(r2a)               - tail(cumsum(r2a), 1)
##
print('matrixStats functions match with themselves (zero):')
matrixStats::sum2(r2a) - matrixStats::colSums2(r2a)
matrixStats::sum2(r2a) - tail(matrixStats::colCumsums(r2a), 1)
##
print('base and matrixStats functions DO NOT MATCH (NON-ZERO):')
sum(r2a)               - matrixStats::sum2(r2a)
sum(r2a)               - matrixStats::colSums2(r2a)
sum(r2a)               - tail(matrixStats::colCumsums(r2a), 1)

arrayn avatar Apr 26 '17 10:04 arrayn

Hi, and thanks for reaching out.

First, in case someone else wonders about the size of the diff:

> d <- sum(r2a) - matrixStats::sum2(r2a)
> d
[1] 1.554312e-15

This is down scraping at the level of:

> .Machine$double.eps
[1] 2.220446e-16

So, instead of being identical (= "MATCH"), they are equal to some near-machine limit):

> all.equal(sum(r2a), matrixStats::sum2(r2a))
[1] TRUE
> all.equal(sum(r2a), matrixStats::colSums2(r2a))
[1] TRUE
> all.equal(sum(r2a), as.vector(tail(matrixStats::colCumsums(r2a), 1)))
[1] TRUE

Some matrixStats functions pass identical() tests and all pass all.equal() tests, e.g. https://github.com/HenrikBengtsson/matrixStats/blob/0.52.2/tests/sum2.R#L45. You cannot rely on identical() (or identical to a floating-point zero) when comparing numeric (real) values in any type of analysis.

Hope this clarifies it.

Details

The differences observed between base::sum() and matrixStats::sum2() might be attributed to compiler optimization differences, e.g. the order in which you add up floating-point values matters. This cannot be controlled for. base::sum() is implemented as:

static Rboolean rsum(double *x, R_xlen_t n, double *value, Rboolean narm)
{
    LDOUBLE s = 0.0;
    Rboolean updated = FALSE;

    for (R_xlen_t i = 0; i < n; i++) {
	if (!narm || !ISNAN(x[i])) {
	    if(!updated) updated = TRUE;
	    s += x[i];
	}
    }
[...]
}

and matrixStats::sum2() as (edited for clarity):

RETURN_TYPE METHOD_NAME_IDXS(ARGUMENTS_LIST) {
  X_C_TYPE value;
  R_xlen_t ii;
  LDOUBLE sum = 0;
[...]
  for (ii=0; ii < nidxs; ++ii) {
    value = x[ii];  # [ edited for clarity]
[...]
    if (!narm) {
      sum += (LDOUBLE)value;
[...]
    } else if (!ISNAN(value)) {
      sum += (LDOUBLE)value;
    }
[...]
  } /* for (ii ...) */

  return (double)sum;
}

PS. (Writing reply I realize that I might have missed a check for long double-to-double overflow protection when returning the results, but that independent of you concerns).

HenrikBengtsson avatar Apr 26 '17 15:04 HenrikBengtsson

Thanks for a thorough explanation. I noticed this phenomenon originally when I was prompted to explore why I got interesting results using base::max.col(m, ties.method='first') , which is essentially base::which.max() applied to each row.

Do you happen to know a way to treat 'near-machine-precision equal' as true-equal when using comparing-based functions (> , < , order , rank , max, which.max, ...)?

arrayn avatar Apr 26 '17 16:04 arrayn

This would maybe be more appropriate on issue called "near-equal values something", but I put this here for the sake of continuity:

I tried different ties.methods of functions which.max() max.col() order() rank() matrixStats::rowRanks() (code below) and all except one combination treat near-equal values so that one value is greater than the other.

Only max.col(r2a, ties.method='random') treats near-equal values as equal. Too bad that the randomness ruins the predictability / reproductibility of results - which is the main reason I am interested in this stuff in the first place.

I would very much like to see these kinds of functions having an option for treating near-equal values as equal. That would make for the most robustly predictable / reproducible results. Quoting a wise man from few comments ago: "You cannot rely on identical() (or identical to a floating-point zero) when comparing numeric (real) values in any type of analysis."

print("max.col(r2a, ties.method='random' treats near-equal values as equal")
print("all other tried ways do NOT treat near-equal values as equal")
print('r2a is a real-matrix with one row and two almost identical values')
r2a = rbind(c(0.1, 1.1 - 1.0))
r2a
print('r2a has very small difference (second is larger than first):')
diff(as.vector(r2a))
print('r2b has very small difference (second is smaller than first):')
r2b = rbind(rev(r2a))
diff(as.vector(r2b))
##
i0n        = 100L
i1whichmax = integer(i0n)
i1random   = integer(i0n)
i1first    = integer(i0n)
i1last     = integer(i0n)
for (i0i in seq_len(i0n)) {
  i1whichmax[i0i] = which.max(r2a)
  i1random[i0i]   = max.col(r2a, ties.method='random')
  i1first[i0i]    = max.col(r2a, ties.method='first')
  i1last[i0i]     = max.col(r2a, ties.method='last')
}
table(i1whichmax) ## all 2
table(i1random)   ## about 50/50 random 1s and 2s
table(i1first)    ## all 2
table(i1last)     ## all 2
##
print('order(r2a) and order(r2b) are always 1,2 and 2,1:')
order(r2a)
order(r2b)
print('rank(r2a) and rank(r2b) are always 1,2 and 2,1:')
rank(r2a)
rank(r2b)
print("... even with ties.method='random' :")
rank(r2a, ties.method='random')
rank(r2b, ties.method='random')
##
print('matrixStats::rowRanks(r2a) are all 1,2 regardless of ties.method:')
matrixStats::rowRanks(r2a, ties.method='max')
matrixStats::rowRanks(r2a, ties.method='average')
matrixStats::rowRanks(r2a, ties.method='min')
print('matrixStats::rowRanks(r2b) are all 2,1 regardless of ties.method:')
matrixStats::rowRanks(r2b, ties.method='max')
matrixStats::rowRanks(r2b, ties.method='average')
matrixStats::rowRanks(r2b, ties.method='min')

arrayn avatar Apr 27 '17 14:04 arrayn

Would you consider adding notes (or a PR adding notes) to the documentation that results obtained with matrixStats may be 'equal' but not 'identical' to base functions? I was looking for an "official" confirmation that this explained some numerical discrepancies I observed. My googling returned this issue but I couldn't find anything in the documentation (?colMean2, in my case).

PeteHaitch avatar Apr 18 '18 13:04 PeteHaitch

Yes, I'd be open to have this documented somewhere. A PR would be appreciated.

HenrikBengtsson avatar Apr 21 '18 23:04 HenrikBengtsson