matrixStats
sum, colSums, cumsum values differ from base functions
sum, colSums, cumsum: the base functions' values are consistent with each other, and the matrixStats functions' values are consistent with each other. However, base and matrixStats functions yield different values from one another. Example code:
## sum, colSums, cumsum values differ from base-functions
##
## matrixStats_0.52.2
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## create a random real-matrix with one column
r2a = matrix(rnorm(10000, sd=0.01), ncol=1)
##
print('base functions match with themselves (zero):')
sum(r2a) - colSums(r2a)
sum(r2a) - tail(cumsum(r2a), 1)
##
print('matrixStats functions match with themselves (zero):')
matrixStats::sum2(r2a) - matrixStats::colSums2(r2a)
matrixStats::sum2(r2a) - tail(matrixStats::colCumsums(r2a), 1)
##
print('base and matrixStats functions DO NOT MATCH (NON-ZERO):')
sum(r2a) - matrixStats::sum2(r2a)
sum(r2a) - matrixStats::colSums2(r2a)
sum(r2a) - tail(matrixStats::colCumsums(r2a), 1)
Hi, and thanks for reaching out.
First, in case someone else wonders about the size of the diff:
> d <- sum(r2a) - matrixStats::sum2(r2a)
> d
[1] 1.554312e-15
This is down at the level of machine precision:
> .Machine$double.eps
[1] 2.220446e-16
So, instead of being identical (= "MATCH"), they are equal up to near machine precision:
> all.equal(sum(r2a), matrixStats::sum2(r2a))
[1] TRUE
> all.equal(sum(r2a), matrixStats::colSums2(r2a))
[1] TRUE
> all.equal(sum(r2a), as.vector(tail(matrixStats::colCumsums(r2a), 1)))
[1] TRUE
Some matrixStats functions pass identical() tests, and all pass all.equal() tests, e.g. https://github.com/HenrikBengtsson/matrixStats/blob/0.52.2/tests/sum2.R#L45. You cannot rely on identical() (or on being identical to a floating-point zero) when comparing numeric (real) values in any type of analysis.
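To see this at the prompt, here is a minimal sketch (plain base R, no package involved) of why identical() is the wrong tool: a single rounding step already produces a last-bit discrepancy that all.equal() tolerates but identical() does not.

```r
## One rounding step is enough to break exact equality.
x <- 0.1 + 0.2           # stored as 0.30000000000000004...
identical(x, 0.3)        # FALSE: the two doubles differ in the last bit
all.equal(x, 0.3)        # TRUE: difference is below the default tolerance
x - 0.3                  # on the order of .Machine$double.eps
```

The default tolerance of all.equal() (about 1.5e-8) is many orders of magnitude larger than one ulp, which is why it is the recommended comparison for numeric results.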
Hope this clarifies it.
Details
The differences observed between base::sum() and matrixStats::sum2() might be attributed to compiler optimization differences; e.g. the order in which floating-point values are added up matters. This cannot be controlled for. base::sum() is implemented as:
static Rboolean rsum(double *x, R_xlen_t n, double *value, Rboolean narm)
{
    LDOUBLE s = 0.0;
    Rboolean updated = FALSE;
    for (R_xlen_t i = 0; i < n; i++) {
        if (!narm || !ISNAN(x[i])) {
            if (!updated) updated = TRUE;
            s += x[i];
        }
    }
    [...]
}
and matrixStats::sum2() as (edited for clarity):
RETURN_TYPE METHOD_NAME_IDXS(ARGUMENTS_LIST) {
    X_C_TYPE value;
    R_xlen_t ii;
    LDOUBLE sum = 0;
    [...]
    for (ii = 0; ii < nidxs; ++ii) {
        value = x[ii];  /* edited for clarity */
        [...]
        if (!narm) {
            sum += (LDOUBLE)value;
            [...]
        } else if (!ISNAN(value)) {
            sum += (LDOUBLE)value;
        }
        [...]
    } /* for (ii ...) */
    return (double)sum;
}
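The order-dependence mentioned above can be demonstrated from R itself; this is a minimal sketch using plain double arithmetic (no long-double accumulator involved):

```r
## Floating-point addition is not associative: regrouping the same
## three terms changes the intermediate rounding and hence the result.
a <- (0.1 + 0.2) + 0.3   # 0.1 + 0.2 rounds up, carrying the error along
b <- 0.1 + (0.2 + 0.3)   # 0.2 + 0.3 happens to round to exactly 0.5
a == b                   # FALSE
a - b                    # a one-ulp difference, ~1e-16
```

Since compilers are free to reorder and vectorize summation loops differently for base R and for matrixStats, differences of this magnitude between the two are expected.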
PS. (While writing this reply, I realized that I might have missed a check for long double-to-double overflow protection when returning the result, but that is independent of your concerns.)
Thanks for a thorough explanation. I noticed this phenomenon originally when I was prompted to explore why I got interesting results using base::max.col(m, ties.method='first'), which is essentially base::which.max() applied to each row.
Do you happen to know a way to treat 'near-machine-precision equal' as truly equal when using comparison-based functions (>, <, order, rank, max, which.max, ...)?
This would perhaps be more appropriate in an issue called "near-equal values something", but I am putting it here for the sake of continuity:
I tried different ties.method settings for the functions which.max(), max.col(), order(), rank(), and matrixStats::rowRanks() (code below), and all combinations except one treat near-equal values so that one value is greater than the other. Only max.col(r2a, ties.method='random') treats near-equal values as equal. Too bad that the randomness ruins the predictability / reproducibility of the results, which is the main reason I am interested in this in the first place.
I would very much like to see these kinds of functions having an option for treating near-equal values as equal. That would make for the most robustly predictable / reproducible results. Quoting a wise man from a few comments ago: "You cannot rely on identical() (or identical to a floating-point zero) when comparing numeric (real) values in any type of analysis."
print("max.col(r2a, ties.method='random') treats near-equal values as equal")
print("all other tried ways do NOT treat near-equal values as equal")
print('r2a is a real-matrix with one row and two almost identical values')
r2a = rbind(c(0.1, 1.1 - 1.0))
r2a
print('r2a has very small difference (second is larger than first):')
diff(as.vector(r2a))
print('r2b has very small difference (second is smaller than first):')
r2b = rbind(rev(r2a))
diff(as.vector(r2b))
##
i0n = 100L
i1whichmax = integer(i0n)
i1random = integer(i0n)
i1first = integer(i0n)
i1last = integer(i0n)
for (i0i in seq_len(i0n)) {
i1whichmax[i0i] = which.max(r2a)
i1random[i0i] = max.col(r2a, ties.method='random')
i1first[i0i] = max.col(r2a, ties.method='first')
i1last[i0i] = max.col(r2a, ties.method='last')
}
table(i1whichmax) ## all 2
table(i1random) ## about 50/50 random 1s and 2s
table(i1first) ## all 2
table(i1last) ## all 2
##
print('order(r2a) and order(r2b) are always 1,2 and 2,1:')
order(r2a)
order(r2b)
print('rank(r2a) and rank(r2b) are always 1,2 and 2,1:')
rank(r2a)
rank(r2b)
print("... even with ties.method='random' :")
rank(r2a, ties.method='random')
rank(r2b, ties.method='random')
##
print('matrixStats::rowRanks(r2a) are all 1,2 regardless of ties.method:')
matrixStats::rowRanks(r2a, ties.method='max')
matrixStats::rowRanks(r2a, ties.method='average')
matrixStats::rowRanks(r2a, ties.method='min')
print('matrixStats::rowRanks(r2b) are all 2,1 regardless of ties.method:')
matrixStats::rowRanks(r2b, ties.method='max')
matrixStats::rowRanks(r2b, ties.method='average')
matrixStats::rowRanks(r2b, ties.method='min')
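One possible workaround, sketched below (this is not an official feature of base R or matrixStats): round the values to a chosen number of significant digits with signif() before feeding them to the comparison-based function. The digit count (12 here) is an arbitrary cutoff you must pick for your data; values that differ only near machine precision then collapse into genuine ties, which the deterministic ties.methods handle reproducibly.

```r
## Collapse near-machine-precision differences into exact ties
## before ranking; 12 significant digits is an arbitrary cutoff.
x <- c(0.1, 1.1 - 1.0)    # near-equal, but not identical
rank(x)                   # 1 2: ranked as different values
rank(signif(x, 12))       # 1.5 1.5: an exact tie after rounding
max.col(rbind(signif(x, 12)), ties.method = 'first')  # 1, deterministically
```

The trade-off is that two values straddling a rounding boundary can still end up on different sides of it, so this is a pragmatic heuristic rather than a true tolerance-based comparison.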
Would you consider adding notes (or a PR adding notes) to the documentation that results obtained with matrixStats may be 'equal' but not 'identical' to those of the base functions? I was looking for an "official" confirmation that this explained some numerical discrepancies I observed. My googling returned this issue, but I couldn't find anything in the documentation (?colMeans2, in my case).
Yes, I'd be open to having this documented somewhere. A PR would be appreciated.