
WISH: matrixStats for vectors

Open matthieugomez opened this issue 10 years ago • 4 comments

It seems like some functions from matrixStats are faster than base R functions on simple vectors, which is impressive.

Since I primarily work with vectors (data.frames), I'd be interested to know which functions are faster than base R even on simple vectors. Could you add such a list to the README? The benchmarks only cover the matrix case.

From what I understand, the speed-up for vectors comes from (i) dedicated methods for integers and (ii) handling na.rm=TRUE in a way that avoids creating a new vector. Are there other reasons I'm missing?
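
For point (ii), a minimal sketch of what I mean (assuming the microbenchmark package is installed): base R's mean.default() handles na.rm=TRUE by first building a new vector with the NAs dropped, so both calls below pay for that allocation, whereas a C implementation can simply skip NAs while scanning the data once.

library(microbenchmark)
x <- rnorm(1e7)
x[sample(length(x), 1e5)] <- NA
microbenchmark(
  mean(x, na.rm = TRUE),  # the copy happens inside mean.default()
  mean(x[!is.na(x)]),     # the same copy, made explicit
  times = 20
)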

matthieugomez avatar Jun 02 '15 19:06 matthieugomez

To complete my previous post: base R lacks a set of functions to efficiently compute

  • moments: mean, sd, var, cov, skewness, kurtosis
  • type-3 quantiles: min, max, and any quantile.

All with optional na.rm, row, and weight arguments.

It seems like this package almost has this set of functions (in place of the row option, which apparently is forthcoming, I use weight = weight * boolean vector).

It'd be super nice to export this set of functions with consistent naming and syntax (for now, some functions require calling the col* variant with dim = c(length(x), 1)), so that more people use them.
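
For illustration, roughly what I do today (a minimal sketch; it assumes matrixStats is installed and that weightedMean() and the dim= argument behave as used elsewhere in this thread):

library(matrixStats)
x <- rnorm(1e5)
w <- runif(1e5)

# Treat the vector as a one-column matrix so the col* functions apply:
colRanges(x, dim = c(length(x), 1))

# Emulate row/subset selection by zeroing out weights with a boolean vector:
keep <- x > 0
weightedMean(x, w = w * keep)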

matthieugomez avatar Jun 02 '15 21:06 matthieugomez

Hi and thanks for the positive feedback.

First, I guess you've already found the benchmark reports on the wiki. There you'll also see how you can run them yourself:

path <- system.file("benchmarking", package="matrixStats")
R.rsp::rfile("index.md.rsp", path=path)

This will actually benchmark a few more functions, of which some are "hidden" vector functions that we've been toying around with and that may or may not end up in the stable/public API, which is why I don't post them online at this point.

So, yes, I/we are looking into adding some optimized functions for calculations on vectors. But, for various reasons, I decided to hold that back for a while until we've got the matrix-based calculations set.

I'd say the main reasons why matrixStats is fast are (a rough illustration follows the list):

  • Optimized calculations for integers and doubles, when possible.
  • Avoiding copying of objects, e.g. coercion from integer to double, dropping NAs etc.
  • Accessing data in a way that increases the chances for cache hits (at least we keep that in mind when implementing the code)
  • Accessing elements and organizing calculations differently depending on whether the operation is row- or column-based (this avoids copying, minimizes memory usage, and sometimes increases cache hits).
  • (maybe something else I forgot)
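
As a rough illustration of the copy-avoidance and access-pattern points above, a minimal sketch (assuming matrixStats and microbenchmark are installed):

library(matrixStats)
library(microbenchmark)
X <- matrix(rnorm(1000 * 1000), nrow = 1000)
microbenchmark(
  colMedians(X),        # one native pass over contiguous columns, no per-column copies
  apply(X, 2, median),  # one extracted copy + one R function call per column
  times = 20
)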

On top of this, we're adding support for specifying the subset of elements that the calculations should be done on. This is done by a talented student with support from the Google Summer of Code 2015 framework. Part of this project is to also add support for implicit parallel (multi-core) computations. These efforts will bring some changes (and cleanup/restructuring) to the internal native API.

So, when this is in place and has stabilized, yes, I can certainly see that it would make sense to open up vector functions in the R API that tie into these low-level functions. As you say, at the end of the day, this is almost already there, since one can turn the vector into a one-column matrix already today (and yes, I've also toyed around with letting the user specify the dimension as a separate optional argument to avoid having to create a new matrix, cf. base::.colSums(X, m, n, na.rm = FALSE) - note the dot).
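
For reference, a quick sketch of that base::.colSums() interface (nothing beyond base R assumed): the dimensions are passed as separate arguments, so a plain vector can be reduced without first reshaping it into a matrix.

v <- runif(1e6)
.colSums(v, m = length(v), n = 1L)  # equals sum(v) (up to floating-point rounding); no matrix is created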

Also, before I forget, the functions in core R that are implemented in native code are quite optimized (e.g. sum, mean, var, cov, sd, ...), so there I don't think matrixStats will contribute much in the way of speed improvements (except when it comes to subsetted calculations and possibly low-level multi-core processing, which R does not do). But, still, yes, there is certainly room for some improvements.

Hope this clarifies our near-future plans.

HenrikBengtsson avatar Jun 03 '15 02:06 HenrikBengtsson

Thanks for the answer, it's very interesting. Regarding your last paragraph: even with raw vectors and without weights/selection, it seems like colRanges() is faster than range() for integers or doubles.

library(matrixStats)
library(microbenchmark)
N <- 1e7
v1 <- sample(1e6, N, TRUE)
v2 <- runif(N, max = 100)

f <- function(x) colRanges(x, dim = c(length(x), 1))
microbenchmark(f(v1), range(v1))
#      expr      min       lq     mean   median       uq      max neval
#     f(v1) 14.37535 14.72559 15.00326 14.95468 15.24036 16.02438   100
# range(v1) 45.49626 46.21895 50.83107 47.05837 48.29825 99.95897   100
microbenchmark(f(v2), range(v2))
#      expr      min       lq     mean   median       uq       max neval
#     f(v2) 15.71519 16.46514 16.72476 16.74529 17.00566  17.92737   100
# range(v2) 52.03146 53.96949 62.75381 54.92277 68.38117 192.66864   100

matthieugomez avatar Jun 03 '15 15:06 matthieugomez

That's interesting, it turns out that base::range() ends up calling min() and max() separately, cf.

> range.default
function (..., na.rm = FALSE, finite = FALSE)
{
    x <- c(..., recursive = TRUE)
    if (is.numeric(x)) {
        if (finite)
            x <- x[is.finite(x)]
        else if (na.rm)
            x <- x[!is.na(x)]
        return(c(min(x), max(x)))
    }
    c(min(x, na.rm = na.rm), max(x, na.rm = na.rm))
}

which means that it scans through the data twice. Anecdotally, colMins() and colMaxs() used to rely on colRanges(), resulting in those two scanning through the data twice although they only needed to do so once; they no longer do that.

So, with base::min() we get that:

library(matrixStats)
library(microbenchmark)
N <- 1e7
v1 <-  sample(1e6, N, TRUE)
v2 <-  runif(N, max=100)
> f <- function(x){ colMins(x, dim = c(length(x), 1)) }
> microbenchmark(f(v1), min(v1))
Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval cld
   f(v1) 14.29829 15.04548 15.71372 15.52648 16.07003 22.30068   100   b
 min(v1) 13.49451 13.72471 14.29720 14.17683 14.61010 17.72860   100  a
> microbenchmark(f(v2), min(v2))
Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval cld
   f(v2) 32.47035 32.82162 34.59162 33.76745 35.06473 50.29672   100   b
 min(v2) 26.28532 27.45038 29.21760 28.31844 29.33934 41.77619   100  a

That's what I mean by lots of the functions in core R being quite optimized, but as you've spotted there are cases that are not (unintentionally, or intentionally as here, to allow range() to dispatch).

BTW, another defense of base::range() is that it supports S3 dispatching, which adds some overhead that is relatively large for short vectors. A recent design decision for matrixStats is to leave dispatching to the user, i.e. to not do dispatching ourselves, allowing for minimal overhead. There are still some tweaks we can do to scrape off a bit more of the calling overhead, which somewhat explains why colMins() is a teeny bit slower than min() above. However, I'll leave that for the future, after the bigger speed improvements have been made.
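
To get a feel for that dispatch overhead, a minimal sketch (assuming the microbenchmark package; range.default() is visible in base R, so it can be called directly):

library(microbenchmark)
x <- runif(10)
microbenchmark(
  range(x),          # generic: S3 dispatch + method
  range.default(x),  # method called directly, no dispatch
  times = 1000
)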

HenrikBengtsson avatar Jun 03 '15 18:06 HenrikBengtsson