xts icon indicating copy to clipboard operation
xts copied to clipboard

na.fill long call times

Open ckatsulis opened this issue 6 years ago • 9 comments

na.fill() takes much longer than other methods to fill NA values. I expected near constant time to find and fill NA values.

Reproducible timings are below. I will send testData via email independently as it is too large for this forum.

testData[is.na(testData)] = 0
Unit: milliseconds
                     expr      min       lq     mean   median       uq      max neval
testData[is.na(testData)] 26.52425 35.31587 37.19453 37.23217 38.80555 46.61541    10

> microbenchmark::microbenchmark(na.fill(testData,0),times = 1)
Unit: seconds
                 expr      min       lq     mean   median       uq      max neval
 na.fill(testData, 0) 150.9054 150.9054 150.9054 150.9054 150.9054 150.9054     1

Unit: seconds
                           expr      min       lq     mean   median       uq      max neval
 na.fill(coredata(testData), 0) 52.90975 52.90975 52.90975 52.90975 52.90975 52.90975     1

Session Info

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8         
 [8] LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] roll_1.0.7                 PerformanceAnalytics_1.5.2 xts_0.10-2.2               zoo_1.8-2                  TTR_0.23-3                 dygraphs_1.1.1.5          

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17         magrittr_1.5         xtable_1.8-2         lattice_0.20-35      R6_2.2.2             quadprog_1.5-5       xlsx_0.6.1           tools_3.5.1          grid_3.5.1           data.table_1.11.4    qmao_1.6.11         
[12] R.oo_1.22.0          htmltools_0.3.6      yaml_2.1.19          RcppParallel_4.4.0   digest_0.6.15        RJSONIO_1.3-0        shiny_1.0.5          rJava_0.9-10         later_0.7.3          microbenchmark_1.4-4 promises_1.0.1      
[23] bitops_1.0-6         htmlwidgets_0.9      R.utils_2.6.0        xlsxjars_0.6.1       RCurl_1.95-4.10      curl_3.2             mime_0.5             pander_0.6.2         compiler_3.5.1       R.methodsS3_1.7.1    jsonlite_1.5        
[34] httpuv_1.4.4.2      

ckatsulis avatar Jul 25 '18 16:07 ckatsulis

Note that there is no na.fill.xts() method, so zoo:::na.fill.zoo() is dispatched. We can probably add an xts method, and only dispatch to the zoo method if fill is not a scalar.

joshuaulrich avatar Jul 25 '18 18:07 joshuaulrich

That would be consistent, just curious why the zoo method is so much slower than simply specifying the condition in brackets.

On Wed, Jul 25, 2018 at 1:43 PM, Joshua Ulrich [email protected] wrote:

Note that there is no na.fill.xts() method, so zoo:::na.fill.zoo() is dispatched. We can probably add an xts method, and only dispatch to the zoo method if fill is not a scalar.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/joshuaulrich/xts/issues/259#issuecomment-407855781, or mute the thread https://github.com/notifications/unsubscribe-auth/AHKwqnDtyu2BaFf5gx1klzac3DjMVSF3ks5uKLxXgaJpZM4Vgcoi .

ckatsulis avatar Jul 25 '18 18:07 ckatsulis

@ckatsulis it looks like a lot of the time is spent in creating symbol names from the ... arguments passed to merge.xts(). This is related to #248.

joshuaulrich avatar Aug 11 '18 12:08 joshuaulrich

Hi @ckatsulis, the fix for #248 improves performance for this case fairly dramatically. Though it's still not nearly as fast as x[is.na(x)] <- 0. Is the new performance sufficient to close this issue?

x <- .xts(1:1e7, 1:1e7)
is.na(x) <- sample(1e7, 1e6) 
microbenchmark::microbenchmark(na.fill(x, 0), times = 1) 
# Unit: seconds
#           expr      min       lq     mean   median       uq      max neval
#  na.fill(x, 0) 1.944266 1.944266 1.944266 1.944266 1.944266 1.944266     1
y <- coredata(x)
microbenchmark::microbenchmark(na.fill(y, 0), times = 1) 
# Unit: seconds
#           expr      min       lq     mean   median       uq      max neval
#  na.fill(y, 0) 6.566733 6.566733 6.566733 6.566733 6.566733 6.566733     1
microbenchmark::microbenchmark(X[is.na(X)] <- 0, times = 1, setup = X <- x)
# Unit: milliseconds
#              expr      min       lq     mean   median       uq      max neval
#  X[is.na(X)] <- 0 176.9659 176.9659 176.9659 176.9659 176.9659 176.9659     1

joshuaulrich avatar Nov 03 '18 11:11 joshuaulrich

Performance using Rcpp ifelse()

library(xts)
Rcpp::cppFunction("
NumericMatrix na_fill(const NumericMatrix x, const double fill) {
  NumericMatrix r = clone(x);
  for (int j=0; j < r.ncol(); j++) {
    r(_,j) = ifelse( is_na(r(_,j)), fill, r(_,j) );
  }
  return r;
}")
f1 = function(x, fill) {
  x[is.na(x)] <- fill
  x
}
x <- .xts(1:1e7, 1:1e7)
is.na(x) <- sample(1e7, 1e6) 
microbenchmark::microbenchmark(na.fill(x, 0), 
                               na_fill(x, 0), 
                               f1(x, 0),
                               times=100)
Unit: milliseconds
          expr        min         lq       mean     median        uq       max neval
 na.fill(x, 0) 1082.35488 1146.39916 1178.56886 1175.79914 1203.5146 1309.6379   100
 na_fill(x, 0)   71.08533   76.49017   97.88448   89.81974  110.8743  166.9555   100
      f1(x, 0)   69.06708   78.01604   98.97025   93.17469  105.8702  177.6743   100

harvey131 avatar Dec 12 '18 17:12 harvey131

Two orders of magnitude better is more than enough for me. Is na_fill exported in the new xts?

On Wed, Dec 12, 2018 at 11:29 AM harvey131 [email protected] wrote:

Performance using Rcpp ifelse()

library(xts) Rcpp::cppFunction(" NumericMatrix na_fill(const NumericMatrix x, const double fill) { NumericMatrix r = clone(x); for (int j=0; j < r.ncol(); j++) { r(,j) = ifelse( is_na(r(,j)), fill, r(_,j) ); } return r; }") f1 = function(x, fill) { x[is.na(x)] <- fill x } x <- .xts(1:1e7, 1:1e7)is.na(x) <- sample(1e7, 1e6) microbenchmark::microbenchmark(na.fill(x, 0), na_fill(x, 0), f1(x, 0), times=100) Unit: milliseconds expr min lq mean median uq max neval na.fill(x, 0) 1082.35488 1146.39916 1178.56886 1175.79914 1203.5146 1309.6379 100 na_fill(x, 0) 71.08533 76.49017 97.88448 89.81974 110.8743 166.9555 100 f1(x, 0) 69.06708 78.01604 98.97025 93.17469 105.8702 177.6743 100

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/joshuaulrich/xts/issues/259#issuecomment-446673018, or mute the thread https://github.com/notifications/unsubscribe-auth/AHKwqp_LxDNh7N6dHScaIaDp8N7vtM8-ks5u4T0MgaJpZM4Vgcoi .

ckatsulis avatar Dec 12 '18 19:12 ckatsulis

@ckatsulis the code @harvey131 posted isn't in xts, let alone exported. I'm not likely to add a dependency on compiled code unless there's significant performance improvement over a native R solution. In this case, the Rcpp code is roughly the same as x[is.na(x)] <- 0.

I was also hoping to make the change upstream in zoo, but time is a constraint, as always.

joshuaulrich avatar Dec 13 '18 13:12 joshuaulrich

Agreed. No need if the existing call is the same speed for all intents and purposes.

Chris Sent from my iPhone

On Dec 13, 2018, at 07:23, Joshua Ulrich [email protected] wrote:

@ckatsulis the code @harvey131 posted isn't in xts, let alone exported. I'm not likely to add a dependency on compiled code unless there's significant performance improvement over a native R solution. In this case, the Rcpp code is roughly the same as x[is.na(x)] <- 0.

I was also hoping to make the change upstream in zoo, but time is a constraint, as always.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

ckatsulis avatar Dec 13 '18 13:12 ckatsulis

Idea: zoo::na.fill0() is significantly faster in this case. Consider updating zoo:::na.fill.zoo() to call na.fill0() in this type of special case, or create a na.fill.xts() method that does it. I'd prefer to keep it all in zoo, so it's in one place.

joshuaulrich avatar Oct 12 '22 22:10 joshuaulrich