remotePARTS
remotePARTS copied to clipboard
fitAR_map needs parallel version
Problem
With large datasets, fitAR_map()
(and perhaps fitCLS_map()
) is exceptionally slow - it can take longer than the partitioned GLS, since it only runs in serial (1 thread). This is clearly an oversight.
Solution
This functionality lends itself to parallelization well. AR models are fit to each pixel independently. So, it makes sense to implement a parallel version of fitAR_map()
via an argument ncores
.
Early on in the development of remotePARTS
, I gave examples about how to apply fitAR()
to a full map with the raster
package. But, since this package doesn't officially support rasters, an implementation that can handle matrices (and dataframes) is important.
I've been successful with a script containing the following code that seems to work well:
iterator-function.R:
iblkrow <- function(a, chunks) {
n <- nrow(a)
i <- 1
nextElem <- function() {
if (chunks <= 0 || n <= 0) stop('StopIteration')
m <- ceiling(n / chunks)
r <- seq(i, length=m)
i <<- i + m
n <<- n - m
chunks <<- chunks - 1
a[r,, drop=FALSE]
}
structure(list(nextElem=nextElem), class=c('iblkrow', 'iter'))
}
nextElem.iblkrow <- function(obj) obj$nextElem()
parallel-AR.R:
# ... setup
library(doParallel)
library(foreach)
registerDoParallel(core_num)
source("iterator-function.R") # loads helper functions including the iterator iblkrow()
# loop through blocks of the data (using helper function)
AR_results = foreach(y = iblkrow(data_matrix, core_num),
crds = iblkrow(coord_matrix, core_num),
.packages = c("remotePARTS", "foreach"), .inorder = TRUE, .combine = rbind) %dopar% {
# loop through each row of the chunks
out = foreach(i=seq_len(nrow(y)), .combine = rbind, .inorder = TRUE) %do% {
t = seq_len(ncol(y)) # time
AR = fitAR(as.formula(as.vector(y[i, ]) ~ t)) # AR fit
# return a collection of AR variables of interest
c(coef = unname(AR$coefficients["t"]),
pval = unname(AR$pval["t"]),
resids = AR$residuals)
}
rownames(out) = NULL # remove rownames "t"
out # return the combined results
}
@morrowcj , thank you very much for ,making the parallel version - this is very helpful! Is there something missing in the second row of the loop? crds = iblkrow(coord_matrix, ...?
You are correct, @PlekhanovaElena. I somehow lost an argument and the closing parenthesis. That line should read crds = iblkrow(coord_matrix, core_num),
. I've edited my comment above to reflect this.
And just in case it wasn't clear, core_num
is the total number of cores you'd like to use for the operation, data_matrix
is a matrix containing the spatial time series data (rows are locations, columns are time steps) and coord_matrix
is the coordinate matrix (rows are locations and columns are longitude and latitude, respectively.
The above code (or something similar) should eventually be implemented in the package. Hopefully I'll have time to get to this soon.