Parallel processing for cache-wide methods
For cache-wide methods such as $clear() and especially $gc(), it would be handy to have some low-overhead mclapply()-powered parallel processing. I am sure @kendonB would appreciate this too.
```r
x <- storr_rds("my_storr")
# ... cache accrues lots of files ...
x$gc(workers = 8) # parallelize over 8 forked processes
```
You would need to force `workers` down to 1 on Windows, since `mclapply()` relies on forking, but I think it is still worth it. `parLapply()` is platform-independent, but I personally do not like the overhead of setting up a cluster.
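For reference, a minimal sketch of what the `parLapply()` route would look like (the names here are illustrative, not storr's API). The `makeCluster()`/`stopCluster()` setup and teardown is the overhead in question:

```r
library(parallel)

keys <- as.character(1:100)
cl <- makeCluster(2)               # spawns worker processes, even on Windows
res <- parLapply(cl, keys, nchar)  # apply over the keys in parallel
stopCluster(cl)
```

For short-lived cache operations, the cluster startup cost can easily swamp the gain, which is why forking via `mclapply()` is more appealing where it is available.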
In drake, I use an internal lightly_parallelize() function quite a lot.
```r
library(parallel)  # for mclapply()
library(magrittr)  # for %>%

lightly_parallelize <- function(X, FUN, jobs = 1, ...) {
  jobs <- safe_jobs(jobs)
  if (is.atomic(X)) {
    lightly_parallelize_atomic(X = X, FUN = FUN, jobs = jobs, ...)
  } else {
    mclapply(X = X, FUN = FUN, mc.cores = jobs, ...)
  }
}

lightly_parallelize_atomic <- function(X, FUN, jobs = 1, ...) {
  jobs <- safe_jobs(jobs)
  keys <- unique(X)
  index <- match(X, keys)
  values <- mclapply(X = keys, FUN = FUN, mc.cores = jobs, ...)
  values[index]  # map results back onto the original (possibly duplicated) input
}

safe_jobs <- function(jobs) {
  ifelse(on_windows(), 1, jobs)
}

on_windows <- function() {
  this_os() == "windows"
}

this_os <- function() {
  Sys.info()["sysname"] %>%
    tolower %>%
    unname
}
```
I forgot: $list() is an important one too.
I believe that disk I/O is the bottleneck for most of these, and I'd be shocked if process-level parallelism could speed that up.
The GPFS file system I'm on seems to show speed benefits for I/O-heavy jobs up to around 100 workers.
And I thought even personal hard drives have more than one read/write point?
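Whether parallel I/O helps is easy enough to test empirically on a given file system. A rough sketch (paths are throwaway temp files; `mc.cores > 1` requires a non-Windows platform):

```r
library(parallel)

# Create a pile of small files, roughly like an rds-backed cache directory.
files <- replicate(200, tempfile())
invisible(lapply(files, function(f) writeLines("x", f)))

# Time serial vs forked reads; compare the two on GPFS vs a laptop disk.
t_serial <- system.time(lapply(files, readLines))["elapsed"]
t_forked <- system.time(mclapply(files, readLines, mc.cores = 4))["elapsed"]
c(serial = t_serial, forked = t_forked)

unlink(files)
```

If the forked timing is no better on a personal machine, that would support the skepticism above; a big win on GPFS would support the ~100-worker observation.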