
How to minimise retransmission of objects when I use SSH connections?

Open seonghobae opened this issue 6 years ago • 1 comment

Hello, I’m using the future.apply functions over SSH connections with a cluster plan. However, future_lapply() retransmits the data frame and other objects on every mapping call, even though I use the same datasets and objects and only change the estimation parameters to find the optimal condition.

Here is the pseudo-code:

data <- mirt::Science
nFactors <- 1:4
future::plan('cluster', workers = paste0('s', 1:2))  ## two SSH workers: 's1' and 's2'
future.apply::future_lapply(X = nFactors, FUN = function(X, data) {
  mirt::mirt(data = data, model = X)
}, data = data)

After running this code, watch the network traffic: the data is retransmitted on every call, even though I don’t change the data between parameter estimations.

How can I reduce the data retransmission? This makes it hard to run HPC computations on some VPS providers, because they impose QoS limits on every calculation I run.

Best, Seongho

seonghobae avatar Sep 02 '19 09:09 seonghobae

Short answer: The future framework does not really support "life-time" global variables, which stems from the design that futures are meant to be independent of each other.
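As a minimal sketch of that independence (my own illustration, assuming a local two-worker multisession plan rather than your SSH cluster): each future detects the globals its expression needs and ships its own copy of them, so the same object travels to a worker once per future.

library(future)
plan(multisession, workers = 2)

data <- data.frame(a = 1:3, b = 4:6)

## 'data' is identified as a global of each expression below and is
## therefore serialized and sent along with each future separately
f1 <- future(sum(data$a))
f2 <- future(sum(data$b))
value(f1)
value(f2)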

Long answer: However, one could imagine that parallel backends that serve multiple futures, such as PSOCK clusters, could hold "life-time" globals. We can actually already do things such as:

library(future)
cl <- future::makeClusterPSOCK(2)
data <- data.frame(a = 1:3, b = 4:6)
parallel::clusterExport(cl, "data")   ## push 'data' to each worker once, up front
plan(cluster, workers = cl, persistent = TRUE)

to export the data object to each parallel worker up front. So far so good. However, if we were to just do:

## sequential reference result
y0 <- lapply(colnames(data), FUN = function(name) {
  sum(data[[name]])
})

## the same in parallel; 'data' is detected as a global of FUN
y1 <- future.apply::future_lapply(colnames(data), FUN = function(name) {
  sum(data[[name]])
})

stopifnot(identical(y1, y0))

we would still export data in each iteration (and overwrite the copy we exported manually). To avoid this, one can tell the future framework to ignore data even when it finds it to be a global variable:

y2 <- future.apply::future_lapply(colnames(data), FUN = function(name) {
  sum(data[[name]])
}, future.globals = structure(TRUE, ignore = "data"))

stopifnot(identical(y2, y0))
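As a rough gauge (my own back-of-the-envelope sketch, not something the future framework reports), the serialized size of data approximates how many bytes travel to a worker each time it is exported as a global:

## approximate per-export transfer cost of 'data' (bytes of its serialization)
sz <- length(serialize(data, connection = NULL))
cat(sprintf("approx. %.1f kB shipped for 'data' per export\n", sz / 1024))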

WARNING: Now, the above is really hacky and should not be used in production. It has two main problems, which go against the philosophy of how futures should be used:

  1. It relies on persistent = TRUE for the cluster backend. I don't recommend using that, because it is unsafe and risks messing up reproducibility.

  2. The use of future.globals = structure(TRUE, ignore = "data") relies on the data object already existing on the parallel worker. If you switch to, say, plan(future.callr::callr), your code will break (see the sketch below). So that is also not recommended; ignore should really only be used to drop false-positive globals, not the way it is used here.
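To make point 2 concrete, here is a not-run sketch (my own illustration): the callr backend starts a fresh R process for each future, so the manually exported data is simply not there and the hack falls apart.

## Not run: with future.callr, each future runs in a brand-new R session,
## so the 'data' we pushed with clusterExport() does not exist there and
## evaluation fails with an "object 'data' not found"-style error.
# plan(future.callr::callr)
# y3 <- future.apply::future_lapply(colnames(data), FUN = function(name) {
#   sum(data[[name]])
# }, future.globals = structure(TRUE, ignore = "data"))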

So, the take-home message is that, unfortunately, the future framework does not support what you're asking for as it stands now. However, it might be that we can introduce the concept of "life-time globals", or "worker globals", that one sets up once; if a regular global variable to be exported matches an existing "worker global", the export will be skipped. That said, I doubt such a feature will be implemented in the future framework anytime soon.

HenrikBengtsson avatar Nov 02 '19 04:11 HenrikBengtsson