BiocParallel
Add bpexport functionality
Add bpexport to make local variables available to remote computation. (From the mailing list.)
I'm taking a stab at this here: https://github.com/DarwinAwardWinner/BiocParallel/tree/bpexport
So far I've added stubs for all the params, and I've added a clusterExport-based implementation for SnowParam. But thinking about it, that will only work if the cluster is running when clusterExport is called, so even that is not fully implemented.
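For reference, here is a minimal sketch of what a clusterExport-based export looks like with the base parallel package. It is not the code in the branch above; the variable name `offset` and the single-worker cluster are made up for illustration. It also shows why the cluster has to be running at export time.

```r
library(parallel)

# Start a small socket cluster; the export below needs live workers.
cl <- makeCluster(1L)

offset <- 10  # hypothetical local variable needed on the workers

# Copy `offset` from the calling environment into each worker's global env.
clusterExport(cl, "offset", envir = environment())

# The worker function looks up `offset` on the worker side.
res <- parLapply(cl, 1:3, function(i) i + offset)

stopCluster(cl)
```

If clusterExport were called before makeCluster, there would be no workers to send the value to, which is the problem described above.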
One issue to consider is, what if we call bpexport on a SerialParam or a MulticoreParam? They already have access to all the parent's variables, including any changes to those variables' values that occur after the call to bpexport. Should we make an attempt to have these params match the behavior of e.g. SnowParam by storing a snapshot of the variables when bpexport is called and then using that snapshot in place of the current value when the param is used?
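To make the question concrete, here is a small sketch of the difference between using the live value and using a snapshot taken at export time. `snapshot_vars` is a hypothetical helper, not part of BiocParallel, and plain function calls stand in for a serial or forked run.

```r
# Hypothetical helper: capture the current values of the named variables.
snapshot_vars <- function(names, envir = parent.frame()) {
  force(envir)
  mget(names, envir = envir)
}

x <- 1
snap <- snapshot_vars("x")  # value of x at "bpexport" time
x <- 2                      # later change in the parent process

f <- function(i) i + x

# Without a snapshot, a serial or forked run sees the current value of x.
live <- f(1)

# With the snapshot, run a copy of f whose environment holds the old value.
g <- f
environment(g) <- list2env(snap, parent = environment(f))
frozen <- g(1)
```

Here `live` uses x = 2 while `frozen` uses x = 1, which is what a SnowParam worker would have received at export time.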
Also, what should happen when you call bpexport on a stopped cluster? What should happen when you stop a cluster after exporting a variable?
I'd like to suggest creating a simple class/list storing objects exported via bpexport. As soon as bplapply/bpmapply is called the objects can then be put into the function's environment. Something like
exported = list(x = 12, y = rnorm(10))
mapply(assign, x = names(exported), value = exported, MoreArgs = list(envir = environment(FUN)))
You would just have to check whether environmentName(environment(FUN)) is "R_GlobalEnv", and in that case give the function a new environment with the GlobalEnv as parent.
I think it's probably a good idea to always give the function a new environment with the exported values and with the function's previous environment as parent. Are you suggesting this for the SerialParam and MulticoreParam classes?
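A rough sketch of the "always use a new environment" variant, assuming the exported objects are kept in a named list. `with_exports` is a hypothetical name, not an existing BiocParallel function.

```r
# Hypothetical helper: return a copy of FUN whose environment holds the
# exported values and has FUN's previous environment as its parent.
with_exports <- function(FUN, exported) {
  env <- list2env(exported, parent = environment(FUN))
  environment(FUN) <- env
  FUN
}

exported <- list(x = 12, y = c(1, 2, 3))
FUN <- function(i) i * x + sum(y)

# The original FUN is left untouched; only the copy sees the exports.
FUN2 <- with_exports(FUN, exported)
result <- lapply(1:2, FUN2)
```

Because the new environment is a child of the old one, no special case is needed for functions defined in the GlobalEnv. Primitive functions such as `sum` have no environment, so a real implementation would have to handle those separately.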
Yes, Serial and Multicore. I also see no drawbacks for BatchJobs over its internal export mechanism. I don't know if this is applicable for DoPar. You could pass them to .export in foreach, but I was unable to find a way to turn the heuristic auto-export off.
One more thing to consider is the expected behavior if a variable is explicitly exported and also defined in the function's environment. Variables in the function's env have precedence in the lookup, which deviates from the lookup using parallel/clusterExport (which assigns to GlobalEnv on the workers).
Well, I think the goal would be in all cases to keep the behavior consistent across all param classes. So to answer what happens when you export a variable and the same variable is defined in the function's environment, we ask what happens naturally in the case of SnowParam where you use clusterExport to implement bpexport, and then make sure we do the same thing for the other params, right? I actually don't know how (or if) function environments get transferred between processes by snow and others.
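The two lookup orders can be compared without a cluster. The sketch below uses plain environments as stand-ins: `worker_global` plays the role of a worker's global environment after clusterExport, and `closure_env` plays the role of the function's own environment. As far as I know, R serialization carries a closure's environment along with the function unless it is the global environment or a package namespace, so this layout should resemble what a worker sees, but treat that as an assumption.

```r
# Stand-in for a worker's global environment holding the exported value.
worker_global <- new.env()
assign("x", "exported", envir = worker_global)

# The function's own environment also defines x.
closure_env <- new.env(parent = worker_global)
assign("x", "closure", envir = closure_env)

f <- function() x
environment(f) <- closure_env

# clusterExport-like layout: the function's own variable is found first.
from_global_export <- f()

# Child-environment layout: exported values sit in front of the closure.
g <- f
environment(g) <- list2env(list(x = "exported"), parent = environment(f))
from_child_env <- g()
```

The first call returns the closure's value and the second returns the exported one, so the two approaches disagree exactly in the case under discussion.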
Actually, to be honest, I'm probably not the best person to implement this, because the vast majority of the time I want to do parallel stuff in R, I use multicore, so I never have to worry about exporting variables and I have no real idea how to do it.
Thinking about it, we should probably take this same "just-in-time export" approach for SnowParam as well. This will solve the problem of the cluster not running when bpexport is called.
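A sketch of what just-in-time export could look like on top of the base parallel package, assuming the exports are simply recorded in a named list until the work is submitted. `lapply_with_exports` and `mult` are hypothetical names used only for this example.

```r
library(parallel)

# Hypothetical helper: send the stored values to the workers right before
# running, so nothing has to be sent when the export is requested.
lapply_with_exports <- function(cl, X, FUN, exported = list()) {
  if (length(exported) > 0) {
    clusterExport(cl, names(exported), envir = list2env(exported))
  }
  parLapply(cl, X, FUN)
}

exported <- list(mult = 3)  # recorded while no cluster is running

cl <- makeCluster(1L)       # cluster started later
res <- lapply_with_exports(cl, 1:3, function(i) i * mult, exported)
stopCluster(cl)
```

Since the values travel with each call, stopping and restarting the cluster between calls does not lose them.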
Ok, I am finding myself using BatchJobsParam a lot and wanting export functionality, so I will try to work on this some time soon.