future.apply
Possible bug in loading packages
This was a bit hard to track down... Here goes.
data.table and bit both have setattr functions. The annoying thing about bit::setattr is that it returns NULL, whereas data.table::setattr invisibly returns the input object with the attribute modified. So, if you were to write code like:
y <- setattr(18004L, "class", "Date") # today's date
# [1] "2019-04-18"
(I'm not saying this is how one should go about it, but there are other cases where we need to set an attribute and assign the result to an object.)
In this case, depending on which setattr we get, it'll return either the expected result or NULL.
With this in mind, consider this code:
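The difference is easy to demonstrate in a fresh session (a minimal sketch; the Date example mirrors the one above, and the return-value behavior is as described):

```r
library(bit)         # bit::setattr() returns NULL
library(data.table)  # data.table::setattr() now masks bit::setattr()

y <- data.table::setattr(18004L, "class", "Date")  # invisibly returns its input
y
# [1] "2019-04-18"

z <- bit::setattr(18004L, "class", "Date")         # returns NULL
is.null(z)
# [1] TRUE
```

So the same unqualified call, setattr(x, "class", "Date"), silently switches between these two behaviors depending on which package sits earlier on the search() path.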
require(parallel)
require(doSNOW)
require(foreach)
require(future)
require(future.apply)
require(data.table)
foo <- function(dt) {
  # print which namespace setattr() resolves to on this worker
  cat(sprintf("[%s] %s", Sys.getpid(), capture.output(environment(setattr))), sep="\n")
  setattr(dt, "key", "val")
}
nodes <- 5L
cl <- future::makeClusterPSOCK(nodes)
plan(cluster, workers=cl, persistent=TRUE)
dt <- data.table(x=1, y=2)
ans <- values(
lapply(seq_len(nodes), function(node) {
future({foo(dt)}, packages=c("bit64", "data.table"))
})
)
# [15468] <environment: namespace:data.table>
# [9808] <environment: namespace:data.table>
# [15016] <environment: namespace:data.table>
# [23496] <environment: namespace:data.table>
# [22312] <environment: namespace:data.table>
I've created a data.table in the local environment (just one, for simplicity) and am calling a function that sets an attribute in parallel. Of course this function is terribly simplified as well.
Now, the way this works (as expected) is to load bit64 first and data.table next, then look for functions used in foo, possibly load more packages, and then run foo() (AFAICT).
And this works fine, as you can see from the output. If you were to check ans, you'd get the data.tables with their attributes set.
Restart the session (IMPORTANT). Now, with everything else remaining intact, if instead of running values(...) I use future_lapply:
> ans <- future_lapply(cl, function(node) foo(dt), future.packages=c("bit64", "data.table"))
# [11976] <environment: namespace:bit>
# [23212] <environment: namespace:bit>
# [11732] <environment: namespace:bit>
# [22964] <environment: namespace:bit>
# [5748] <environment: namespace:bit>
> ans
# [[1]]
# NULL
#
# [[2]]
# NULL
#
# [[3]]
# NULL
#
# [[4]]
# NULL
#
# [[5]]
# NULL
Note how setattr now refers to the bit package. I think this is because the packages get loaded after assessing which functions foo uses: data.table from the local env has a function called setattr and therefore gets loaded first, followed by all the packages in future.packages, so bit (pulled in via bit64) ends up masking data.table::setattr?
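Until the ordering itself is fixed, one way to sidestep the masking (a workaround sketch, not a fix in future.apply) is to namespace-qualify the call, so the attach order on the worker no longer matters:

```r
foo <- function(dt) {
  # Explicit namespace: always resolves to data.table's setattr,
  # regardless of what is attached, or in what order, on the worker
  data.table::setattr(dt, "key", "val")
}
```

With this version, both the plain-future and the future_lapply variants should return the modified data.tables rather than NULLs.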
Thanks, and sorry for the slow reply.
Yes, I can see how this can happen for PSOCK clusters, e.g. plan(cluster, ...) and plan(multisession). The problem is, as you suggest, that the search() path is different in the worker compared to the main R session (e.g. with plan(sequential)). What complicates it is that sometimes the search() path gets updated automatically by some other parallel/future call, i.e. it's not always obvious how, why, and when.
To really fix this for this parallel backend, one would need to come up with a way for each future to re-arrange the search() path on the worker(s) to be the same as that of the main R process. That's an issue to be solved in the future package.
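To see what each worker actually ended up with, one can ship search() through the same machinery (a diagnostic sketch reusing the cluster setup from the example above):

```r
# Ask each worker for its search() path; whichever of package:bit and
# package:data.table appears earlier wins the unqualified setattr() lookup.
paths <- values(
  lapply(seq_len(nodes), function(node) {
    future(search(), packages = c("bit64", "data.table"))
  })
)
```

Comparing these paths against search() in the main R session makes the discrepancy (and the attach order that caused it) directly visible.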
What I think you're also reporting here is an inconsistency between future.apply and rolling your own version via plain futures. If so, I'm not sure why there is a difference, but it could be that packages to be attached are also automatically identified and have higher precedence than those you specify manually. Maybe this inconsistency can/should be fixed.
PS. You want to use library(), not require(). And I don't think your example needs to attach any of parallel, doSNOW, and foreach.