Using the future package to aid in determining symbols and packages to export

Open richcalaway opened this issue 5 years ago • 8 comments

Background

Original implementation

In April 2018, in an unreleased version 1.4.6 of foreach, Rich Calaway introduced a new feature that used Henrik Bengtsson's future package (if available) to determine the appropriate symbols and packages to export to the foreach workers. In the initial implementation, future was used if it was installed and its namespace could be loaded via requireNamespace. Henrik responded with the following comments:

COMMENT #1: Surprised users

When future::getGlobalsAndPackages() will be used or not might surprise users. Your proposal to use it when

requireNamespace("future", quietly=TRUE) == TRUE

risks confusing the end users and whoever helps troubleshooting issues related to globals. They might get one result one day, and another result the next day, just because the 'future' package happened to be installed in between. The same if they run on different systems. If someone runs into issues related to globals, the troubleshooting counter questions have to involve: "Do you have the 'future' package installed?" (unless they share a proper sessionInfo() that is).

One slightly less confusing condition would be to condition it on:

("future" %in% loadedNamespaces())

I've seen that style used in some packages. The user has to enable the "feature" by making sure a package is loaded. However, that is still not transparent to the end user since the 'future' package might be loaded as a side effect by some other package. For instance, it may be loaded when using registerDoSeq(), but then not when they retry with registerDoParallel().

It might be better to let the user explicitly control this via an R option, e.g.

useFuture <- getOption("foreach.globalsAs", default = "future")

or default = "foreach". A middle ground, during your migration, is to use something like:

useFuture <- getOption("foreach.globalsAs", default = NULL)
if (is.null(useFuture)) {
useFuture <- requireNamespace("future", quietly=TRUE)
if (useFuture) {
warning('foreach() will identify globals and packages using the
future package because that package is installed. To suppress this
warning set options(foreach.globalsAs = "future"). To use the
traditional approach of foreach for identifying globals, use
options(foreach.globalsAs = "foreach").')
} else {
warning('foreach() will identify globals and packages using the
foreach package because the alternative based on the future package
requires that future is installed. To suppress this warning set
options(foreach.globalsAs = "foreach"). To use future for identifying
globals, install that package.')
}
}

COMMENT #2: Maximum total size of globals

future::getGlobalsAndPackages() has a built-in protection for exporting too large amounts of globals. It's currently set to 500 MiB, and currently only controlled via option 'future.globals.maxSize'. To minimize the surprise here, you might wanna temporarily set this to +Inf to disable this check/assertion, e.g.

gp <- local({
oopts <- options(future.globals.maxSize = +Inf)
on.exit(options(oopts))
future::getGlobalsAndPackages(ex, envir = env)
})

FYI, in the next release of the future package, you'll be able to do future::getGlobalsAndPackages(..., maxSize = +Inf).

COMMENT #3: future+globals vs foreach (<= 1.4.5)

I've identified one use case where some people might say that the 'globals' package is too conservative. The case is when a variable is global or local conditionally on some other variable/state, e.g.

{
if (runif(1) < 1/2) y <- 0
y
}

In future+globals, the above will NOT pick up 'y' as a global variable (*), whereas in foreach (<= 1.4.5), it is identified as a global variable (because you are using a more liberal approach). Unfortunately, I don't this there is an easy solution to handle the above ambiguous case, where a variable is global or local depending on the run-time state, without adding significant processing overhead. OTH, I think this is an oversight by the developer (or at least a deliberate hack) and I don't mind "forcing" code to break in these ambiguous cases since it's quite easy to make it non-ambiguous. So far I've only identified one case of this in the wild:

example("avNNet", package = "caret", run.dontrun = TRUE)

so I think you shouldn't expect much reports on that.

(*) The main reason for future+globals failing to find 'y' here is because it is a side-effect of its "ordered" code inspection for the purpose of identifying 'x' as a global in { x <- x + 1 }. I'm tracking this over at https://github.com/HenrikBengtsson/globals/issues/31

First revised implementation

Rich revised the implementation in foreach 1.5.0/1.5.1, adding a check for a global option foreachGlobals to see if the user preferred the original foreach functionality (which found some global symbols but did not search for additional package exports), and expanding the symbol search to include the union of those found by future's getGlobalsAndPackages function and those found by foreach's original method.

This revision addressed issue #3 and part of #1, but did not address the size of globals problem.

Henrik responded with the following additional comments (and Rich responded with >RBC inline comments):

DOCUMENTATION: A user who sets options(foreachGlobals = "foreach") may wonder how to undo that. From the code, it looks like one can do this by unsetting the option, i.e. options(foreachGlobals = NULL). It would help to document that too.

RBC --Agreed

DOCUMENTATION: It says "Beginning with foreach 0.5.0, foreach will use the future package, if available, to automatically detect needed packages." Without seeing the code, the term "available" is a bit ambiguous to the user. Maybe write "..., if it is installed and can be loaded, ..." instead. And clarify further by "If the future package is not installed, then the identification of globals will be done as if options(foreachGlobals = "foreach") was set."

RBC I'll work on this.

SURPRISE FACTOR: The silent, conditional requireNamespace("future") behavior may still be confusing, surprising, ... and makes it hard to troubleshoot. It's likely that there'll be comments like "weird, it works for me" kind of discussions.

RBC -- Agreed, but that's part of the point--I want to get the benefit of future as seamlessly as possible. RBC If users need to set an option, most users won't.

CLEANUP: To avoid loading 'future' when not needed, you might wanna swap the order to if (!identical(getOption("foreachGlobals"), "foreach") && requireNamespace("future", quietly=TRUE)){){ ... }

RBC: Good idea!

PREDICTABILITY: You could introduce options(foreachGlobals = "future+foreach"), which if set, will produce an error if the 'future' package is not installed. OTH, not sure what the current default should be named - maybe "future+foreach-or-foreach"?

doFuture: If you could support/standardize on options(foreachGlobals = "future+foreach") and possible also options(foreachGlobals = "future"), then I could rely on that in doFuture rather than adding another option.

RBC: Yes, I think I can do this.

ROBUSTNESS: I've also played around with the idea of something like options(foreachGlobals = "manual") which will disable the automatic identification of globals and rely solely on arguments .export and '.packages`.

RBC: Another good idea.

Henrik responded to Rich's responses with the following additional comments:

I still argue its a bad idea that the behavior is conditioned on whether you have a package installed or not. What is even worse is that the dependent package (here 'future') may be installed at a random time because it is installed together with some other package. All of a sudden the behavior changes and the user has no idea why because they did not change anything.

BTW, do you have a better place than an email thread where this can be discussed and be properly documented? There's a great risk that things are getting lost in nested email threads. I would love if you would bring foreach to GitHub where the issue tracker provides an excellent communication channel. I believe there's valueable feedback from other developers that would reach you if you'd be on GitHub. Also, I have two suggestions that I think would improve foreach a lot- but I'll wait with those for now.

The second of these final comments is addressed with this issue; foreach has indeed been brought to GitHub and this issue is the first to be set into the issue tracker.

Current Development

Rich has handed off maintenance of foreach to Hong Ooi, but is leaving one final pass at the future integration as a branch richcala/foreachFutureTake3. This addresses most of the remaining issues--including the surprise factor. In this implementation, if the user is not using Microsoft R Open, the future option is added only if one of the foreachGlobals options "future+foreach", "foreach+future", or "future" is explicitly set. All of those act the same, however--the union of future and foreach previously implemented.

Aug 01 '19 16:08 richcalaway

foreach foreach copied to clipboard

Using the future package to aid in determining symbols and packages to export

Background

Original implementation

First revised implementation

Current Development

foreach
foreach copied to clipboard