Trouble calling future within r_session$new() but not r_bg()

Open wlandau opened this issue 2 years ago • 9 comments

This question is related to https://github.com/HenrikBengtsson/future/discussions/607, but I am not sure if the solution lies with my usage of future or my usage of callr.

Are there differences in environment variables etc. between r_session$new() and r_bg()? Is there anything I can do to configure the R session/environment of the former to be more like the latter?

I am developing a package that requires submitting a future inside a separate local process, and because of some implementation details, I would prefer to use r_session$new() rather than r_bg(). But when I try the former, I get an error:

fun <- function() {
  plan <- future::tweak(
    future.batchtools::batchtools_sge,
    template = "sge.tmpl"
  )
  future::plan(plan) # Runs on my company's SGE cluster
  future::future("x")
}
px <- callr::r_session$new()
px$call(func = fun, args = list())
out <- px$read()
cat(out$error$message)
#> callr subprocess failed: Fatal error occurred: 101.
#> Command 'qsub' produced exit code 1.
#> Output: 'Unable to run job: got no response from JSV script
#> "/opt/uge/uge-8.6.6/util/resources/jsv/verify_job".
#> Exiting.

This happens both in the RStudio IDE and in a terminal. In both cases, it works with callr::r_bg().

px <- callr::r_bg(func = fun, args = list())
px$wait() # get_result() errors if the process is still running
px$get_result()

Here is my sge.tmpl file:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -o <%= log.file %>
#$ -V
#$ -N <%= job.name %>
module load R/4.1.2
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0

wlandau, Apr 15 '22 13:04

By the way, for the package I mentioned, I plan to implement task queues for cloud workers, and I am gradually working up to the challenge through callr and future/future.batchtools. @gaborcsardi, your task queue blog post from 2019 was extremely helpful, and motivated the design I am using for all the queues. I credit you in the NOTICE and comments, and I will add a note to the README as well.

wlandau, Apr 15 '22 13:04

Are there differences in environment variables etc. between r_session$new() and r_bg()?

Maybe, but you can print/save the environment variables in both and compare.
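One way to do that comparison is sketched below. It assumes only that callr is installed; the variable names (`env_bg`, `env_rs`, `differ`, and so on) are illustrative and not part of callr. It collects `Sys.getenv()` from an r_bg() process and from an r_session, then reports names that exist on only one side and shared names whose values differ.

```r
library(callr)

# Environment variables as seen from a one-off background process (r_bg).
bg <- r_bg(function() Sys.getenv())
bg$wait()
env_bg <- unclass(bg$get_result())

# Environment variables as seen from a persistent session (r_session).
rs <- r_session$new()
env_rs <- unclass(rs$run(function() Sys.getenv()))
rs$close()

# Names present on only one side, and shared names whose values differ.
only_bg <- setdiff(names(env_bg), names(env_rs))
only_rs <- setdiff(names(env_rs), names(env_bg))
shared <- intersect(names(env_bg), names(env_rs))
differ <- shared[env_bg[shared] != env_rs[shared]]

print(list(only_bg = only_bg, only_rs = only_rs, differ = differ))
```

Any variable listed in `only_bg`, `only_rs`, or `differ` is a candidate to set explicitly in the session, for example through the `env` entry of `r_session_options()`. Identical environments would point toward something other than environment variables, such as how the child process itself is started.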

Unfortunately, it is challenging for me to reproduce this, so I am afraid there isn't much I can do.

gaborcsardi, Apr 15 '22 15:04

That's okay, I understand.

wlandau, Apr 15 '22 17:04

@wlandau Do you think it is possible to create a self-contained Docker container that reproduces this?

gaborcsardi, Apr 17 '22 12:04

I can’t promise I will have enough time, but it is a good idea.

wlandau, Apr 19 '22 02:04

I did figure out how to reproduce this without future or batchtools. With this job.sh script:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -V
#$ -N test
sleep 5

This code reproduces the JSV script error:

fun <- function() system2("qsub", "job.sh")
px <- callr::r_session$new()
px$call(func = fun, args = list())
out <- px$read()
out
cat(out$error$message)
#> callr subprocess failed: Fatal error occurred: 101.
#> Command 'qsub' produced exit code 1.
#> Output: 'Unable to run job: got no response from JSV script
#> "/opt/uge/uge-8.6.6/util/resources/jsv/verify_job".
#> Exiting.

This job, by contrast, runs successfully:

fun <- function() system2("qsub", "job.sh")
px <- callr::r_bg(func = fun)
px$wait() # get_result() errors if the process is still running
out <- px$get_result()
out
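When the comparison is run as a script rather than typed interactively, both interfaces need an explicit wait before the result is read. The sketch below wraps each one in a small helper. `probe`, `run_in_session`, `run_in_bg`, and the 60-second timeout are hypothetical names and values chosen for illustration; `probe` is a harmless stand-in for the `qsub` call so the harness can be tried on any machine.

```r
# Hypothetical stand-in for the real job: returns the child's process ID.
probe <- function() Sys.getpid()

# Run `fun` in a persistent session and wait for its result.
run_in_session <- function(fun, timeout_ms = 60000) {
  px <- callr::r_session$new()
  on.exit(px$close(), add = TRUE)
  px$call(func = fun, args = list())
  # Block until the session reports a result or the timeout elapses.
  px$poll_process(timeout_ms)
  px$read()
}

# Run `fun` in a one-off background process and wait for its result.
run_in_bg <- function(fun) {
  px <- callr::r_bg(func = fun, args = list())
  px$wait()
  px$get_result()
}

out_session <- run_in_session(probe)
out_bg <- run_in_bg(probe)
```

Swapping `probe` for `function() system2("qsub", "job.sh")` should give the same side-by-side comparison as the two snippets above. Note that `run_in_session()` returns the full message list from `read()` (with `result` and `error` fields), while `run_in_bg()` returns the function's value directly.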

I will see if my sys admin knows what I could do to troubleshoot.

wlandau, Apr 20 '22 12:04

That's a good step. But I suspect that I would still need to set up an SGE cluster to run qsub.

gaborcsardi, Apr 20 '22 12:04

I asked my sys admin about containerizing SGE, but unfortunately he did not seem to think that was feasible. We debugged for a while, and he plans to send strace output to an SGE developer. I could share the trace with you by email or some other way that is not public-facing.

Also, I noticed that r_session$new() runs R --no-readline --slave --no-save --no-restore. I thought my issue might have something to do with command line flags, so I tried R --no-readline --slave --no-save --no-restore -e 'system2("qsub", "job.sh")', but the job submitted successfully.
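The startup flags can also be compared from inside R instead of rebuilding the command line by hand. This is a sketch rather than a definitive check: it assumes that `r_session_options()` exposes the session flags as `cmdargs` and that the r_bg() defaults can be read from its `cmdargs` formal argument. The names `session_args` and `bg_args` are illustrative.

```r
# Startup flags callr passes to R for a persistent session.
session_args <- callr::r_session_options()$cmdargs

# Default startup flags for r_bg(), read from its `cmdargs` formal argument.
bg_args <- eval(formals(callr::r_bg)$cmdargs)

# Flags used by one interface but not the other.
print(setdiff(session_args, bg_args))
print(setdiff(bg_args, session_args))
```

To rule the flags out from within callr, the session flags can be passed to the other interface, for example `callr::r_bg(fun, cmdargs = session_args)`. If that call still submits successfully, the difference lies elsewhere.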

wlandau, Apr 25 '22 13:04

FWIW there are some (old) dockerfiles with SGE, e.g. https://github.com/stevekm/docker-centos6-sge

There is also this old recipe, which probably does not work any more: https://gist.github.com/dan-blanchard/6586533

Sure, you can send the strace to me in email.

gaborcsardi, Jun 21 '22 13:06

Confirmed fixed in https://github.com/r-lib/processx/commit/1082c9db2345b8dfa5c45eb88711a42a0b681ae7. Thanks @gaborcsardi!

wlandau, Oct 09 '22 13:10