BiocParallel fails to start with MPI
Hello everyone, we are having trouble running BiocParallel within our SLURM cluster environment.
The foo.R script we are trying to run is:
library("BiocParallel")
library("Rmpi")
param <- SnowParam(workers = 3, type = "MPI")
FUN <- function(i) system("hostname", intern=TRUE)
bplapply(1:6, FUN, BPPARAM = param)
If we request an interactive job allocation, for example with salloc -p mpi -N 2 -n 4 -t 1:00:00
and then start R with:
mpiexec -np 1 R --no-save
and run the above script in this interactive session, we get the expected result:
> library("BiocParallel")
library("BiocParallel")
> library("Rmpi")
library("Rmpi")
> param <- SnowParam(workers = 3, type = "MPI")
param <- SnowParam(workers = 3, type = "MPI")
> FUN <- function(i) system("hostname", intern=TRUE)
FUN <- function(i) system("hostname", intern=TRUE)
> bplapply(1:6, FUN, BPPARAM = param)
bplapply(1:6, FUN, BPPARAM = param)
3 slaves are spawned successfully. 0 failed.
[[1]]
[1] "compute-a-16-21"
[[2]]
[1] "compute-a-16-21"
[[3]]
[1] "compute-a-16-22"
[[4]]
[1] "compute-a-16-22"
[[5]]
[1] "compute-a-16-22"
[[6]]
[1] "compute-a-16-22"
However, if we try to run the same R script from within an sbatch job with:
#!/bin/bash
#SBATCH -p mpi
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -t 2:00:00
mpiexec -np 1 Rscript foo.R # or R CMD BATCH foo.R
The execution hangs for several seconds and eventually fails with the MPI error:
[compute-a-16-21:10780] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 193
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
Does anyone have any idea why the primary R process fails to start the other tasks?
Thank you, Raffaele
Update: starting the batch job with
mpiexec -np 1 R --no-save --file=foo.R
instead of R CMD BATCH or Rscript seems to work. The execution still ends with an Open MPI error, since the task simply dies at the end, but at least it does run hostname across the distributed nodes.
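For reference, here is a sketch of the adjusted batch script, assuming the same allocation as above (only the launch line changes):

#!/bin/bash
#SBATCH -p mpi
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -t 2:00:00

# Launch a single R process; Rmpi/snow then spawn the workers over MPI.
# Using 'R --no-save --file=' instead of 'R CMD BATCH' or 'Rscript'
# avoids the MPI_Init timeout reported above.
mpiexec -np 1 R --no-save --file=foo.R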
Can you try using the BiocParallel::BatchToolsParam() interface on your SLURM cluster?
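A minimal sketch of what that could look like, assuming the batchtools package is installed and that slurm.tmpl is a batchtools SLURM template adapted to the local cluster (the template path and the resource names are assumptions; resource names must match the placeholders used in the template):

library(BiocParallel)

## Assumed template: start from batchtools' slurm-simple.tmpl and
## adapt the partition, modules, etc. for the local cluster.
param <- BatchToolsParam(workers = 3,
                         cluster = "slurm",
                         template = "slurm.tmpl",
                         resources = list(walltime = 3600, ncpus = 1))

## bplapply submits the work as SLURM jobs via batchtools,
## so no MPI spawning is involved.
FUN <- function(i) system("hostname", intern = TRUE)
bplapply(1:6, FUN, BPPARAM = param)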