BiocParallel
BiocParallel copied to clipboard
Import of BigWig files with rtracklayer fails under MultiCoreParam
I'm seeing a weird bug in in my CAGEfightR package when trying to import BigWigFiles.
Here's a reproducible example:
# Load the example data
library(BiocParallel)
library(rtracklayer)
library(CAGEfightR)
data('exampleDesign')
# Example bigwig files
bw_plus <- system.file('extdata', exampleDesign$BigWigPlus,
package = 'CAGEfightR')
bw_plus <- BigWigFileList(bw_plus)
# Genome
mm9 <- SeqinfoForUCSCGenome("mm9")
grl <- tileGenome(mm9, ntile=3)
# Import, knowing that some areas will be empty
import(bw_plus[[1]], which=grl[[1]]) # empty
import(bw_plus[[1]], which=grl[[2]]) # empty
import(bw_plus[[1]], which=grl[[3]]) # has signal
# This works
lapply(bw_plus, import, which=grl[[1]])
lapply(bw_plus, import, which=grl[[2]])
lapply(bw_plus, import, which=grl[[3]])
# This works
register(SerialParam())
bplapply(bw_plus, import, which=grl[[1]])
bplapply(bw_plus, import, which=grl[[2]])
bplapply(bw_plus, import, which=grl[[3]])
# This works
register(MulticoreParam(3))
bplapply(bw_plus, import)
# This fails!
register(MulticoreParam(3))
bplapply(bw_plus, import, which=grl[[1]])
bplapply(bw_plus, import, which=grl[[2]])
bplapply(bw_plus, import, which=grl[[3]])
To summarize:
- You can import entire BigWig-files using bplapply.
- You can import part of a BigWig using bplapply and SerialParam
- You cannot import an empty tile from a BigWig file with bplapply and MultiCoreParam.
What's causing this strange behavior?
Can you clarify what 'fails' mean? I see
> lapply(bw_plus, import, which=grl[[1]])
[[1]]
GRanges object with 0 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
-------
seqinfo: 2 sequences from an unspecified genome
[[2]]
GRanges object with 0 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
-------
seqinfo: 2 sequences from an unspecified genome
[[3]]
GRanges object with 0 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
-------
seqinfo: 2 sequences from an unspecified genome
Warning messages:
1: In .local(con, format, text, ...) :
'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
2: In .normarg_seqnames2(seqnames, seqinfo) :
levels in 'seqnames' with no entries in 'seqinfo' were dropped
3: In .local(con, format, text, ...) :
'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
4: In .normarg_seqnames2(seqnames, seqinfo) :
levels in 'seqnames' with no entries in 'seqinfo' were dropped
5: In .local(con, format, text, ...) :
'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
6: In .normarg_seqnames2(seqnames, seqinfo) :
levels in 'seqnames' with no entries in 'seqinfo' were dropped
> bplapply(bw_plus, import, which=grl[[1]])
[[1]]
GRanges object with 0 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
-------
seqinfo: 2 sequences from an unspecified genome
[[2]]
GRanges object with 0 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
-------
seqinfo: 2 sequences from an unspecified genome
[[3]]
GRanges object with 0 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
-------
seqinfo: 2 sequences from an unspecified genome
Warning messages:
1: In .local(con, format, text, ...) :
'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
2: In .normarg_seqnames2(seqnames, seqinfo) :
levels in 'seqnames' with no entries in 'seqinfo' were dropped
3: In .local(con, format, text, ...) :
'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
4: In .normarg_seqnames2(seqnames, seqinfo) :
levels in 'seqnames' with no entries in 'seqinfo' were dropped
5: In .local(con, format, text, ...) :
'which' contains seqnames not known to BigWig file: chr1, chr2, chr3, chr4, chr5, chr6
6: In .normarg_seqnames2(seqnames, seqinfo) :
levels in 'seqnames' with no entries in 'seqinfo' were dropped
Fails here means an error is raised, and a rather cryptic one relating to the parallel processing.
When calling import with the which argument, it will produce a warning if you request one or more seqlevels not in the bigwig file.
can you cut-and-paste the command and error message? It doesn't fail in this way for me... Also please add the output of sessionInfo()
and BiocManager::valid()
Sure, continuing from the example in the first post:
> bplapply(bw_plus, import, which=grl[[1]])
Error in result[[njob]] <- value :
attempt to select less than one element in OneIndex
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
3 parallel jobs did not deliver results
Session info:
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin19.2.0 (64-bit)
Running under: macOS Catalina 10.15.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] CAGEfightR_1.7.1 SummarizedExperiment_1.16.1 DelayedArray_0.12.2
[4] BiocParallel_1.20.1 matrixStats_0.55.0 Biobase_2.46.0
[7] rtracklayer_1.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
[10] IRanges_2.20.2 S4Vectors_0.24.3 BiocGenerics_0.32.0
[13] rstudioapi_0.11
loaded via a namespace (and not attached):
[1] ProtGenerics_1.18.0 bitops_1.0-6 bit64_0.9-7
[4] RColorBrewer_1.1-2 progress_1.2.2 httr_1.4.1
[7] GenomicFiles_1.22.0 tools_3.6.2 backports_1.1.5
[10] R6_2.4.1 rpart_4.1-15 Hmisc_4.3-1
[13] DBI_1.1.0 lazyeval_0.2.2 Gviz_1.30.3
[16] colorspace_1.4-1 nnet_7.3-13 tidyselect_1.0.0
[19] gridExtra_2.3 prettyunits_1.1.1 bit_1.1-15.2
[22] curl_4.3 compiler_3.6.2 htmlTable_1.13.3
[25] microbenchmark_1.4-7 scales_1.1.0 checkmate_2.0.0
[28] askpass_1.1 rappdirs_0.3.1 stringr_1.4.0
[31] digest_0.6.25 Rsamtools_2.2.3 foreign_0.8-76
[34] XVector_0.26.0 dichromat_2.0-0 base64enc_0.1-3
[37] jpeg_0.1-8.1 pkgconfig_2.0.3 htmltools_0.4.0
[40] regioneR_1.18.1 ensembldb_2.10.2 BSgenome_1.54.0
[43] dbplyr_1.4.2 htmlwidgets_1.5.1 rlang_0.4.5
[46] pryr_0.1.4 RSQLite_2.2.0 DelayedMatrixStats_1.8.0
[49] acepack_1.4.1 dplyr_0.8.4 VariantAnnotation_1.32.0
[52] RCurl_1.98-1.1 magrittr_1.5 GenomeInfoDbData_1.2.2
[55] Formula_1.2-3 Matrix_1.2-18 Rcpp_1.0.3
[58] munsell_0.5.0 lifecycle_0.1.0 stringi_1.4.6
[61] zlibbioc_1.32.0 BiocFileCache_1.10.2 grid_3.6.2
[64] blob_1.2.1 crayon_1.3.4 lattice_0.20-40
[67] Biostrings_2.54.0 splines_3.6.2 GenomicFeatures_1.38.2
[70] hms_0.5.3 knitr_1.28 pillar_1.4.3
[73] codetools_0.2-16 biomaRt_2.42.0 XML_3.99-0.3
[76] glue_1.3.1 biovizBase_1.34.1 latticeExtra_0.6-29
[79] data.table_1.12.8 png_0.1-7 vctrs_0.2.3
[82] gtable_0.3.0 openssl_1.4.1 purrr_0.3.3
[85] assertthat_0.2.1 ggplot2_3.2.1 xfun_0.12
[88] AnnotationFilter_1.10.0 survival_3.1-8 tibble_2.1.3
[91] GenomicAlignments_1.22.1 AnnotationDbi_1.48.0 memoise_1.1.0
[94] cluster_2.1.0
BiocManager::valid():
* sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin19.2.0 (64-bit)
Running under: macOS Catalina 10.15.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] CAGEfightR_1.7.1 SummarizedExperiment_1.16.1 DelayedArray_0.12.2
[4] BiocParallel_1.20.1 matrixStats_0.55.0 Biobase_2.46.0
[7] rtracklayer_1.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.0
[10] IRanges_2.20.2 S4Vectors_0.24.3 BiocGenerics_0.32.0
[13] rstudioapi_0.11
loaded via a namespace (and not attached):
[1] ProtGenerics_1.18.0 bitops_1.0-6 bit64_0.9-7
[4] RColorBrewer_1.1-2 progress_1.2.2 httr_1.4.1
[7] GenomicFiles_1.22.0 tools_3.6.2 backports_1.1.5
[10] R6_2.4.1 rpart_4.1-15 Hmisc_4.3-1
[13] DBI_1.1.0 lazyeval_0.2.2 Gviz_1.30.3
[16] colorspace_1.4-1 nnet_7.3-13 tidyselect_1.0.0
[19] gridExtra_2.3 prettyunits_1.1.1 bit_1.1-15.2
[22] curl_4.3 compiler_3.6.2 htmlTable_1.13.3
[25] microbenchmark_1.4-7 scales_1.1.0 checkmate_2.0.0
[28] askpass_1.1 rappdirs_0.3.1 stringr_1.4.0
[31] digest_0.6.25 Rsamtools_2.2.3 foreign_0.8-76
[34] XVector_0.26.0 dichromat_2.0-0 base64enc_0.1-3
[37] jpeg_0.1-8.1 pkgconfig_2.0.3 htmltools_0.4.0
[40] regioneR_1.18.1 ensembldb_2.10.2 BSgenome_1.54.0
[43] dbplyr_1.4.2 htmlwidgets_1.5.1 rlang_0.4.5
[46] pryr_0.1.4 RSQLite_2.2.0 DelayedMatrixStats_1.8.0
[49] acepack_1.4.1 dplyr_0.8.4 VariantAnnotation_1.32.0
[52] RCurl_1.98-1.1 magrittr_1.5 GenomeInfoDbData_1.2.2
[55] Formula_1.2-3 Matrix_1.2-18 Rcpp_1.0.3
[58] munsell_0.5.0 lifecycle_0.1.0 stringi_1.4.6
[61] zlibbioc_1.32.0 BiocFileCache_1.10.2 grid_3.6.2
[64] blob_1.2.1 crayon_1.3.4 lattice_0.20-40
[67] Biostrings_2.54.0 splines_3.6.2 GenomicFeatures_1.38.2
[70] hms_0.5.3 knitr_1.28 pillar_1.4.3
[73] codetools_0.2-16 biomaRt_2.42.0 XML_3.99-0.3
[76] glue_1.3.1 biovizBase_1.34.1 latticeExtra_0.6-29
[79] BiocManager_1.30.10 data.table_1.12.8 png_0.1-7
[82] vctrs_0.2.3 gtable_0.3.0 openssl_1.4.1
[85] purrr_0.3.3 assertthat_0.2.1 ggplot2_3.2.1
[88] xfun_0.12 AnnotationFilter_1.10.0 survival_3.1-8
[91] tibble_2.1.3 GenomicAlignments_1.22.1 AnnotationDbi_1.48.0
[94] memoise_1.1.0 cluster_2.1.0
Bioconductor version '3.10'
* 2 packages out-of-date
* 1 packages too new
create a valid installation with
BiocManager::install(c(
"CAGEfightR", "MultiAssayExperiment", "nloptr"
), update = TRUE, ask = FALSE)
more details: BiocManager::valid()$too_new, BiocManager::valid()$out_of_date
Warning message:
2 packages out-of-date; 1 packages too new
Your reproducible example doesn't reproduce for me ;) does
bplapply(1:5, log, base = 2)
work?
Ah, the plot thickens then!
It seems bplapply fails specifically when run with MultiCoreParam AND the which argument is used with import.
Here's another example (reusing files from the original example):
# bplapply test works (no warnings)
z1 <- bplapply(1:5, log, base = 2)
# Single import works (only raises a warning)
y1 <- import(bw_plus[[1]], which=grl[[1]]) # empty
y2 <- import(bw_plus[[1]], which=grl[[2]]) # empty
y3 <- import(bw_plus[[1]], which=grl[[3]]) # has signal
# All these works (only raises a warning)
x1 <- lapply(bw_plus, import)
x2 <- bplapply(bw_plus, import, BPPARAM=SerialParam())
x3 <- bplapply(bw_plus, import, BPPARAM=MulticoreParam(3))
x4 <- lapply(bw_plus, import, which=grl[[1]])
x5 <- bplapply(bw_plus, import, which=grl[[1]], BPPARAM=SerialParam())
# This specific case raises an error!
x6 <- bplapply(bw_plus, import, which=grl[[1]], BPPARAM=MulticoreParam(3))
Error in result[[njob]] <- value :
attempt to select less than one element in OneIndex
In addition: Warning messages:
1: In parallel::mccollect(wait = TRUE) :
1 parallel job did not deliver a result
2: In parallel::mccollect(wait = FALSE, timeout = 1) :
3 parallel jobs did not deliver results
I have no idea how to decipher the error message. Could it be:
- RStudio-related?
- rtracklayer::import related?
- macOS related?
I originally encountered the problem using GenomicFiles::reduceByRange, which needs to load separate parts of a BigWigFile.
And some more testing:
Both of these work:
x7 <- bplapply(bw_plus, import, which=grl[[1]], BPPARAM=MulticoreParam(1))
x8 <- bplapply(bw_plus, import, which=grl[[1]], BPPARAM=SnowParam(2))
So MultiCoreParam only fails when more than 1 core is used. SnowParam works without errors.
FYI, they recommend against using forked parallel processing in the RStudio Console. I would rule out that player. Try to replicate it without RStudio by running R in the terminal.
Still gives an error message in the terminal:
Next I would try with:
x <- parallel::mclapply(bw_plus, import, which=grl[[1]], mc.cores=3L)
Aha, Mother Of All Error Messages:
> x <- parallel::mclapply(bw_plus, import, which=grl[[1]], mc.cores=3L)
*** caught segfault ***
*** caught segfault ***
address 0x10bdb8020, cause 'memory not mapped'address 0x10bdb8020, cause 'memory not mapped'
*** caught segfault ***
address 0x10bdb8020, cause 'memory not mapped'
Traceback:
1: strsplit(x, "\n[ \t\n]*\n", perl = TRUE)
2: lapply(strsplit(x, "\n[ \t\n]*\n", perl = TRUE), strsplit, "[ \t\n]", perl = TRUE)
3: strwrap(paste0(c(...), collapse = ""))
4: paste0(strwrap(paste0(c(...), collapse = "")), collapse = "\n ")
Traceback:
5: 1: wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped")strsplit(x, "\n[ \t\n]*\n", perl = TRUE)
6:
2: warning(wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped"))lapply(strsplit(x, "\n[ \t\n]*\n", perl = TRUE), strsplit, "[ \t\n]",
7: perl = TRUE).normarg_seqnames2(seqnames, seqinfo)
3: 8: strwrap(paste0(c(...), collapse = ""))new_GRanges("GRanges", seqnames = seqnames, ranges = ranges,
strand = strand, mcols = mcols, seqinfo = seqinfo) 4:
paste0(strwrap(paste0(c(...), collapse = "")), collapse = "\n ") 9:
GRanges(rep(seqnames(which), nhits), C_ans[[1L]], seqinfo = si) 5:
wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped")10:
.local(con, format, text, ...) 6:
warning(wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped"))11: FUN(X[[i]], ...)
7:
12: .normarg_seqnames2(seqnames, seqinfo)FUN(X[[i]], ...)
Traceback:
8:
13: new_GRanges("GRanges", seqnames = seqnames, ranges = ranges, 1: lapply(X = S, FUN = FUN, ...)strsplit(x, "\n[ \t\n]*\n", perl = TRUE)
strand = strand, mcols = mcols, seqinfo = seqinfo) 2: 14:
9: lapply(strsplit(x, "\n[ \t\n]*\n", perl = TRUE), strsplit, "[ \t\n]", doTryCatch(return(expr), name, parentenv, handler)GRanges(rep(seqnames(which), nhits), C_ans[[1L]], seqinfo = si) perl = TRUE)
15: 10: 3: tryCatchOne(expr, names, parentenv, handlers[[1L]]).local(con, format, text, ...)strwrap(paste0(c(...), collapse = ""))
16: 11: 4: tryCatchList(expr, classes, parentenv, handlers)paste0(strwrap(paste0(c(...), collapse = "")), collapse = "\n ")FUN(X[[i]], ...)
5: 12: wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped")
FUN(X[[i]], ...)
17:
6: tryCatch(expr, error = function(e) {13: warning(wmsg("levels in 'seqnames' with no entries ", "in 'seqinfo' were dropped")) call <- conditionCall(e)lapply(X = S, FUN = FUN, ...)
if (!is.null(call)) {
7: 14: .normarg_seqnames2(seqnames, seqinfo) if (identical(call[[1L]], quote(doTryCatch)))
doTryCatch(return(expr), name, parentenv, handler) call <- sys.call(-4L) 8: dcall <- deparse(call)[1L]
prefix <- paste("Error in", dcall, ": ")15: new_GRanges("GRanges", seqnames = seqnames, ranges = ranges, LONG <- 75LtryCatchOne(expr, names, parentenv, handlers[[1L]]) strand = strand, mcols = mcols, seqinfo = seqinfo) sm <- strsplit(conditionMessage(e), "\n")[[1L]]
16: 9: tryCatchList(expr, classes, parentenv, handlers) w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")GRanges(rep(seqnames(which), nhits), C_ans[[1L]], seqinfo = si)
if (is.na(w))
17: w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], 10: type = "b")tryCatch(expr, error = function(e) {.local(con, format, text, ...) if (w > LONG) call <- conditionCall(e)
prefix <- paste0(prefix, "\n ") if (!is.null(call)) {11: } if (identical(call[[1L]], quote(doTryCatch))) FUN(X[[i]], ...) else prefix <- "Error : " call <- sys.call(-4L)0m
msg <- paste0(prefix, conditionMessage(e), "\n") dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L12: sm <- strsplit(conditionMessage(e), "\n")[[1L]] .Internal(seterrmessage(msg[1L])) w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) if (!silent && isTRUE(getOption("show.error.messages"))) { w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], cat(msg, file = outFile) type = "b") .Internal(printDeferredWarnings())FUN(X[[i]], ...) if (w > LONG) prefix <- paste0(prefix, "\n ")
} }13: lapply(X = S, FUN = FUN, ...)
14: else prefix <- "Error : "doTryCatch(return(expr), name, parentenv, handler) invisible(structure(msg, class = "try-error", condition = e))
15: msg <- paste0(prefix, conditionMessage(e), "\n")})tryCatchOne(expr, names, parentenv, handlers[[1L]]) .Internal(seterrmessage(msg[1L]))
if (!silent && isTRUE(getOption("show.error.messages"))) {16: 18: cat(msg, file = outFile)tryCatchList(expr, classes, parentenv, handlers)try(lapply(X = S, FUN = FUN, ...), silent = TRUE) .Internal(printDeferredWarnings())
}
17: 19: invisible(structure(msg, class = "try-error", condition = e))tryCatch(expr, error = function(e) {})sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) call <- conditionCall(e)
if (!is.null(call)) {18: 20: if (identical(call[[1L]], quote(doTryCatch))) try(lapply(X = S, FUN = FUN, ...), silent = TRUE)FUN(X[[i]], ...) call <- sys.call(-4L)
dcall <- deparse(call)[1L]19: prefix <- paste("Error in", dcall, ": ")21: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) LONG <- 75Llapply(seq_len(cores), inner.do)
sm <- strsplit(conditionMessage(e), "\n")[[1L]]20: 22: parallel::mclapply(bw_plus, import, which = grl[[1]], mc.cores = 3L) w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")FUN(X[[i]], ...) if (is.na(w))
w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], 21: type = "b")lapply(seq_len(cores), inner.do) if (w > LONG)
prefix <- paste0(prefix, "\n ")22: } else prefix <- "Error : "parallel::mclapply(bw_plus, import, which = grl[[1]], mc.cores = 3L)
msg <- paste0(prefix, conditionMessage(e), "\n")
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace .Internal(seterrmessage(msg[1L]))
if (!silent && isTRUE(getOption("show.error.messages"))) {election: m
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
cat(msg, file = outFile) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
18: try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
19: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))
20: FUN(X[[i]], ...)
21: Selection: lapply(seq_len(cores), inner.do)
22: parallel::mclapply(bw_plus, import, which = grl[[1]], mc.cores = 3L)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:
I'm still not able to reproduce this. Can you confirm that a file 'test.R' containing only
# Load the example data
library(BiocParallel)
library(rtracklayer)
library(CAGEfightR)
# Example bigwig files
bw_plus <- system.file('extdata', exampleDesign$BigWigPlus,
package = 'CAGEfightR')
bw_plus <- BigWigFileList(bw_plus)
# Genome
mm9 <- SeqinfoForUCSCGenome("mm9")
grl <- tileGenome(mm9, ntile=3)
register(MulticoreParam(3))
bplapply(bw_plus, import, which=grl[[1]])
generates the error when run from the command line with R --vanilla -f test.R
.
I guess that more careful management of seqlevels means that the warning messages and hence crash goes away...?
bplapply(bw_plus, function(bw, which) {
seqlevels(which, pruning.mode = "coarse") <- seqlevels(bw)
import(bw, which = which)
}, which = grl[[3]])
Still raises an error:
maltethodberg@ip164-91:~/Desktop$ R --vanilla -f test.R
R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin19.2.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> # Load the example data
> library(BiocParallel)
> library(rtracklayer)
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Loading required package: GenomeInfoDb
> library(CAGEfightR)
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: DelayedArray
Loading required package: matrixStats
Attaching package: ‘matrixStats’
The following objects are masked from ‘package:Biobase’:
anyMissing, rowMedians
Attaching package: ‘DelayedArray’
The following objects are masked from ‘package:matrixStats’:
colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
The following objects are masked from ‘package:base’:
aperm, apply, rowsum
>
> # Example bigwig files
> bw_plus <- system.file('extdata', exampleDesign$BigWigPlus,
+ package = 'CAGEfightR')
> bw_plus <- BigWigFileList(bw_plus)
>
> # Genome
> mm9 <- SeqinfoForUCSCGenome("mm9")
> grl <- tileGenome(mm9, ntile=3)
>
> register(MulticoreParam(3))
> bplapply(bw_plus, import, which=grl[[1]])
Error in result[[njob]] <- value :
attempt to select less than one element in OneIndex
Calls: bplapply ... bplapply -> bplapply -> bplapply -> bploop -> bploop.lapply
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
1 parallel job did not deliver a result
Execution halted
If this is some deep unexplained bug, you're right that it might be easier to modify the code to avoid warnings.
The warning:
Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
1 parallel job did not deliver a result
strongly suggests that the forked worker process died, cf. Issue #112. The reason for why it died seems to be suggested by https://github.com/Bioconductor/BiocParallel/issues/111#issuecomment-595232972:
> x <- parallel::mclapply(bw_plus, import, which=grl[[1]], mc.cores=3L)
*** caught segfault ***
*** caught segfault ***
address 0x10bdb8020, cause 'memory not mapped'address 0x10bdb8020, cause 'memory not mapped'
*** caught segfault ***
address 0x10bdb8020, cause 'memory not mapped'
To me, the problem is most likely not in BiocParallel.
I would:
- Troubleshoot in a plain R and definitely not in RStudio
- Always troubleshoot in a fresh R session, i.e. use
R --vanilla
to avoid.Rprofile
etc. - Replicate with two workers (
mc.cores = 2L
) instead of three. It's not possible to go lower, i.e. don't usemc.cores = 1L
- it's different. - Minimize the example further, e.g. can the error be reproduced with a single BigWig file (now three)?
- Post a new minimal example where BiocParallel is not attached/used.
- Share a new
sessionInfo()
and more output, e.g.print(grl[[1]])
. - Do also last comment in https://github.com/Bioconductor/BiocParallel/issues/111#issuecomment-595673143
I'll mention that valgrind
reports an illegal read (the code takes strlen(x)
where x is not null terminated) but it seems unlikely that that would cause a segfault, and it's not actually at the place where the traceback suggests things are going south (in R code signaling a warning...)
Having very similar issues with rtracklayer::import
on remote bigwig files when using which=
.
parallel::mclapply(..., mc.core=1)
works fine, but parallel::mclapply(..., mc.core=2)
does not, nor does BioParallel:bplapply
when I register workers >1 with MulticoreParam
@MalteThodberg Base on your first post, I'm assuming your package also relies on rtracklayer
. Trying to figure out the common denominator.
It seems rtracklayer
has been accumulating quite a few issues for several years, so it's possible something is broken that's related to this:
https://github.com/lawremi/rtracklayer/issues
I'm wondering if this has something to do with repeated calls to the underlying C library that are made every time you call import on a bigwig
with rtracklayer
. When multiple calls are made at once (i.e. during parallelization) this might cause issues. @sanchit-saini
I ran into similar issues recently when trying to parallelize VariantAnnotation::readVcf
, with an explanation by @mtmorgan here.
I'm not exactly sure why we went down the rabbit hole above, but I think it's likely (though a bit mysterious) to say that the C code in rtracklayer is not re-entrant, so access via forked processes has unpredictable results.
I would guess that SnowParam (i.e., independent rather than forked processes) is more robust. This does require (a) remembering that these processes are independent, so individual packages need to be loaded explicitly (e.g., via rtracklayer::...()
) on the worker and (b) that data is serialized both to and from the worker. The later means that one wants to construct small objects on the way there (e.g., file names, rather than the parsed content of the file) and back (e.g,. summary statistics rather than a GRanges representation of the input). This also means that there can be relatively considerable costs of 'start-up' (loading packages) that make inexpensive 'worker' operations not worth the effort -- the worker needs to do a lot of work.
I don't think there is an easy solution here -- if the C library is not re-entrant, then very considerable work will be needed to make it so.
Is it possible to detect if the current R process is forked and warn when the non-reentrant method is called from a forked process? That would make it easier for users to discover the underlying issue.
Is it possible to detect if the current R process is forked and warn when the non-reentrant method is called from a forked process? That would make it easier for users to discover the underlying issue.
parallel:::isChild()
See also PR#18230 (WISH: Export parallel:::isChild() to make it possible to protect against forked processing).
As @HenrikBengtsson notes it's possible to find out that one is in a forked process (useful for avoiding nested forked processes and over-subscription to the available CPUs) but unfortunately @DarwinAwardWinner there is no way to know that a particular function call is non-reentrant (n.b., it is speculation that this is the problem in the current thread or for @bschilder 's use; it would be useful, though not definitive, to show that the code works without problem under SnowParam()
).
Despite the appeal of forked (MulticoreParam()
) processing for scripts and 'light-weight' tasks, I have come to believe that developers (rather than users writing scripts) should implement their code with SnowParam()
and independent processes. In addition to the problem in this thread, there is a discussion in the Bioconductor slack about interaction between the garbage collector and memory allocation in the forked process, and the basic observation that approximately 1/2 of Bioconductor package downloads are to Windows machines -- targeting MulticoreParam()
as the 'efficient' solution excludes 1/2 a package author's audience. Using separate processes requires much more discipline / sophistication to manage workers and data transfer. There have been efforts in BiocParallel to make this more transparent, especially in the last release and thanks to @Jiefei-Wang.