rtracklayer
rtracklayer copied to clipboard
Unable to read remote bed.bgz
Hello,
I'm trying to query subsets of Roadmap bed.gz files. However there seems to be some issues here:
Reprex
Code
URL <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"
res <- rtracklayer::import(URL)
## same error when using `which`
gr <- GenomicRanges::GRanges("chr4:14737349-16737284")
res <- rtracklayer::import(URL, which=gr)
Console output
Error in .new_IRanges_from_start_end(start, end) :
'start' or 'end' cannot contain NAs
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
Traceback
Error in .new_IRanges_from_start_end(start, end) :
'start' or 'end' cannot contain NAs
11.
stop(wmsg("'start' or 'end' cannot contain NAs"))
10.
.new_IRanges_from_start_end(start, end)
9.
.new_IRanges(start = start, end = end, width = width)
8.
IRanges(bed$thickStart + 1L, thickEnd)
7.
.local(con, format, text, ...)
6.
import(con, ...)
5.
import(con, ...)
4.
import(FileForFormat(con), ...)
3.
import(FileForFormat(con), ...)
2.
rtracklayer::import(URL)
1.
rtracklayer::import(URL)
Session info
Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages: [1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] utf8_1.2.2 reticulate_1.25 R.utils_2.12.0
[4] tidyselect_1.1.2 RSQLite_2.2.16 AnnotationDbi_1.58.0
[7] htmlwidgets_1.5.4 grid_4.2.1 BiocParallel_1.30.3
[10] XGR_1.1.8 munsell_0.5.0 codetools_0.2-18
[13] interp_1.1-3 DT_0.24 colorspace_2.0-3
[16] OrganismDbi_1.38.1 Biobase_2.56.0 filelock_1.0.2
[19] knitr_1.40 supraHex_1.34.0 rstudioapi_0.14
[22] stats4_4.2.1 DescTools_0.99.45 MatrixGenerics_1.8.1
[25] GenomeInfoDbData_1.2.8 bit64_4.0.5 echoconda_0.99.7
[28] basilisk_1.8.1 vctrs_0.4.1 generics_0.1.3
[31] xfun_0.32 biovizBase_1.44.0 BiocFileCache_2.4.0
[34] R6_2.5.1 GenomeInfoDb_1.32.3 AnnotationFilter_1.20.0
[37] bitops_1.0-7 cachem_1.0.6 reshape_0.8.9
[40] DelayedArray_0.22.0 assertthat_0.2.1 BiocIO_1.6.0
[43] scales_1.2.1 nnet_7.3-17 rootSolve_1.8.2.3
[46] gtable_0.3.0 lmom_2.9 ggbio_1.44.1
[49] ensembldb_2.20.2 rlang_1.0.4 echodata_0.99.12
[52] splines_4.2.1 lazyeval_0.2.2 rtracklayer_1.57.0
[55] dichromat_2.0-0.1 hexbin_1.28.2 checkmate_2.1.0
[58] BiocManager_1.30.18 yaml_2.3.5 reshape2_1.4.4
[61] GenomicFeatures_1.48.3 ggnetwork_0.5.10 backports_1.4.1
[64] Hmisc_4.7-1 RBGL_1.72.0 tools_4.2.1
[67] ggplot2_3.3.6 ellipsis_0.3.2 RColorBrewer_1.1-3
[70] proxy_0.4-27 BiocGenerics_0.42.0 Rcpp_1.0.9
[73] plyr_1.8.7 base64enc_0.1-3 progress_1.2.2
[76] zlibbioc_1.42.0 purrr_0.3.4 RCurl_1.98-1.8
[79] basilisk.utils_1.8.0 prettyunits_1.1.1 rpart_4.1.16
[82] deldir_1.0-6 S4Vectors_0.34.0 SummarizedExperiment_1.26.1
[85] ggrepel_0.9.1 cluster_2.1.4 fs_1.5.2
[88] crul_1.2.0 magrittr_2.0.3 data.table_1.14.2
[91] echotabix_0.99.7 dnet_1.1.7 openxlsx_4.2.5
[94] mvtnorm_1.1-3 ProtGenerics_1.28.0 matrixStats_0.62.0
[97] patchwork_1.1.2 hms_1.1.2 evaluate_0.16
[100] XML_3.99-0.10 jpeg_0.1-9 readxl_1.4.1
[103] IRanges_2.30.1 gridExtra_2.3 compiler_4.2.1
[106] biomaRt_2.52.0 tibble_3.1.8 crayon_1.5.1
[109] R.oo_1.25.0 htmltools_0.5.3 echoannot_0.99.7
[112] tzdb_0.3.0 Formula_1.2-4 tidyr_1.2.0
[115] expm_0.999-6 Exact_3.1 DBI_1.1.3
[118] dbplyr_2.2.1 MASS_7.3-58.1 rappdirs_0.3.3
[121] boot_1.3-28 Matrix_1.4-1 readr_2.1.2
[124] piggyback_0.1.4 cli_3.3.0 R.methodsS3_1.8.2
[127] parallel_4.2.1 igraph_1.3.4 GenomicRanges_1.48.0
[130] pkgconfig_2.0.3 GenomicAlignments_1.32.1 dir.expiry_1.4.0
[133] RCircos_1.2.2 foreign_0.8-82 osfr_0.2.8
[136] xml2_1.3.3 XVector_0.36.0 stringr_1.4.1
[139] VariantAnnotation_1.42.1 digest_0.6.29 graph_1.74.0
[142] httpcode_0.3.0 Biostrings_2.64.1 rmarkdown_2.16
[145] cellranger_1.1.0 htmlTable_2.4.1 gld_2.6.5
[148] restfulr_0.0.15 curl_4.3.2 Rsamtools_2.12.0
[151] rjson_0.2.21 lifecycle_1.0.1 nlme_3.1-159
[154] jsonlite_1.8.0 BSgenome_1.64.0 fansi_1.0.3
[157] pillar_1.8.1 lattice_0.20-45 GGally_2.1.2
[160] KEGGREST_1.36.3 fastmap_1.1.0 httr_1.4.4
[163] survival_3.4-0 glue_1.6.2 zip_2.2.0
[166] png_0.1-7 bit_4.0.4 Rgraphviz_2.40.0
[169] class_7.3-20 stringi_1.7.8 blob_1.2.3
[172] latticeExtra_0.6-30 memoise_2.0.1 dplyr_1.0.9
[175] e1071_1.7-11 ape_5.6-2
</details>
Many thanks in advance,
Brian
It seems that the issue only occurs when the file is remote. Downloading the entire file first does technically circumvent the issue, though this approach isn't ideal:
URL <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"
tmp <- file.path(tempfile(),basename(URL))
dir.create(dirname(tmp),showWarnings = FALSE, recursive = TRUE)
utils::download.file(url = URL, destfile = tmp)
res <- rtracklayer::import(tmp)
@lawremi @sanchit-saini, this error is still persisting. Do you know what might be happening here?
@hpages I know you're not the primary maintainer for this repo, but I was wondering if you nevertheless have any ideas about this. Thank you!
Hi @bschilder , I didn't investigate much but this is how rtracklayer::import()
reads this remote BED file:
URL <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"
con1 <- gzcon(url(URL, open="rb"), text=TRUE)
bedNames <- c("chrom", "start", "end", "name", "score", "strand", "thickStart", "thickEnd", "itemRgb")
bedClasses <- c("character", "integer", "integer", "character", "numeric", "character", "integer", "integer", "character")
bed1 <- read.table(con1, colClasses=bedClasses, as.is=TRUE, na.strings=".", comment.char="", sep="\t", quote="")
# Warning message:
# In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
# number of items read is not a multiple of the number of columns
colnames(bed1) <- bedNames
Major problem is that bed1
only has 1127 rows!
dim(bed1)
# [1] 1127 9
Minor problem is that the last row has an unexpected NA
in the thickEnd
column:
tail(bed1)
# chrom start end name score strand thickStart thickEnd
# 1122 chr10 8459800 8462200 5_TxWk 0 <NA> 8459800 8462200
# 1123 chr10 8462200 8462400 7_Enh 0 <NA> 8462200 8462400
# 1124 chr10 8462400 8477200 15_Quies 0 <NA> 8462400 8477200
# 1125 chr10 8477200 8483600 7_Enh 0 <NA> 8477200 8483600
# 1126 chr10 8483600 8491600 5_TxWk 0 <NA> 8483600 8491600
# 1127 chr10 8491600 8492800 7_Enh 0 <NA> 849160 NA
# itemRgb
# 1122 0,100,0
# 1123 255,255,0
# 1124 255,255,255
# 1125 255,255,0
# 1126 0,100,0
# 1127
Compare this to what rtracklayer::import()
does to read the file locally:
con2 <- gzfile("E099_15_coreMarks_dense.bed.bgz", open="r")
bed2 <- read.table(con2, colClasses=bedClasses, as.is=TRUE, na.strings=".", comment.char="", sep="\t", quote="")
colnames(bed2) <- bedNames
dim(bed2)
# [1] 265577 9
anyNA(bed2$thickEnd)
# [1] FALSE
So it looks to me that read.table()
is somehow broken on a connection obtained with gzcon(url(URL, open="rb"), text=TRUE)
.
Hi @hpages thanks for taking a look at this, this definitely gives me some clues.
I actually found the below approach does a decent job. Of course, this is just a temporary solution as it still has to download the whole file first, which is less efficient than making a query. But at least this approach consistently reads in all 265,577 rows , and can handle situations where there's some NAs.
path <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"
tmp <- file.path(tempdir(), basename(path))
## Only download it if you have to
if(!file.exists(tmp)){
utils::download.file(url = path,
destfile = tmp)
}
dat <- data.table::fread(text=readLines(con = tmp))
gr <- GenomicRanges::makeGRangesFromDataFrame(df = dat, seqnames.field = "V1", start.field = "V2", end.field = "V3", keep.extra.columns = TRUE)
Minor problem is that the last row has an unexpected NA in the thickEnd column:
rtracklayer
seems to get hung up on the NA
. This doesn't seem to be an issue with creating the GRanges
object manually (see above example), so i'm wondering if there's a way to have rtracklayer::import
simply throw a warning instead of an error? @sanchit-saini
i'm wondering if there's a way to have rtracklayer::import simply throw a warning instead of an error?
That doesn't solve the real problem which is that the remote file doesn't contain any NA so there shouldn't be one in the data.frame obtained by importing the file.
One workaround is to not try to import the remote file directly, but to download it to a temporary file first, like you did manually, except that this would need to be baked in rtracklayer::import()
.
FWIW I adopted that approach in GenomeInfoDb::getChromInfoFromNCBI()
last year (see https://github.com/Bioconductor/GenomeInfoDb/commit/58c8310ca45791c8ca5217df3598be9f23cb3168). Works a lot better now!
i'm wondering if there's a way to have rtracklayer::import simply throw a warning instead of an error?
That doesn't solve the real problem which is that the remote file doesn't contain any NA so there shouldn't be one in the data.frame obtained by importing the file.
Ah ok, I see what you mean. Very strange that the NA gets introduced somehow.
One workaround is to not try to import the remote file directly, but to download it to a temporary file first, like you did manually, except that this would need to be baked in
rtracklayer::import()
. FWIW I adopted that approach inGenomeInfoDb::getChromInfoFromNCBI()
last year (see Bioconductor/GenomeInfoDb@58c8310). Works a lot better now!
Nice! Thanks for sharing this. @sanchit-saini @lawremi something along these lines would be a great feature forrtracklayer