plyranges
plyranges copied to clipboard
[BUG] stretch() not stretching $blocks in GRanges
Hello Stuart.
I want to use the stretch() function to extend the 5' UTR and 3'UTR ends of a GRanges object as loaded by using the plyranges::read_bed() function, but stretch() is not extending the $blocks IRanges contained in the GRanges.
Following is a reproducible example using the in-package data.
stretch() not stretching gr$blocks
Setup
library(plyranges)
library(GenomicRanges)
library(magrittr)
require(rtracklayer)
Reproducible example
Loading a bed file using read_bed()
test_path <- system.file("tests", package = "rtracklayer")
bed_file <- file.path(test_path, "test.bed")
gr <- read_bed(bed_file)
Extending 3 prime end works fine for GRanges main start and end coordinates.
newGR <- plyranges::stretch(plyranges::anchor_5p(gr), 10)
Total width is extended
width(gr)
## [1] 1167 1167 1167 1167 1167
width(newGR)
## [1] 1177 1177 1177 1177 1177
But blocks size remain the same
gr$blocks %>% lapply(width) %>% lapply(sum) %>% unlist
## [1] 600 750 1167 1167 1167
newGR$blocks %>% lapply(width) %>% lapply(sum) %>% unlist
## [1] 600 750 1167 1167 1167
Session info
options(width = 120)
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.0 (2020-04-24)
## os CentOS Linux 7 (Core)
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Asia/Jerusalem
## date 2020-09-29
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.0.0)
## Biobase 2.48.0 2020-04-27 [2] Bioconductor
## BiocGenerics * 0.34.0 2020-04-27 [2] Bioconductor
## BiocParallel 1.22.0 2020-04-27 [2] Bioconductor
## Biostrings 2.56.0 2020-04-27 [2] Bioconductor
## bitops 1.0-6 2013-08-17 [2] CRAN (R 4.0.0)
## cli 2.0.2 2020-02-28 [2] CRAN (R 4.0.0)
## crayon 1.3.4 2017-09-16 [2] CRAN (R 4.0.0)
## DelayedArray 0.14.1 2020-07-14 [2] Bioconductor
## digest 0.6.25 2020-02-23 [2] CRAN (R 4.0.0)
## dplyr 1.0.2 2020-08-18 [2] CRAN (R 4.0.0)
## ellipsis 0.3.1 2020-05-15 [2] CRAN (R 4.0.0)
## evaluate 0.14 2019-05-28 [2] CRAN (R 4.0.0)
## fansi 0.4.1 2020-01-08 [2] CRAN (R 4.0.0)
## generics 0.0.2 2018-11-29 [2] CRAN (R 4.0.0)
## GenomeInfoDb * 1.24.2 2020-06-15 [2] Bioconductor
## GenomeInfoDbData 1.2.3 2020-05-03 [2] Bioconductor
## GenomicAlignments 1.24.0 2020-04-27 [2] Bioconductor
## GenomicRanges * 1.40.0 2020-04-27 [2] Bioconductor
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.0)
## htmltools 0.5.0 2020-06-16 [2] CRAN (R 4.0.0)
## IRanges * 2.22.2 2020-05-21 [2] Bioconductor
## knitr 1.29 2020-06-23 [2] CRAN (R 4.0.0)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.0)
## lifecycle 0.2.0 2020-03-06 [2] CRAN (R 4.0.0)
## magrittr * 1.5 2014-11-22 [2] CRAN (R 4.0.0)
## Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.0)
## matrixStats 0.57.0 2020-09-25 [2] CRAN (R 4.0.0)
## pillar 1.4.6 2020-07-10 [2] CRAN (R 4.0.0)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.0.0)
## plyranges * 1.8.0 2020-04-27 [1] Bioconductor
## purrr 0.3.4 2020-04-17 [2] CRAN (R 4.0.0)
## R6 2.4.1 2019-11-12 [2] CRAN (R 4.0.0)
## RCurl 1.98-1.2 2020-04-18 [2] CRAN (R 4.0.0)
## rlang 0.4.7 2020-07-09 [2] CRAN (R 4.0.0)
## rmarkdown 2.3 2020-06-18 [2] CRAN (R 4.0.0)
## Rsamtools 2.4.0 2020-04-27 [2] Bioconductor
## rtracklayer * 1.48.0 2020-04-27 [2] Bioconductor
## S4Vectors * 0.26.1 2020-05-16 [2] Bioconductor
## sessioninfo 1.1.1 2018-11-05 [2] CRAN (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.0)
## stringr 1.4.0 2019-02-10 [2] CRAN (R 4.0.0)
## SummarizedExperiment 1.18.2 2020-07-09 [2] Bioconductor
## tibble 3.0.3 2020-07-10 [2] CRAN (R 4.0.0)
## tidyselect 1.1.0 2020-05-11 [2] CRAN (R 4.0.0)
## vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.0)
## withr 2.3.0 2020-09-22 [2] CRAN (R 4.0.0)
## xfun 0.17 2020-09-09 [2] CRAN (R 4.0.0)
## XML 3.99-0.5 2020-07-23 [2] CRAN (R 4.0.0)
## XVector 0.28.0 2020-04-27 [2] Bioconductor
## yaml 2.2.1 2020-02-01 [2] CRAN (R 4.0.0)
## zlibbioc 1.34.0 2020-04-27 [2] Bioconductor
##
--
Is there a function I'm missing that does what I am expecting (blocks size update)? Thanks for the help Best Miguel
Hi Miguel,
So stretch()
has no knowledge that there's an additional column called blocks
in your GRanges
.
How are you expecting blocks to be updated? Are you expecting the total width of the block to be extended by 10?
Here's a few ways you could do this:
suppressPackageStartupMessages(library(plyranges))
test_path <- system.file("tests", package = "rtracklayer")
bed_file <- file.path(test_path, "test.bed")
gr <- read_bed(bed_file)
newGR <- stretch(anchor_5p(gr), 10)
# add stretch amount to each element by the number of blocks,
# naive if not completely divisible (off by one)
ans <- newGR %>%
mutate(blocks = resize(blocks, width(blocks) + 10 / lengths(blocks)))
ans
#> GRanges object with 5 ranges and 5 metadata columns:
#> seqnames ranges strand | name score itemRgb
#> <Rle> <IRanges> <Rle> | <character> <numeric> <character>
#> [1] chr7 127471197-127472373 + | Pos1 0 #FF0000
#> [2] chr7 127472364-127473540 + | Pos2 2 #FF0000
#> [3] chr7 127473521-127474697 - | Neg1 0 #FF0000
#> [4] chr9 127474698-127475874 + | Pos3 5 #FF0000
#> [5] chr9 127475855-127477031 - | Neg2 5 #0000FF
#> thick blocks
#> <IRanges> <IRangesList>
#> [1] 127471197-127472363 1-303,501-703,1068-1170
#> [2] 127472364-127473530 1-255,668-1172
#> [3] 127473531-127474697 1-1177
#> [4] 127474698-127475864 1-1177
#> [5] 127475865-127477031 1-1177
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
width(ans)
#> [1] 1177 1177 1177 1177 1177
sum(width(ans$blocks))
#> [1] 609 760 1177 1177 1177
even_block_strech <- function(blocks, extend) {
part <- PartitioningByEnd(blocks)
lens <- width(part)
div <- extend %/% lens
rem <- extend %% lens
extends <- rep(div, times = lens)
extends[end(part)] <- extends[end(part)] + rem
# unlist and resize
collapsed <- unlist(blocks)
collapsed <- resize(collapsed, width(collapsed) + extends)
relist(collapsed, part)
}
ans <- newGR %>%
mutate(blocks = even_block_strech(blocks, 10))
ans
#> GRanges object with 5 ranges and 5 metadata columns:
#> seqnames ranges strand | name score itemRgb
#> <Rle> <IRanges> <Rle> | <character> <numeric> <character>
#> [1] chr7 127471197-127472373 + | Pos1 0 #FF0000
#> [2] chr7 127472364-127473540 + | Pos2 2 #FF0000
#> [3] chr7 127473521-127474697 - | Neg1 0 #FF0000
#> [4] chr9 127474698-127475874 + | Pos3 5 #FF0000
#> [5] chr9 127475855-127477031 - | Neg2 5 #0000FF
#> thick blocks
#> <IRanges> <IRangesList>
#> [1] 127471197-127472363 1-303,501-703,1068-1171
#> [2] 127472364-127473530 1-255,668-1172
#> [3] 127473531-127474697 1-1177
#> [4] 127474698-127475864 1-1177
#> [5] 127475865-127477031 1-1177
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
width(ans)
#> [1] 1177 1177 1177 1177 1177
sum(width(ans$blocks))
#> [1] 610 760 1177 1177 1177
last_block_stretch <- function(blocks, extend) {
part <- PartitioningByEnd(blocks)
zeros <- rep_len(0L, sum(width(part)))
zeros[end(part)] <- extend
# unlist and resize
collapsed <- unlist(blocks)
collapsed <- resize(collapsed, width(collapsed) + zeros)
relist(collapsed, part)
}
ans <- newGR %>%
mutate(blocks = last_block_strech(blocks, 10))
#> Error in last_block_strech(blocks, 10): could not find function "last_block_strech"
ans
#> GRanges object with 5 ranges and 5 metadata columns:
#> seqnames ranges strand | name score itemRgb
#> <Rle> <IRanges> <Rle> | <character> <numeric> <character>
#> [1] chr7 127471197-127472373 + | Pos1 0 #FF0000
#> [2] chr7 127472364-127473540 + | Pos2 2 #FF0000
#> [3] chr7 127473521-127474697 - | Neg1 0 #FF0000
#> [4] chr9 127474698-127475874 + | Pos3 5 #FF0000
#> [5] chr9 127475855-127477031 - | Neg2 5 #0000FF
#> thick blocks
#> <IRanges> <IRangesList>
#> [1] 127471197-127472363 1-303,501-703,1068-1171
#> [2] 127472364-127473530 1-255,668-1172
#> [3] 127473531-127474697 1-1177
#> [4] 127474698-127475864 1-1177
#> [5] 127475865-127477031 1-1177
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
width(ans)
#> [1] 1177 1177 1177 1177 1177
sum(width(ans$blocks))
#> [1] 610 760 1177 1177 1177
Created on 2020-09-30 by the reprex package (v0.3.0)
Let me know if that's what you had in mind, and I'll close the issue.
Sorry I didn't explain the expected outcome.
I'm not expecting an even stretch, nor the last block without considering the GRange strand. I was expecting only the blocks containing the 3' UTR to be extended. For the case of positive-strand GRanges that would be the last block, while for negative-stranded it will be the first. The complement function would be to extend the 5' UTR, in which the opposite ending blocks are extended along with the GRanges.
If $blocks awareness is not expected to be added to stretch() in future plyranges versions, I think your answer will already help me to do what I want. It's ok if you close the issue 👍
Thanks for your speedy follow-up.
Ok that makes way more sense! So I think if this is urgent, you could modify the last_block_stretch
function I made above. Thinking about an interface to this, the easiest way would probably be to modify stretch to take a new arg called block which will be a bare variable referencing the list colum. This way we could still use the anchoring concept:
# extend ranges in gr and the block column by 5' UTR
gr %>%
anchor_5p() %>%
stretch(block = block, extend = 10)
# extend ranges in gr and the block column by 3' UTR
gr %>%
anchor_3p() %>%
stretch(block = block, extend = 10)
# only stretch the ranges in gr, default will be block = NULL
gr %>%
stretch(extend = 10)
What do you think?
Sounds good, although not sure if it will be general enough, especially without considering the use of the anchor_3p() or anchor_5p() functions. If anchoring is not considered, even stretching of blocks will most probably give unexpected outcomes. Perhaps a specialized function will work better, in this case.