plyranges icon indicating copy to clipboard operation
plyranges copied to clipboard

[BUG] stretch() not stretching $blocks in GRanges

Open AngelCampos opened this issue 4 years ago • 6 comments

Hello Stuart.

I want to use the stretch() function to extend the 5' UTR and 3'UTR ends of a GRanges object as loaded by using the plyranges::read_bed() function, but stretch() is not extending the $blocks IRanges contained in the GRanges.

Following is a reproducible example using the in-package data.

stretch() not stretching gr$blocks

Setup

library(plyranges)
library(GenomicRanges)
library(magrittr)
require(rtracklayer)

Reproducible example

Loading a bed file using read_bed()

test_path <- system.file("tests", package = "rtracklayer")
bed_file <- file.path(test_path, "test.bed")
gr <- read_bed(bed_file)

Extending 3 prime end works fine for GRanges main start and end coordinates.

newGR <- plyranges::stretch(plyranges::anchor_5p(gr), 10)

Total width is extended

width(gr)
## [1] 1167 1167 1167 1167 1167
width(newGR)
## [1] 1177 1177 1177 1177 1177

But blocks size remain the same

gr$blocks %>% lapply(width) %>% lapply(sum) %>% unlist
## [1]  600  750 1167 1167 1167
newGR$blocks %>% lapply(width) %>% lapply(sum) %>% unlist
## [1]  600  750 1167 1167 1167

Session info

options(width = 120)
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.0 (2020-04-24)
##  os       CentOS Linux 7 (Core)       
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Asia/Jerusalem              
##  date     2020-09-29                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package              * version  date       lib source        
##  assertthat             0.2.1    2019-03-21 [2] CRAN (R 4.0.0)
##  Biobase                2.48.0   2020-04-27 [2] Bioconductor  
##  BiocGenerics         * 0.34.0   2020-04-27 [2] Bioconductor  
##  BiocParallel           1.22.0   2020-04-27 [2] Bioconductor  
##  Biostrings             2.56.0   2020-04-27 [2] Bioconductor  
##  bitops                 1.0-6    2013-08-17 [2] CRAN (R 4.0.0)
##  cli                    2.0.2    2020-02-28 [2] CRAN (R 4.0.0)
##  crayon                 1.3.4    2017-09-16 [2] CRAN (R 4.0.0)
##  DelayedArray           0.14.1   2020-07-14 [2] Bioconductor  
##  digest                 0.6.25   2020-02-23 [2] CRAN (R 4.0.0)
##  dplyr                  1.0.2    2020-08-18 [2] CRAN (R 4.0.0)
##  ellipsis               0.3.1    2020-05-15 [2] CRAN (R 4.0.0)
##  evaluate               0.14     2019-05-28 [2] CRAN (R 4.0.0)
##  fansi                  0.4.1    2020-01-08 [2] CRAN (R 4.0.0)
##  generics               0.0.2    2018-11-29 [2] CRAN (R 4.0.0)
##  GenomeInfoDb         * 1.24.2   2020-06-15 [2] Bioconductor  
##  GenomeInfoDbData       1.2.3    2020-05-03 [2] Bioconductor  
##  GenomicAlignments      1.24.0   2020-04-27 [2] Bioconductor  
##  GenomicRanges        * 1.40.0   2020-04-27 [2] Bioconductor  
##  glue                   1.4.2    2020-08-27 [1] CRAN (R 4.0.0)
##  htmltools              0.5.0    2020-06-16 [2] CRAN (R 4.0.0)
##  IRanges              * 2.22.2   2020-05-21 [2] Bioconductor  
##  knitr                  1.29     2020-06-23 [2] CRAN (R 4.0.0)
##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.0)
##  lifecycle              0.2.0    2020-03-06 [2] CRAN (R 4.0.0)
##  magrittr             * 1.5      2014-11-22 [2] CRAN (R 4.0.0)
##  Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.0)
##  matrixStats            0.57.0   2020-09-25 [2] CRAN (R 4.0.0)
##  pillar                 1.4.6    2020-07-10 [2] CRAN (R 4.0.0)
##  pkgconfig              2.0.3    2019-09-22 [2] CRAN (R 4.0.0)
##  plyranges            * 1.8.0    2020-04-27 [1] Bioconductor  
##  purrr                  0.3.4    2020-04-17 [2] CRAN (R 4.0.0)
##  R6                     2.4.1    2019-11-12 [2] CRAN (R 4.0.0)
##  RCurl                  1.98-1.2 2020-04-18 [2] CRAN (R 4.0.0)
##  rlang                  0.4.7    2020-07-09 [2] CRAN (R 4.0.0)
##  rmarkdown              2.3      2020-06-18 [2] CRAN (R 4.0.0)
##  Rsamtools              2.4.0    2020-04-27 [2] Bioconductor  
##  rtracklayer          * 1.48.0   2020-04-27 [2] Bioconductor  
##  S4Vectors            * 0.26.1   2020-05-16 [2] Bioconductor  
##  sessioninfo            1.1.1    2018-11-05 [2] CRAN (R 4.0.0)
##  stringi                1.5.3    2020-09-09 [1] CRAN (R 4.0.0)
##  stringr                1.4.0    2019-02-10 [2] CRAN (R 4.0.0)
##  SummarizedExperiment   1.18.2   2020-07-09 [2] Bioconductor  
##  tibble                 3.0.3    2020-07-10 [2] CRAN (R 4.0.0)
##  tidyselect             1.1.0    2020-05-11 [2] CRAN (R 4.0.0)
##  vctrs                  0.3.4    2020-08-29 [1] CRAN (R 4.0.0)
##  withr                  2.3.0    2020-09-22 [2] CRAN (R 4.0.0)
##  xfun                   0.17     2020-09-09 [2] CRAN (R 4.0.0)
##  XML                    3.99-0.5 2020-07-23 [2] CRAN (R 4.0.0)
##  XVector                0.28.0   2020-04-27 [2] Bioconductor  
##  yaml                   2.2.1    2020-02-01 [2] CRAN (R 4.0.0)
##  zlibbioc               1.34.0   2020-04-27 [2] Bioconductor  
## 

--

Is there a function I'm missing that does what I am expecting (blocks size update)? Thanks for the help Best Miguel

AngelCampos avatar Sep 29 '20 20:09 AngelCampos

Hi Miguel,

So stretch() has no knowledge that there's an additional column called blocks in your GRanges.

How are you expecting blocks to be updated? Are you expecting the total width of the block to be extended by 10?

sa-lee avatar Sep 30 '20 01:09 sa-lee

Here's a few ways you could do this:

suppressPackageStartupMessages(library(plyranges))

test_path <- system.file("tests", package = "rtracklayer")
bed_file <- file.path(test_path, "test.bed")
gr <- read_bed(bed_file)

newGR <- stretch(anchor_5p(gr), 10)

# add stretch amount to each element by the number of blocks, 
# naive if not completely divisible (off by one)
ans <- newGR %>% 
  mutate(blocks = resize(blocks, width(blocks) + 10 / lengths(blocks))) 

ans
#> GRanges object with 5 ranges and 5 metadata columns:
#>       seqnames              ranges strand |        name     score     itemRgb
#>          <Rle>           <IRanges>  <Rle> | <character> <numeric> <character>
#>   [1]     chr7 127471197-127472373      + |        Pos1         0     #FF0000
#>   [2]     chr7 127472364-127473540      + |        Pos2         2     #FF0000
#>   [3]     chr7 127473521-127474697      - |        Neg1         0     #FF0000
#>   [4]     chr9 127474698-127475874      + |        Pos3         5     #FF0000
#>   [5]     chr9 127475855-127477031      - |        Neg2         5     #0000FF
#>                     thick                  blocks
#>                 <IRanges>           <IRangesList>
#>   [1] 127471197-127472363 1-303,501-703,1068-1170
#>   [2] 127472364-127473530          1-255,668-1172
#>   [3] 127473531-127474697                  1-1177
#>   [4] 127474698-127475864                  1-1177
#>   [5] 127475865-127477031                  1-1177
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
width(ans)
#> [1] 1177 1177 1177 1177 1177
sum(width(ans$blocks))
#> [1]  609  760 1177 1177 1177

even_block_strech <- function(blocks, extend) {
    part <- PartitioningByEnd(blocks)
    lens <- width(part)
    div <- extend %/% lens
    rem <- extend %% lens
    extends <- rep(div, times = lens)
    extends[end(part)] <- extends[end(part)] + rem
    # unlist and resize
    collapsed <- unlist(blocks)
    collapsed <- resize(collapsed,  width(collapsed) + extends)
    relist(collapsed, part)
}

ans <- newGR %>% 
  mutate(blocks = even_block_strech(blocks, 10))

ans
#> GRanges object with 5 ranges and 5 metadata columns:
#>       seqnames              ranges strand |        name     score     itemRgb
#>          <Rle>           <IRanges>  <Rle> | <character> <numeric> <character>
#>   [1]     chr7 127471197-127472373      + |        Pos1         0     #FF0000
#>   [2]     chr7 127472364-127473540      + |        Pos2         2     #FF0000
#>   [3]     chr7 127473521-127474697      - |        Neg1         0     #FF0000
#>   [4]     chr9 127474698-127475874      + |        Pos3         5     #FF0000
#>   [5]     chr9 127475855-127477031      - |        Neg2         5     #0000FF
#>                     thick                  blocks
#>                 <IRanges>           <IRangesList>
#>   [1] 127471197-127472363 1-303,501-703,1068-1171
#>   [2] 127472364-127473530          1-255,668-1172
#>   [3] 127473531-127474697                  1-1177
#>   [4] 127474698-127475864                  1-1177
#>   [5] 127475865-127477031                  1-1177
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
width(ans)
#> [1] 1177 1177 1177 1177 1177
sum(width(ans$blocks))
#> [1]  610  760 1177 1177 1177

last_block_stretch <- function(blocks, extend) {
  part <- PartitioningByEnd(blocks)
  zeros <- rep_len(0L, sum(width(part)))
  zeros[end(part)] <- extend
  # unlist and resize
  collapsed <- unlist(blocks)
  collapsed <- resize(collapsed,  width(collapsed) + zeros)
  relist(collapsed, part)
}

ans <- newGR %>% 
  mutate(blocks = last_block_strech(blocks, 10))
#> Error in last_block_strech(blocks, 10): could not find function "last_block_strech"

ans
#> GRanges object with 5 ranges and 5 metadata columns:
#>       seqnames              ranges strand |        name     score     itemRgb
#>          <Rle>           <IRanges>  <Rle> | <character> <numeric> <character>
#>   [1]     chr7 127471197-127472373      + |        Pos1         0     #FF0000
#>   [2]     chr7 127472364-127473540      + |        Pos2         2     #FF0000
#>   [3]     chr7 127473521-127474697      - |        Neg1         0     #FF0000
#>   [4]     chr9 127474698-127475874      + |        Pos3         5     #FF0000
#>   [5]     chr9 127475855-127477031      - |        Neg2         5     #0000FF
#>                     thick                  blocks
#>                 <IRanges>           <IRangesList>
#>   [1] 127471197-127472363 1-303,501-703,1068-1171
#>   [2] 127472364-127473530          1-255,668-1172
#>   [3] 127473531-127474697                  1-1177
#>   [4] 127474698-127475864                  1-1177
#>   [5] 127475865-127477031                  1-1177
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
width(ans)
#> [1] 1177 1177 1177 1177 1177
sum(width(ans$blocks))
#> [1]  610  760 1177 1177 1177

Created on 2020-09-30 by the reprex package (v0.3.0)

sa-lee avatar Sep 30 '20 02:09 sa-lee

Let me know if that's what you had in mind, and I'll close the issue.

sa-lee avatar Sep 30 '20 02:09 sa-lee

Sorry I didn't explain the expected outcome.

I'm not expecting an even stretch, nor the last block without considering the GRange strand. I was expecting only the blocks containing the 3' UTR to be extended. For the case of positive-strand GRanges that would be the last block, while for negative-stranded it will be the first. The complement function would be to extend the 5' UTR, in which the opposite ending blocks are extended along with the GRanges.

If $blocks awareness is not expected to be added to stretch() in future plyranges versions, I think your answer will already help me to do what I want. It's ok if you close the issue 👍

Thanks for your speedy follow-up.

AngelCampos avatar Sep 30 '20 12:09 AngelCampos

Ok that makes way more sense! So I think if this is urgent, you could modify the last_block_stretch function I made above. Thinking about an interface to this, the easiest way would probably be to modify stretch to take a new arg called block which will be a bare variable referencing the list colum. This way we could still use the anchoring concept:

# extend ranges in gr and the block column by 5' UTR
gr %>%
  anchor_5p() %>%
  stretch(block = block, extend = 10)

# extend ranges in gr and the block column by 3' UTR
gr %>%
  anchor_3p() %>%
  stretch(block = block, extend = 10)

# only stretch the ranges in gr, default will be block = NULL
gr %>%
  stretch(extend = 10)

What do you think?

sa-lee avatar Sep 30 '20 23:09 sa-lee

Sounds good, although not sure if it will be general enough, especially without considering the use of the anchor_3p() or anchor_5p() functions. If anchoring is not considered, even stretching of blocks will most probably give unexpected outcomes. Perhaps a specialized function will work better, in this case.

AngelCampos avatar Oct 09 '20 18:10 AngelCampos