GenomicFeatures icon indicating copy to clipboard operation
GenomicFeatures copied to clipboard

mapToTranscripts reports incorrect regions when supplied GRangesList containing some empty transcripts

Open kriemo opened this issue 2 years ago • 0 comments

I experienced an error similar to one reported on the support site. Incorrect annotations are returned when calling a VariantAnnotation function (locateVariants() with region = IntronVariants() ) that internally uses mapToTranscripts .

I believe this error is caused by passing a list containing some empty transcripts to the transcripts argument of mapToTranscripts(). This can be fixed by removing the empty transcripts, but perhaps it might be useful to signal an error to the user in this condition or handle this internally? As far as i can tell, the erroneous mapping occurs in the .orderElementsByTranscription function.

Here is a reprex using the example provided on the support site and an example of how .orderElementsByTranscription will reorder/duplicate transcripts in the presence of empty list elements.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(GenomicFeatures)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# generate a grl with some zero length entries
grl <- intronsByTranscript(txdb)
sum(elementNROWS(grl) == 0)
#> [1] 10556

gr <- GRanges(seqnames = "chr5",
              ranges = IRanges(start = 20298238,
                               end = 20298238))
genome(gr) <- "hg19"

tx <- mapToTranscripts(gr, grl)
tx
#> GRanges object with 1 range and 2 metadata columns:
#>       seqnames    ranges strand |     xHits transcriptsHits
#>          <Rle> <IRanges>  <Rle> | <integer>       <integer>
#>   [1]    19778    277333      - |         1           19778
#>   -------
#>   seqinfo: 82960 sequences from an unspecified genome

## wrong transcript is mapped
grl[[seqnames(tx)]]
#> GRanges object with 3 ranges and 0 metadata columns:
#>       seqnames              ranges strand
#>          <Rle>           <IRanges>  <Rle>
#>   [1]     chr4 110610725-110612005      -
#>   [2]     chr4 110612166-110615680      -
#>   [3]     chr4 110615857-110624511      -
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome

## should map to last entry (14) in this transcript
grl[[subjectHits(findOverlaps(gr, grl))]]
#> GRanges object with 14 ranges and 0 metadata columns:
#>        seqnames            ranges strand
#>           <Rle>         <IRanges>  <Rle>
#>    [1]     chr5 19473826-19483409      -
#>    [2]     chr5 19483662-19503100      -
#>    [3]     chr5 19503219-19520765      -
#>    [4]     chr5 19520888-19543977      -
#>    [5]     chr5 19544115-19571687      -
#>    ...      ...               ...    ...
#>   [10]     chr5 19747346-19838867      -
#>   [11]     chr5 19839352-19981168      -
#>   [12]     chr5 19981288-19991981      -
#>   [13]     chr5 19992124-20255552      -
#>   [14]     chr5 20255615-20575570      -
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome

## removing empty entries returns correct transcript
grl_no_zero  <- grl[elementNROWS(grl) > 0]
tx2 <- mapToTranscripts(gr, grl_no_zero)
grl_no_zero[[seqnames(tx2)]]
#> GRanges object with 14 ranges and 0 metadata columns:
#>        seqnames            ranges strand
#>           <Rle>         <IRanges>  <Rle>
#>    [1]     chr5 19473826-19483409      -
#>    [2]     chr5 19483662-19503100      -
#>    [3]     chr5 19503219-19520765      -
#>    [4]     chr5 19520888-19543977      -
#>    [5]     chr5 19544115-19571687      -
#>    ...      ...               ...    ...
#>   [10]     chr5 19747346-19838867      -
#>   [11]     chr5 19839352-19981168      -
#>   [12]     chr5 19981288-19991981      -
#>   [13]     chr5 19992124-20255552      -
#>   [14]     chr5 20255615-20575570      -
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome

## looks like .orderElementsByTranscription changes the order of elements 
## in the `grl` list when zero-length elements are in grl?
## e.g. the transcript in grl[[7]] != the transcript in ord_grl[[7]]
ord_grl <- GenomicFeatures:::.orderElementsByTranscription(grl)

# list with 1 empty entry 
grl[c(3,4,7)]
#> GRangesList object of length 3:
#> $`3`
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames      ranges strand
#>          <Rle>   <IRanges>  <Rle>
#>   [1]     chr1 12228-12645      +
#>   [2]     chr1 12698-13220      +
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome
#> 
#> $`4`
#> GRanges object with 0 ranges and 0 metadata columns:
#>    seqnames    ranges strand
#>       <Rle> <IRanges>  <Rle>
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome
#> 
#> $`7`
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames        ranges strand
#>          <Rle>     <IRanges>  <Rle>
#>   [1]     chr1 322229-324287      +
#>   [2]     chr1 324346-324438      +
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome

# first entry is duplicated at position 3
GenomicFeatures:::.orderElementsByTranscription(grl[c(3,4,7)])
#> GRangesList object of length 3:
#> $`3`
#> GRanges object with 2 ranges and 1 metadata column:
#>       seqnames      ranges strand | unordered
#>          <Rle>   <IRanges>  <Rle> | <integer>
#>   [1]     chr1 12228-12645      + |         1
#>   [2]     chr1 12698-13220      + |         2
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome
#> 
#> $`4`
#> GRanges object with 0 ranges and 1 metadata column:
#>    seqnames    ranges strand | unordered
#>       <Rle> <IRanges>  <Rle> | <integer>
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome
#> 
#> $`7`
#> GRanges object with 2 ranges and 1 metadata column:
#>       seqnames      ranges strand | unordered
#>          <Rle>   <IRanges>  <Rle> | <integer>
#>   [1]     chr1 12228-12645      + |         1
#>   [2]     chr1 12698-13220      + |         2
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome

Created on 2023-01-12 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R Under development (unstable) (2022-10-30 r83209)
#>  os       macOS Monterey 12.2.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Denver
#>  date     2023-01-12
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package                           * version   date (UTC) lib source
#>  AnnotationDbi                     * 1.61.0    2022-11-01 [1] Bioconductor
#>  assertthat                          0.2.1     2019-03-21 [1] CRAN (R 4.3.0)
#>  Biobase                           * 2.59.0    2022-11-01 [1] Bioconductor
#>  BiocFileCache                       2.7.1     2022-12-09 [1] Bioconductor
#>  BiocGenerics                      * 0.45.0    2022-11-01 [1] Bioconductor
#>  BiocIO                              1.9.1     2022-11-24 [1] Bioconductor
#>  BiocParallel                        1.33.9    2022-12-23 [1] Bioconductor
#>  biomaRt                             2.55.0    2022-11-01 [1] Bioconductor
#>  Biostrings                          2.67.0    2022-11-01 [1] Bioconductor
#>  bit                                 4.0.5     2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64                               4.0.5     2020-08-30 [1] CRAN (R 4.3.0)
#>  bitops                              1.0-7     2021-04-24 [1] CRAN (R 4.3.0)
#>  blob                                1.2.3     2022-04-10 [1] CRAN (R 4.3.0)
#>  cachem                              1.0.6     2021-08-19 [1] CRAN (R 4.3.0)
#>  cli                                 3.6.0     2023-01-09 [1] CRAN (R 4.3.0)
#>  codetools                           0.2-18    2020-11-04 [1] CRAN (R 4.3.0)
#>  crayon                              1.5.2     2022-09-29 [1] CRAN (R 4.3.0)
#>  curl                                5.0.0     2023-01-12 [1] CRAN (R 4.3.0)
#>  DBI                                 1.1.3     2022-06-18 [1] CRAN (R 4.3.0)
#>  dbplyr                              2.2.1     2022-06-27 [1] CRAN (R 4.3.0)
#>  DelayedArray                        0.25.0    2022-11-01 [1] Bioconductor
#>  digest                              0.6.31    2022-12-11 [1] CRAN (R 4.3.0)
#>  dplyr                               1.0.10    2022-09-01 [1] CRAN (R 4.3.0)
#>  ellipsis                            0.3.2     2021-04-29 [1] CRAN (R 4.3.0)
#>  evaluate                            0.19      2022-12-13 [1] CRAN (R 4.3.0)
#>  fansi                               1.0.3     2022-03-24 [1] CRAN (R 4.3.0)
#>  fastmap                             1.1.0     2021-01-25 [1] CRAN (R 4.3.0)
#>  filelock                            1.0.2     2018-10-05 [1] CRAN (R 4.3.0)
#>  fs                                  1.5.2     2021-12-08 [1] CRAN (R 4.3.0)
#>  generics                            0.1.3     2022-07-05 [1] CRAN (R 4.3.0)
#>  GenomeInfoDb                      * 1.35.10   2023-01-03 [1] Bioconductor
#>  GenomeInfoDbData                    1.2.9     2022-11-08 [1] Bioconductor
#>  GenomicAlignments                   1.35.0    2022-11-01 [1] Bioconductor
#>  GenomicFeatures                   * 1.51.4    2022-12-23 [1] Bioconductor
#>  GenomicRanges                     * 1.51.4    2022-12-15 [1] Bioconductor
#>  glue                                1.6.2     2022-02-24 [1] CRAN (R 4.3.0)
#>  highr                               0.10      2022-12-22 [1] CRAN (R 4.3.0)
#>  hms                                 1.1.2     2022-08-19 [1] CRAN (R 4.3.0)
#>  htmltools                           0.5.4     2022-12-07 [1] CRAN (R 4.3.0)
#>  httr                                1.4.4     2022-08-17 [1] CRAN (R 4.3.0)
#>  IRanges                           * 2.33.0    2022-11-01 [1] Bioconductor
#>  KEGGREST                            1.39.0    2022-11-01 [1] Bioconductor
#>  knitr                               1.41      2022-11-18 [1] CRAN (R 4.3.0)
#>  lattice                             0.20-45   2021-09-22 [1] CRAN (R 4.3.0)
#>  lifecycle                           1.0.3     2022-10-07 [1] CRAN (R 4.3.0)
#>  magrittr                            2.0.3     2022-03-30 [1] CRAN (R 4.3.0)
#>  Matrix                              1.5-3     2022-11-11 [1] CRAN (R 4.3.0)
#>  MatrixGenerics                      1.11.0    2022-11-01 [1] Bioconductor
#>  matrixStats                         0.63.0    2022-11-18 [1] CRAN (R 4.3.0)
#>  memoise                             2.0.1     2021-11-26 [1] CRAN (R 4.3.0)
#>  pillar                              1.8.1     2022-08-19 [1] CRAN (R 4.3.0)
#>  pkgconfig                           2.0.3     2019-09-22 [1] CRAN (R 4.3.0)
#>  png                                 0.1-8     2022-11-29 [1] CRAN (R 4.3.0)
#>  prettyunits                         1.1.1     2020-01-24 [1] CRAN (R 4.3.0)
#>  progress                            1.2.2     2019-05-16 [1] CRAN (R 4.3.0)
#>  purrr                               1.0.1     2023-01-10 [1] CRAN (R 4.3.0)
#>  R.cache                             0.16.0    2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3                         1.8.2     2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo                                1.25.0    2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils                             2.12.2    2022-11-11 [1] CRAN (R 4.3.0)
#>  R6                                  2.5.1     2021-08-19 [1] CRAN (R 4.3.0)
#>  rappdirs                            0.3.3     2021-01-31 [1] CRAN (R 4.3.0)
#>  Rcpp                                1.0.9     2022-07-08 [1] CRAN (R 4.3.0)
#>  RCurl                               1.98-1.9  2022-10-03 [1] CRAN (R 4.3.0)
#>  reprex                              2.0.2     2022-08-17 [1] CRAN (R 4.3.0)
#>  restfulr                            0.0.15    2022-06-16 [1] CRAN (R 4.3.0)
#>  rjson                               0.2.21    2022-01-09 [1] CRAN (R 4.3.0)
#>  rlang                               1.0.6     2022-09-24 [1] CRAN (R 4.3.0)
#>  rmarkdown                           2.19      2022-12-15 [1] CRAN (R 4.3.0)
#>  Rsamtools                           2.15.1    2022-12-30 [1] Bioconductor
#>  RSQLite                             2.2.20    2022-12-22 [1] CRAN (R 4.3.0)
#>  rstudioapi                          0.14      2022-08-22 [1] CRAN (R 4.3.0)
#>  rtracklayer                         1.59.1    2022-12-27 [1] Bioconductor
#>  S4Vectors                         * 0.37.3    2022-12-07 [1] Bioconductor
#>  sessioninfo                         1.2.2     2021-12-06 [1] CRAN (R 4.3.0)
#>  stringi                             1.7.12    2023-01-11 [1] CRAN (R 4.3.0)
#>  stringr                             1.5.0     2022-12-02 [1] CRAN (R 4.3.0)
#>  styler                              1.8.1     2022-11-07 [1] CRAN (R 4.3.0)
#>  SummarizedExperiment                1.29.1    2022-11-04 [1] Bioconductor
#>  tibble                              3.1.8     2022-07-22 [1] CRAN (R 4.3.0)
#>  tidyselect                          1.2.0     2022-10-10 [1] CRAN (R 4.3.0)
#>  TxDb.Hsapiens.UCSC.hg19.knownGene * 3.2.2     2023-01-12 [1] Bioconductor
#>  utf8                                1.2.2     2021-07-24 [1] CRAN (R 4.3.0)
#>  vctrs                               0.5.1     2022-11-16 [1] CRAN (R 4.3.0)
#>  withr                               2.5.0     2022-03-03 [1] CRAN (R 4.3.0)
#>  xfun                                0.36      2022-12-21 [1] CRAN (R 4.3.0)
#>  XML                                 3.99-0.13 2022-12-04 [1] CRAN (R 4.3.0)
#>  xml2                                1.3.3     2021-11-30 [1] CRAN (R 4.3.0)
#>  XVector                             0.39.0    2022-11-01 [1] Bioconductor
#>  yaml                                2.3.6     2022-10-18 [1] CRAN (R 4.3.0)
#>  zlibbioc                            1.45.0    2022-11-01 [1] Bioconductor
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

kriemo avatar Jan 12 '23 23:01 kriemo