IRanges icon indicating copy to clipboard operation
IRanges copied to clipboard

On usings Views object effectivly

Open Roleren opened this issue 3 months ago • 0 comments

Given a set of transcripts and a RleList of coverages per chromosome, I currently see no easy way to actually get that coverage, code below is the best I could come up with.

So question is: If this is indeed the fastest way currently, then IRanges need to implement a more efficient internal getter instead of my lapply, if there is already a faster way I would like to know.

From my benchmark on large dataset getting the view 'views' takes 0.3 seconds, to get the final ranges 'r' object, takes 3 seconds, which is where optimization can be done. For example of my implementation that works on spliced transcripts for both strands with unique exon optimizer see: ORFik coverage

It is also interesting why getting coverage from Views is so much faster than direct RleList subsetting by GRanges (i.e. cov[gr]) for big data, I set my cutoff in ORFik to > 5000 transcripts to use View instead of RleList subsetting, any idea why ?

library(GenomicRanges)

cov <- RleList(chr1 = Rle(c(rep(0, 24), 5)), chr2 = Rle(c(rep(0, 24), rep(5, 24))))
# For simplicity do single exon transcripts and ignore strand seperation
tx <- GRangesList(tx1 = "chr1:15-25:+", tx2 = "chr2:15-40:+", tx3 = "chr1:10-25:+")

# Now get coverage:
uex_cvg <- RleList(rep(IntegerList(1), length(tx)))
old_tx_names <- names(tx)
names(tx) <- seq_along(tx) # Group index to reorder later
views <- Views(cov, unlist(tx))
names(views) <- NULL # Remove to avoid double names on unlist
names(views[[1]]) # Here you find index mapping # Tx 1 and 3


r <- do.call(c, lapply(views[lengths(views) > 0], RleList))
uex_cvg[as.integer(names(r))] <- r
names(uex_cvg) <- old_tx_names
uex_cvg # Final correct coverage per transcript

Roleren avatar Sep 29 '25 11:09 Roleren