plyranges
plyranges copied to clipboard
Deferred I/O
To summarise our discussion earlier here's what we're going to implement with regards to I/O (see also #15):
- [x]
read_bam
withpaired = TRUE
returns a grouped granges grouped by pair. - [x] implement
DeferredGRanges
class with a cache slot and an ops to allow for deferred reading of data inputs. - [x] implement specs for filtering and selection on the
DeferredRanges
according to @lawremi suggestion on Oct 17 for issue #1 - [x]
chop
functions - [ ] implement method for putting GRanges from wide to long form.
I've had a go of this now - see commit https://github.com/sa-lee/plyranges/commit/0fadb502928aa045fc24dbc5f4bb27d32cb804bd. I'm not sure if this is the kinda approach you were thinking, let me know.
Here's an example of reading a bam:
library(plyranges)
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
bam <- bam_file(fl) %>%
select_fields(qual, flag) %>%
filter_by_overlaps(GRanges("seq1", IRanges(1000, 1100)) %>%
read_bam()
bam
GRanges object with 161 ranges and 6 metadata columns:
seqnames ranges strand | qual flag which_label
<Rle> <IRanges> <Rle> | <PhredQuality> <integer> <Rle>
[1] seq1 [970, 1004] + | =8=====;=8...5==;=;.;25 99 seq1:1000-1100
[2] seq1 [971, 1005] + | <<<<<<<<<;...<<<<<-;<4; 163 seq1:1000-1100
[3] seq1 [972, 1006] + | <<<<<<<<<<...<<<.<<.<,< 99 seq1:1000-1100
[4] seq1 [973, 1007] + | <<<<<<7<<<...&<58<<6::: 163 seq1:1000-1100
[5] seq1 [974, 1008] + | <<<<<<<<<<...<<<<:<::8< 99 seq1:1000-1100
... ... ... ... . ... ... ...
[157] seq1 [1098, 1132] - | ;6<02;<<<<...<<<<<<<<;< 83 seq1:1000-1100
[158] seq1 [1099, 1133] - | &;972<;&<9...;<;;3<<3<< 83 seq1:1000-1100
[159] seq1 [1099, 1133] + | ;.;;;;;;;;...;73;;77777 163 seq1:1000-1100
[160] seq1 [1100, 1134] - | .5:,666<)<...<<<5<<7<<< 147 seq1:1000-1100
[161] seq1 [1100, 1134] + | <<<<<<<<<<...;<<*<<<<9< 99 seq1:1000-1100
cigar qwidth njunc
<character> <integer> <integer>
[1] 35M 35 0
[2] 35M 35 0
[3] 35M 35 0
[4] 35M 35 0
[5] 35M 35 0
... ... ... ...
[157] 35M 35 0
[158] 35M 35 0
[159] 35M 35 0
[160] 35M 35 0
[161] 35M 35 0
-------
seqinfo: 2 sequences from an unspecified genome
In the above example, the select and filter tags are modifying an environment that contains a BamFile
and scanBamParam
object, once read_bam is called the BAM file is actually read into memory.
I was imagining that the pipeline would begin with read_bam()
(a deferred read) and perform the read as soon as a getter is called (probably in the implementation of one our wrappers). The idea is that each getter will:
- Check whether cached data is available.
- If it is, return that.
- If it isn't, load the data, cache it and return it.
Setters can just construct and return a new, materialized GRanges, with the newly set value.
As an aside, I think we want to deemphasize stuff like bam_file()
because it's a different object type.
@lawremi I'm thinking of refactoring the DeferredGRanges (most likely in the next plyranges
release) to something where could extend the deferred reading to files other than BAMs - here's what that could look like:
First we have an abstract operator class that represents an input to a read_
function.
# abstract class to represent operations performed on a file
setClass("FileOperator",
contains = "VIRTUAL")
# to perform operations
setClass("BamFileOperator",
slots = c(
input = "BamFile",
param = "ScanBamParam",
paired = "logical"
),
contains = "FileOperator")
dplyr methods can be written at the FileOperator level to start building operations that will be performed prior to loading. Then the DeferredGenomicRanges class is defined as follows:
setClass("DeferredGenomicRanges",
slots = c(file_operation = "FileOperator"),
contains = "DelegatingGenomicRanges")
and dplyr methods for DeferredGenomicRanges are pushed to the FileOperator if the delegate is empty or just acts on the delegate if the data is loaded. What do you think?
Sounds reasonable. Would be nice to generalize that.
@lawremi been going through this again - what do you think should happen in this case:
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
result <- read_bam(fl) %>% filter(is_paired, !is_duplicate) %>% select(mapq, qual)
If you cache the data using the filter then you can't select on the cached data unless you reread the BAM again. This is not a problem once you perform some other operation like summarise as that will load the cache explicitly.
I'm not sure what you mean. You're saying that somehow data has been read already into the cache, but now the user is selecting additional fields that are not present in the cache? Sure, the simplest thing is to invalidate the cache, requiring another read. But filter()
does not always cache, right?
Yes that's what I'm saying. I've made it so filter doesn't always cache in the operator-class branch https://github.com/sa-lee/plyranges/commit/3746c5a12d8b7a60fafe556bcc3f04ee73f27792 but was thinking maybe its better to be explicit with filling the cache?
I guess the issue is when to override and invalidate the cache - I guess if the user supplies arguments that obviously refer to a file operation and are names that are not present in the the cache should be invalidated.
Yes, I agree, I think.
Been thinking about this one lately: is it worth considering moving the IO functions out to their own package and implementing them? We've talked about making all the genomic file readers completely deferred and lazy using the structure I've outlined in the Operators class and this seems like an orthogonal goal to plyranges which in my mind should only do data wrangling and not touch data import.
Makes sense.