plyranges icon indicating copy to clipboard operation
plyranges copied to clipboard

Deferred I/O

Open sa-lee opened this issue 7 years ago • 11 comments

To summarise our discussion earlier here's what we're going to implement with regards to I/O (see also #15):

  • [x] read_bam with paired = TRUE returns a grouped granges grouped by pair.
  • [x] implement DeferredGRanges class with a cache slot and an ops to allow for deferred reading of data inputs.
  • [x] implement specs for filtering and selection on the DeferredRanges according to @lawremi suggestion on Oct 17 for issue #1
  • [x] chop functions
  • [ ] implement method for putting GRanges from wide to long form.

sa-lee avatar Dec 04 '17 02:12 sa-lee

I've had a go of this now - see commit https://github.com/sa-lee/plyranges/commit/0fadb502928aa045fc24dbc5f4bb27d32cb804bd. I'm not sure if this is the kinda approach you were thinking, let me know.

Here's an example of reading a bam:

library(plyranges)
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
bam <- bam_file(fl) %>%
   select_fields(qual, flag) %>%
   filter_by_overlaps(GRanges("seq1", IRanges(1000, 1100)) %>%
   read_bam()

bam
GRanges object with 161 ranges and 6 metadata columns:
        seqnames       ranges strand |                    qual      flag    which_label
           <Rle>    <IRanges>  <Rle> |          <PhredQuality> <integer>          <Rle>
    [1]     seq1  [970, 1004]      + | =8=====;=8...5==;=;.;25        99 seq1:1000-1100
    [2]     seq1  [971, 1005]      + | <<<<<<<<<;...<<<<<-;<4;       163 seq1:1000-1100
    [3]     seq1  [972, 1006]      + | <<<<<<<<<<...<<<.<<.<,<        99 seq1:1000-1100
    [4]     seq1  [973, 1007]      + | <<<<<<7<<<...&<58<<6:::       163 seq1:1000-1100
    [5]     seq1  [974, 1008]      + | <<<<<<<<<<...<<<<:<::8<        99 seq1:1000-1100
    ...      ...          ...    ... .                     ...       ...            ...
  [157]     seq1 [1098, 1132]      - | ;6<02;<<<<...<<<<<<<<;<        83 seq1:1000-1100
  [158]     seq1 [1099, 1133]      - | &;972<;&<9...;<;;3<<3<<        83 seq1:1000-1100
  [159]     seq1 [1099, 1133]      + | ;.;;;;;;;;...;73;;77777       163 seq1:1000-1100
  [160]     seq1 [1100, 1134]      - | .5:,666<)<...<<<5<<7<<<       147 seq1:1000-1100
  [161]     seq1 [1100, 1134]      + | <<<<<<<<<<...;<<*<<<<9<        99 seq1:1000-1100
              cigar    qwidth     njunc
        <character> <integer> <integer>
    [1]         35M        35         0
    [2]         35M        35         0
    [3]         35M        35         0
    [4]         35M        35         0
    [5]         35M        35         0
    ...         ...       ...       ...
  [157]         35M        35         0
  [158]         35M        35         0
  [159]         35M        35         0
  [160]         35M        35         0
  [161]         35M        35         0
  -------
  seqinfo: 2 sequences from an unspecified genome

In the above example, the select and filter tags are modifying an environment that contains a BamFile and scanBamParam object, once read_bam is called the BAM file is actually read into memory.

sa-lee avatar Dec 13 '17 03:12 sa-lee

I was imagining that the pipeline would begin with read_bam() (a deferred read) and perform the read as soon as a getter is called (probably in the implementation of one our wrappers). The idea is that each getter will:

  1. Check whether cached data is available.
  2. If it is, return that.
  3. If it isn't, load the data, cache it and return it.

Setters can just construct and return a new, materialized GRanges, with the newly set value.

As an aside, I think we want to deemphasize stuff like bam_file() because it's a different object type.

lawremi avatar Dec 13 '17 03:12 lawremi

@lawremi I'm thinking of refactoring the DeferredGRanges (most likely in the next plyranges release) to something where could extend the deferred reading to files other than BAMs - here's what that could look like:

First we have an abstract operator class that represents an input to a read_ function.

# abstract class to represent operations performed on a file
setClass("FileOperator", 
         contains = "VIRTUAL")

# to perform operations 
setClass("BamFileOperator", 
         slots = c(
            input = "BamFile",
           param = "ScanBamParam",
           paired = "logical"
         ),
         contains = "FileOperator")

dplyr methods can be written at the FileOperator level to start building operations that will be performed prior to loading. Then the DeferredGenomicRanges class is defined as follows:

setClass("DeferredGenomicRanges",
         slots = c(file_operation = "FileOperator"),
         contains = "DelegatingGenomicRanges")

and dplyr methods for DeferredGenomicRanges are pushed to the FileOperator if the delegate is empty or just acts on the delegate if the data is loaded. What do you think?

sa-lee avatar Apr 09 '18 06:04 sa-lee

Sounds reasonable. Would be nice to generalize that.

lawremi avatar Apr 09 '18 14:04 lawremi

@lawremi been going through this again - what do you think should happen in this case:

fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
result <- read_bam(fl) %>% filter(is_paired, !is_duplicate) %>% select(mapq, qual)

If you cache the data using the filter then you can't select on the cached data unless you reread the BAM again. This is not a problem once you perform some other operation like summarise as that will load the cache explicitly.

sa-lee avatar May 09 '18 01:05 sa-lee

I'm not sure what you mean. You're saying that somehow data has been read already into the cache, but now the user is selecting additional fields that are not present in the cache? Sure, the simplest thing is to invalidate the cache, requiring another read. But filter() does not always cache, right?

lawremi avatar May 09 '18 04:05 lawremi

Yes that's what I'm saying. I've made it so filter doesn't always cache in the operator-class branch https://github.com/sa-lee/plyranges/commit/3746c5a12d8b7a60fafe556bcc3f04ee73f27792 but was thinking maybe its better to be explicit with filling the cache?

sa-lee avatar May 09 '18 04:05 sa-lee

I guess the issue is when to override and invalidate the cache - I guess if the user supplies arguments that obviously refer to a file operation and are names that are not present in the the cache should be invalidated.

sa-lee avatar May 09 '18 06:05 sa-lee

Yes, I agree, I think.

lawremi avatar May 09 '18 13:05 lawremi

Been thinking about this one lately: is it worth considering moving the IO functions out to their own package and implementing them? We've talked about making all the genomic file readers completely deferred and lazy using the structure I've outlined in the Operators class and this seems like an orthogonal goal to plyranges which in my mind should only do data wrangling and not touch data import.

sa-lee avatar Aug 07 '18 21:08 sa-lee

Makes sense.

lawremi avatar Aug 07 '18 21:08 lawremi