plyranges
parallelisation
from @lawremi:
It would be interesting to explore an API based on BiocParallel; something like:
Set the BiocParallelParam:
- parallelize(x, param)
Specify chunking:
- chunk_by(x, variable) # often seqnames
- chunk_count(x, count)
- chunk_size(x, size)
- chunk_by_overlap(x, ranges) # like tiles
That might help with our argument about scalability.
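For reference, chunking by seqnames can already be done by hand with existing GenomicRanges and BiocParallel calls. The sketch below is only an illustration of what a `chunk_by(x, seqnames)` step would presumably wrap; the toy ranges and the per-chunk summary are made up for the example, and `SerialParam()` stands in for whatever backend a user would pass to `parallelize()`.

```r
library(GenomicRanges)
library(BiocParallel)

# Toy ranges, invented for illustration.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(1, 10, 5), width = 5)
)

# Manual equivalent of chunking by seqnames: one GRanges per sequence.
chunks <- as.list(split(gr, seqnames(gr)))

# Apply a per-chunk summary over the chunks; swapping SerialParam()
# for another BiocParallelParam changes the backend, not the code.
widths <- bplapply(
  chunks,
  function(chunk) sum(width(chunk)),
  BPPARAM = SerialParam()
)
```

The proposed verbs would hide the `split()` and `bplapply()` calls behind the chunking and param settings.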
So I'm imagining the design of this to look fairly similar to (and maybe even inherit from) the GroupedGenomicRanges class, except with two additional slots:
- a param which is a BiocParallelParam object
- an iterator function
The parallelize function initializes the param slot and the chunk_by family initializes the iterator. The core verbs then dispatch with bpiterate or bplapply over the chunks.
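A minimal sketch of that design, under several assumptions: the class name `ChunkedRanges`, the helper `chunk_by_seqnames()`, and the verb `chunk_map()` are all hypothetical and not part of plyranges; the class holds a plain GRanges rather than inheriting from GroupedGenomicRanges; and the iterator slot stores a factory that returns a fresh iterator, so the chunks can be walked more than once.

```r
library(methods)
library(GenomicRanges)
library(BiocParallel)

# Hypothetical class: the ranges plus the two extra slots described above.
setClass("ChunkedRanges",
  slots = c(
    ranges = "GRanges",
    param = "BiocParallelParam",
    iterator = "function"  # assumption: a factory returning a fresh iterator
  )
)

# parallelize() initializes the param slot; the iterator starts out empty.
parallelize <- function(x, param = SerialParam()) {
  new("ChunkedRanges", ranges = x, param = param,
      iterator = function() function() NULL)
}

# One member of the chunk_by family: chunk on seqnames.
chunk_by_seqnames <- function(x) {
  gr <- x@ranges
  # Partition the row indices once, up front.
  index <- split(seq_along(gr), as.character(seqnames(gr)))
  x@iterator <- function() {
    i <- 0L
    function() {
      if (i >= length(index)) return(NULL)  # NULL signals exhaustion
      i <<- i + 1L
      gr[index[[i]]]
    }
  }
  x
}

# A stand-in for a core verb: apply FUN to every chunk via bpiterate().
chunk_map <- function(x, FUN) {
  bpiterate(x@iterator(), FUN, BPPARAM = x@param)
}

gr <- GRanges(c("chr1", "chr1", "chr2"),
              IRanges(start = c(1, 10, 5), width = 5))
chunked <- chunk_by_seqnames(parallelize(gr, SerialParam()))
counts <- chunk_map(chunked, function(chunk) length(chunk))
```

A real implementation would dispatch the existing verbs on the new class instead of adding a separate `chunk_map()`.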
Yea that sounds about right. The iterator might perform a vectorized operation up-front to find the partitions and then use those as it iterates.
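As a rough illustration of that idea for the `chunk_size` case: the helper below (a hypothetical name, written in base R) computes all chunk boundaries in one vectorized step and then only indexes into them on each call. It follows the convention `bpiterate()` expects from its `ITER` argument, a zero-argument function that returns the next chunk or `NULL` when exhausted, but the demo drains it by hand so it runs without any packages.

```r
# Hypothetical helper: iterate over x in chunks of at most `size` elements.
make_size_iterator <- function(x, size) {
  n <- length(x)
  # Vectorized up-front partitioning: all start/end positions at once.
  starts <- if (n > 0L) seq.int(1L, n, by = size) else integer(0)
  ends <- pmin(starts + size - 1L, n)
  i <- 0L
  function() {
    if (i >= length(starts)) return(NULL)  # no chunks left
    i <<- i + 1L
    x[starts[i]:ends[i]]
  }
}

# Drain the iterator manually to see the chunks it yields.
next_chunk <- make_size_iterator(letters[1:5], size = 2L)
chunks <- list()
while (!is.null(chunk <- next_chunk())) {
  chunks[[length(chunks) + 1L]] <- chunk
}
```

Because the helper only relies on `length()` and `[`, the same closure should work on a GRanges object as well as on a plain vector.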