split/apply/combine paradigm
I'd like to get started on this one and use this tracker to collect and discuss ideas.
AFAIR @lawremi suggested back in September using split/by (split), bp*apply (apply) and stack (combine).
I'm rather unsure what functionality is needed. Usually I'm fine with split, bplapply and l*ply/Reduce.
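For readers following along, the split / bplapply / combine workflow mentioned above could look roughly like the sketch below. The toy data frame and the `summarise_piece` helper are made up for illustration; the code falls back to base `lapply()` when BiocParallel is not installed, and uses `SerialParam()` only to keep the example deterministic.

```r
# Toy data: a grouping column and a numeric column (hypothetical example).
df <- data.frame(
  group = c("a", "b", "a", "c", "b", "a"),
  value = c(1, 2, 3, 4, 5, 6)
)

# Split: one data.frame per group.
pieces <- split(df, df$group)

# Apply: summarise each piece (summarise_piece is an illustrative helper).
summarise_piece <- function(piece) {
  data.frame(group = piece$group[1], n = nrow(piece), mean = mean(piece$value))
}
if (requireNamespace("BiocParallel", quietly = TRUE)) {
  # bplapply() takes the same X/FUN arguments as lapply(), plus a backend.
  results <- BiocParallel::bplapply(pieces, summarise_piece,
                                    BPPARAM = BiocParallel::SerialParam())
} else {
  results <- lapply(pieces, summarise_piece)
}

# Combine: bind the per-group rows back into a single data.frame.
combined <- do.call(rbind, results)
```

Swapping `SerialParam()` for another backend would change where the apply step runs without touching the split or combine steps, which is roughly the point being discussed here.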
split/apply/combine is a nice mental model, but maybe it does not need explicit representation in code. Another direction is thinking about faster ways to iterate, i.e., can we form partitions of data more efficiently? The data.table package has some interesting approaches.
Michael
(In reply to Michel's post of Thu, Dec 12, 2013, 2:42 AM.)
When I need to do a split-apply-combine type of operation, I usually turn to plyr::ddply.
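As a rough illustration of the `plyr::ddply` style referred to here, the sketch below computes per-group summaries in one call. The data frame is invented for the example, and the base R branch is only a stand-in so the snippet still runs when plyr is not installed.

```r
# Toy data (hypothetical example).
df <- data.frame(
  group = c("a", "b", "a", "c", "b", "a"),
  value = c(1, 2, 3, 4, 5, 6)
)

if (requireNamespace("plyr", quietly = TRUE)) {
  # ddply() splits df by "group", applies summarise() to each piece,
  # and combines the results into one data.frame.
  out <- plyr::ddply(df, "group", plyr::summarise,
                     n = length(value), mean = mean(value))
} else {
  # Base R stand-in doing the same three steps explicitly.
  out <- do.call(rbind, lapply(split(df, df$group), function(p) {
    data.frame(group = p$group[1], n = nrow(p), mean = mean(p$value))
  }))
  rownames(out) <- NULL
}
```

The single call hides the split and combine steps entirely, which is presumably what makes it attractive as a model for a parallel frontend.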
Yes, that's a useful tool. Would be nice to have a similar API on top of BiocParallel (and thus BatchJobs). We worked toward making aggregate() behave that way through omission of the LHS, but I think we ended up punting due to release deadlines. Also, we'd want it to be more generic, with support for e.g. GRanges. I rarely use a data.frame.
(In reply to Ryan Thompson's comment of Fri, Dec 13, 2013, 5:39 PM.)
Any opposition to closing this issue?
This issue sort of depends on having a clean API in base R for aggregation. Currently, aggregate and friends fall a bit short. Once we have that, then BiocParallel will need a corresponding frontend. Perhaps there is no need for a specific issue.
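For context, the existing base R interface looks roughly like the sketch below. It only shows the formula method of `aggregate()` on a made-up data.frame and does not try to demonstrate the shortcomings alluded to above.

```r
# Toy data (hypothetical example).
df <- data.frame(
  group = c("a", "b", "a", "c", "b", "a"),
  value = c(1, 2, 3, 4, 5, 6)
)

# Formula method: summarise `value` within each level of `group`.
agg <- aggregate(value ~ group, data = df, FUN = mean)
```

The result is a data.frame with one row per group, so this interface is tied to data.frame-like input, whereas the thread asks for something that also covers objects such as GRanges.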
It would seem that BiocParallel needs a bp analog to every member of the apply family. In ddR, we instead define data structures that represent partitioned, distributed data that is managed by some computational engine, so we are able to use existing generics, with implicit parallelism.
OK. I've marked this as an enhancement.