split/apply/combine paradigm

Open mllg opened this issue 12 years ago • 6 comments

I'd like to get started on this one and use this tracker to collect and discuss ideas.

AFAIR, @lawremi suggested back in September using split/by (split), bp*apply (apply), and stack (combine).

I'm rather unsure what functionality is needed. Usually I'm fine with split, bplapply and l*ply/Reduce.
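For reference, the split, bplapply and combine workflow mentioned above can be sketched roughly as follows. This is a minimal illustration, not a proposed API: the `iris` data set, the `summarise_piece` helper, and the choice of `SerialParam()` are assumptions made for the example, and the code falls back to `lapply` when BiocParallel is not installed.

```r
# Sketch: split / apply / combine with base R and BiocParallel.
# The data set (iris) and summarise_piece() are illustrative assumptions.

# apply step: use bplapply when BiocParallel is available, else plain lapply.
# SerialParam() keeps the example deterministic; swap in another BPPARAM
# (e.g. MulticoreParam() or SnowParam()) to actually run in parallel.
apply_fun <- if (requireNamespace("BiocParallel", quietly = TRUE)) {
  function(X, FUN) {
    BiocParallel::bplapply(X, FUN, BPPARAM = BiocParallel::SerialParam())
  }
} else {
  lapply
}

# split: one data.frame per species
pieces <- split(iris, iris$Species)

# apply: compute a one-row summary for each piece
summarise_piece <- function(df) {
  data.frame(Species = df$Species[1],
             mean_sepal_length = mean(df$Sepal.Length))
}
results <- apply_fun(pieces, summarise_piece)

# combine: bind the per-piece summaries into a single data.frame
combined <- do.call(rbind, results)
print(combined)
```

`Reduce(rbind, results)` would work for the combine step as well; `do.call(rbind, ...)` is used here because it binds all pieces in one call.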

mllg, Dec 12 '13 10:12

split/apply/combine is a nice mental model, but maybe it does not need explicit representation in code. Another direction is thinking about faster ways to iterate, i.e., can we form partitions of data more efficiently? The data.table package has some interesting approaches.

Michael

lawremi, Dec 12 '13 17:12

When I need to do a split-apply-combine type of operation, I usually turn to plyr::ddply.

DarwinAwardWinner, Dec 14 '13 01:12

Yes, that's a useful tool. It would be nice to have a similar API on top of BiocParallel (and thus BatchJobs). We worked toward making aggregate() behave that way by allowing the LHS of the formula to be omitted, but I think we ended up punting due to release deadlines. Also, we'd want it to be more generic, with support for e.g. GRanges. I rarely use a data.frame.
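To make the idea concrete, a ddply-like front end over bplapply might look something like the sketch below. The function `bp_ddply` and its arguments are hypothetical names invented for this example; nothing like it exists in BiocParallel or plyr. It only handles a data.frame, so it does not address the GRanges case, and it falls back to `lapply` if BiocParallel is not installed.

```r
# Hypothetical helper (not part of BiocParallel or plyr): a ddply-like
# wrapper that splits a data.frame by one or more columns, applies FUN
# to each piece via bplapply, and row-binds the results.
bp_ddply <- function(data, by, FUN, BPPARAM = NULL) {
  # split: one piece per observed combination of the grouping columns
  pieces <- split(data, data[by], drop = TRUE)

  # apply: bplapply when available, otherwise plain lapply
  results <- if (requireNamespace("BiocParallel", quietly = TRUE)) {
    if (is.null(BPPARAM)) BPPARAM <- BiocParallel::SerialParam()
    BiocParallel::bplapply(pieces, FUN, BPPARAM = BPPARAM)
  } else {
    lapply(pieces, FUN)
  }

  # combine: FUN is assumed to return a data.frame for every piece
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Usage: group mtcars by cylinder count and number of gears
out <- bp_ddply(mtcars, c("cyl", "gear"), function(df) {
  data.frame(cyl = df$cyl[1], gear = df$gear[1],
             n = nrow(df), mean_mpg = mean(df$mpg))
})
print(out)
```

Passing a different `BPPARAM` is what would move the same call onto another backend; the default here is serial purely to keep the example simple.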

lawremi, Dec 14 '13 02:12

Any opposition to closing this issue?

vobencha, Nov 04 '15 15:11

This issue sort of depends on having a clean API in base R for aggregation. Currently, aggregate and friends fall a bit short. Once we have that, then BiocParallel will need a corresponding frontend. Perhaps there is no need for a specific issue.

It would seem that BiocParallel needs a bp analog to every member of the apply family. In ddR, we instead define data structures that represent partitioned, distributed data that is managed by some computational engine, so we are able to use existing generics, with implicit parallelism.

lawremi, Nov 04 '15 17:11

OK. I've marked this as an enhancement.

vobencha, Nov 06 '15 15:11