
Caching for drake

Open ldecicco-USGS opened this issue 6 years ago • 7 comments

Data scientists are experts at mining large volumes of data to produce insights, predict outcomes, and/or create visuals quickly and methodically. drake (https://github.com/ropensci/drake) has solved a lot of problems in the data science pipeline, but one thing we still struggle with is how to collaborate effectively on a large-scale project without each contributor needing to run the entire workflow, and without splitting the workflow into many disjointed smaller workflows. In some large-scale projects, neither option is feasible.

It would be awesome if a wide community of R developers could come together and try to create a way for drake to have a collaborative caching feature.

My group set up a wrapper package for remake (drake's predecessor) that allows tiny indicator files to be pushed to GitHub. Each indicator file tells the user that a target is complete and that its data has been pushed to some common caching location. The next user pulls from the upstream GitHub repository to get the indicator file. That user then does not need to re-run a target that another collaborator has already run; instead, they pull the data down (if it's needed) rather than creating it from the workflow. It got a bit awkward because we needed 2-3 remake targets to accomplish this, and that tripped up our "non-power-user" collaborators.
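To make the indicator-file idea concrete, here is a minimal sketch in base R. It is not how scipiper or drake actually implement anything: `push_target()` and `pull_target()` are hypothetical helpers invented for illustration, a local directory stands in for the shared cache (Google Drive, S3, etc.), and an MD5 hash is assumed as the content of the indicator file.

```r
# Hypothetical helpers (not part of drake, remake, or scipiper).
# A local directory stands in for the shared cache location.

# Copy a finished data file into the shared cache under its MD5 hash and
# write a tiny indicator file containing that hash. The indicator file is
# what would be committed to git.
push_target <- function(data_file, ind_file, cache_dir) {
  hash <- unname(tools::md5sum(data_file))
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  file.copy(data_file, file.path(cache_dir, hash), overwrite = TRUE)
  writeLines(hash, ind_file)
  invisible(hash)
}

# Given an indicator file (e.g. pulled from git), fetch the data from the
# shared cache only if the local copy is missing or its hash differs.
# Returns TRUE if the file was fetched, FALSE if it was already up to date.
pull_target <- function(ind_file, data_file, cache_dir) {
  hash <- readLines(ind_file, n = 1)
  if (file.exists(data_file) &&
      identical(unname(tools::md5sum(data_file)), hash)) {
    return(FALSE)
  }
  cached <- file.path(cache_dir, hash)
  if (!file.exists(cached)) {
    stop("Target not found in shared cache: ", hash)
  }
  file.copy(cached, data_file, overwrite = TRUE)
  TRUE
}

# Usage: one collaborator builds and pushes, another pulls.
work  <- tempfile("indicator-demo")
dir.create(work)
cache <- file.path(work, "shared-cache")
built <- file.path(work, "built.csv")
ind   <- file.path(work, "built.csv.ind")

write.csv(head(mtcars), built)           # collaborator A builds the target
push_target(built, ind, cache)           # ...and shares it

local_copy <- file.path(work, "local.csv")
pull_target(ind, local_copy, cache)      # collaborator B fetches: TRUE
pull_target(ind, local_copy, cache)      # already up to date: FALSE
```

The point of the sketch is that only the small indicator file has to travel through version control; the data itself moves through the cache, and only when a collaborator actually needs it.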

I'd propose that the first step be to develop a caching workflow for Google Drive (using the googledrive package). Once the process was fleshed out using Google Drive, it could be more easily expanded to other data storage options (AWS using the aws.s3 package, for example).

My gut says this might need to be a wrapper or companion package to drake (to keep drake's dependencies to a minimum), but I'm not sure. @wlandau and other drake experts: I would looove to hear any feedback you have on this idea. If in fact this issue is a non-issue (i.e., drake can already handle caching and I just missed it, which is totally possible), then we could morph this issue into a group that helps create more content for a drake blogdown/bookdown book!

The wrapper package for remake is here: https://github.com/USGS-R/scipiper

#12 is another drake-based project.

ldecicco-USGS, Apr 18 '18 19:04