ROpenSci storage for package caching
Lots of packages need caching. When things get complicated enough, packages may need to access their own external data to do their internal stuff. This means caching somewhere, somehow, in a form that is reliable and available. This costs money.
"Oh, I don't have any of that; okay, I can't do that package."
And there it ends. How about considering applications to ROpenSci to (financially) support caching via some suitable provider? The flipper package is a case in point. It works at the moment because it only trawls the CRAN_package_db. We would like to extend this to all man/ directories, to all non-CRAN packages on GitHub, and to many other potential places. That is impossible without some sort of cloud caching scheme.
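For context, a minimal sketch of the kind of local caching that works today, assuming the metadata come from tools::CRAN_package_db() (the cache path and helper name are just illustrative, not flipper's actual internals):

```r
# Illustrative only: cache CRAN package metadata locally so repeated
# queries don't re-download it every time. flipper's real internals may differ.
cache_file <- file.path(tempdir(), "cran_db.rds")  # a persistent path in practice

get_cran_db <- function() {
  if (file.exists(cache_file)) {
    readRDS(cache_file)
  } else {
    db <- tools::CRAN_package_db()  # one big data.frame of CRAN metadata
    saveRDS(db, cache_file)
    db
  }
}

db <- get_cran_db()
```

Scaling that to every man/ directory and every non-CRAN GitHub package is exactly the point where the cache can no longer live on one person's disk.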
Any chance of ROpenSci having an application scheme whereby those with existing ROpenSci packages apply for access to a wee chunk of server space?
I was literally just crafting a proposal for a caching process for drake! I think it's a slightly different use case than what you are proposing here, but maybe we can combine forces and think through all the caching use cases and needs. (See #30)
When it comes to caching, I am a huge fan of @richfitz's storr package. It's a general key-value store with an expanding variety of backends ("drivers"), including storr_rds() and storr_dbi(). Maybe a remote storr driver would help here? Related: http://richfitz.github.io/storr/articles/external.html.
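For anyone who hasn't used it, a quick sketch of the storr API with the rds backend (the path is a throwaway placeholder); a remote driver would slot in at the storr_rds() step:

```r
library(storr)

# key-value store backed by rds files on disk
st <- storr::storr_rds(tempfile("cache-"))

st$set("cran_db", tools::CRAN_package_db())  # cache an expensive object
st$exists("cran_db")                         # TRUE
db <- st$get("cran_db")                      # retrieve it later

st$destroy()                                 # remove the store when done
```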
thoughts (chat with @mpadge and @sckott):
- remote caching
- for people moving between jobs, best to host data with an organization that is longer lived (e.g., ropensci)
- e.g. flipper (see above)
- scheduling: good, but not as important as the caching itself
- in onboarding: could have a checkbox for requesting data caching, and which options are needed (if the package is accepted, we can take over the caching)
- cost: would need a way to estimate cost; roughly package downloads * requests per use of the cache * cost per download from S3
- how does authentication work: hash S3 keys to give to people? can we whitelist certain people?
- we could require that all jobs are updated via our server with our S3 keys, but then people can't update manually
- pulling data from S3 is easy, probably via storr (rough sketch after this list)
- need more use cases!
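To make the last two bullets concrete, here is a rough sketch of "pulling data from S3 via storr", combining storr's external-store fetch hooks (the vignette linked above) with the aws.s3 package. The bucket name and key layout are invented, and this assumes read-only credentials have been handed out as discussed:

```r
library(storr)
library(aws.s3)

bucket <- "ropensci-package-caches"  # hypothetical rOpenSci-hosted bucket

# fetch hook: called only when a key is missing locally; pulls the
# serialised object down from S3
fetch_s3 <- function(key, namespace) {
  aws.s3::s3readRDS(object = paste0(namespace, "/", key, ".rds"),
                    bucket = bucket)
}

# local rds store acts as a cache in front of S3
st <- storr::storr_external(storr::driver_rds(tempfile("s3-cache-")),
                            fetch_s3)

# first call downloads from S3; later calls hit the local copy
# db <- st$get("cran_db")
```

Writes (and so the S3 write keys) could then stay on the rOpenSci side, which fits the "jobs updated via our server" idea, with the trade-off already noted that authors couldn't push updates manually.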