Alternative caching backends

Open dimatura opened this issue 9 years ago • 10 comments

Hi! ipycache is great, but one issue I've run into is that raw pickles are slow and big, especially for large arrays. In the past I've tried a bunch of alternatives (pickle+gzip, hdf5, etc.), so I implemented a couple of these as alternative backends in ipycache here: https://github.com/dimatura/ipycache/tree/npyz. They all have tradeoffs, but I think something like this could be pretty useful overall. Any interest in a PR? I'd be willing to clean things up.

dimatura avatar Mar 17 '15 17:03 dimatura

Sounds like a great idea! @ihrke would you be willing to review/merge a PR?

rossant avatar Mar 17 '15 18:03 rossant

One important issue is which "cons" would be acceptable. Right now I think gzipped pickles are pretty painless, since they use only the stdlib and accept anything picklable. joblib is a close second: it can also store anything picklable but works much better for arrays. It does add a dependency, so I guess there could be some conditional import logic there. bloscpack is currently my favorite for arrays in terms of speed/storage, but it only works for arrays. Hickle (based on h5py) I wouldn't currently recommend, as it's a bit hackish, though the idea is nice.
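For reference, the gzipped-pickle option mentioned above can be sketched with the stdlib alone. This is a minimal illustration, not the code from the linked branch; the function names `save_gzip_pickle` and `load_gzip_pickle` are made up for this example.

```python
import gzip
import pickle


def save_gzip_pickle(obj, path):
    """Pickle obj and write it gzip-compressed to path (hypothetical helper)."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_gzip_pickle(path):
    """Read back an object written by save_gzip_pickle (hypothetical helper)."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

Anything picklable can go through these two helpers, which is the "painless" part; the tradeoff is that compression adds CPU time on every save and load.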

dimatura avatar Mar 17 '15 18:03 dimatura

Nice idea. I can do the reviewing/merging. How would we handle choosing the backend? We could parse the provided filename, pass an option to the cache magic, or allow the user to set it globally for a notebook. Personally, I would prefer a combination of the last two options. I agree that dependencies are an issue. It would be nice to keep the backends optional and fail with a graceful error in case of missing dependencies.
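One way the "optional backends with a graceful error" idea could look is sketched below. The mapping `BACKEND_MODULES`, the exception `BackendUnavailableError`, and the function `load_backend_module` are all hypothetical names for illustration; nothing here is taken from ipycache itself.

```python
import importlib

# Hypothetical mapping from a backend name to the module it depends on.
BACKEND_MODULES = {
    "pickle": "pickle",
    "gzip": "gzip",
    "joblib": "joblib",
    "bloscpack": "bloscpack",
}


class BackendUnavailableError(RuntimeError):
    """Raised when a known backend's dependency is not installed."""


def load_backend_module(name):
    """Import the module behind a backend, failing with a readable message."""
    try:
        module_name = BACKEND_MODULES[name]
    except KeyError:
        raise ValueError("unknown cache backend: %r" % name) from None
    try:
        # Import lazily so that missing optional packages only matter
        # when the user actually asks for that backend.
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise BackendUnavailableError(
            "backend %r requires the %r package, which could not be imported"
            % (name, module_name)
        ) from exc
```

Importing lazily keeps the stdlib backends usable even when none of the optional packages are installed.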

ihrke avatar Mar 18 '15 08:03 ihrke

@ihrke +1 for all of these ideas, + filename extension parsing as well and fallback to cell-wise/global option.

rossant avatar Mar 18 '15 09:03 rossant

ok, so the hierarchy is:

  1. explicitly provided cell-wise option
  2. globally provided option
  3. filename parsing

Meaning that a cell-wise option beats everything else and filename parsing is the last fallback?
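The hierarchy above could be sketched roughly as follows. The function `resolve_backend`, the extension table `EXTENSION_BACKENDS`, and the `"pickle"` default are assumptions made for this example, not part of any existing ipycache API.

```python
import os

# Hypothetical extension-to-backend table; the real choices would be up to the PR.
EXTENSION_BACKENDS = {
    ".pkl": "pickle",
    ".gz": "gzip",
    ".joblib": "joblib",
    ".blp": "bloscpack",
}


def resolve_backend(cell_option=None, global_option=None, filename=None,
                    default="pickle"):
    """Pick a backend: cell option first, then global option, then filename."""
    if cell_option is not None:
        return cell_option
    if global_option is not None:
        return global_option
    if filename is not None:
        # Only the last extension is inspected, so "x.pkl.gz" maps to ".gz".
        ext = os.path.splitext(filename)[1].lower()
        if ext in EXTENSION_BACKENDS:
            return EXTENSION_BACKENDS[ext]
    return default
```

With this ordering, an explicit option never gets silently overridden by a file extension, and an unrecognized extension falls back to the default.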

ihrke avatar Mar 18 '15 10:03 ihrke

LGTM

rossant avatar Mar 18 '15 10:03 rossant

Yeah, that hierarchy looks good to me.

dimatura avatar Mar 18 '15 23:03 dimatura

Just throwing this out there, but I have a package (calling it persist right now) that allows one to archive objects using hdf5 for arrays etc.

https://bitbucket.org/mforbes/persist

The idea is to convert objects to executable source code in an importable module, putting large arrays in hdf5 files etc. as needed. This has some significant advantages over pickles in that the persistent archives are less likely to go stale: even if the code changes, objects can be reloaded as long as the API is fixed, and if things do break, the archive can be edited by hand to fix things. It also allows one to archive things that can't be pickled (such as functions). As long as one can write source code to specify the object, it can be archived. One can define a custom representation by providing a single method get_persistent_rep().

I need to clean a few things up, but if this sounds useful, let me know and I will get it ready for release. It would be awesome to get this and issue #13 resolved so I can start using %%cache in a serious way.

mforbes avatar Mar 22 '15 02:03 mforbes

Looks interesting. We could support it as an alternative backend (under the same constraints as the others, i.e., graceful fallback in case the module fails to import etc.). Let's wait for @dimatura's PR before extending it with the persist module (in case it's functional by then).

ihrke avatar Mar 23 '15 13:03 ihrke

Do cloudpickle and dill fall into this same category for ipycache backends? Also, why not have the pickle protocol as one of the options to the cell magic?
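To make the protocol suggestion concrete, here is a small sketch of passing a protocol number through to the stdlib `pickle` module. The wrapper name `dumps_with_protocol` is hypothetical, and how such an option would be exposed on the cell magic is left open.

```python
import pickle


def dumps_with_protocol(obj, protocol=None):
    """Serialize obj with an explicit pickle protocol (hypothetical wrapper)."""
    if protocol is None:
        protocol = pickle.DEFAULT_PROTOCOL
    # Reject out-of-range values up front instead of relying on pickle's
    # own handling (which, for example, treats negative values as "highest").
    if not 0 <= protocol <= pickle.HIGHEST_PROTOCOL:
        raise ValueError("unsupported pickle protocol: %r" % (protocol,))
    return pickle.dumps(obj, protocol=protocol)
```

Lower protocols are readable by older Python versions, while higher ones are generally more compact and faster for large binary data, so exposing the choice is a compatibility versus size tradeoff.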

den-run-ai avatar Jul 22 '15 16:07 den-run-ai