
Review notebook caching and execution packages

Open choldgraf opened this issue 5 years ago • 7 comments

A place to discover and list other tools that provide some form of notebook caching, execution, or storage abstraction

  • Scrapbook (metadata tagging for python objects and cell outputs)
  • Bookstore (storage layer on S3 for notebooks)
  • Zarr (chunked storage interface https://zarr.readthedocs.io/en/stable/)

choldgraf avatar Feb 17 '20 07:02 choldgraf

  • tinydb is a well-used, lightweight package with a simple JSON database API. Different storage classes can be used, and these can also be wrapped in middleware to customise their behaviour:
>>> from tinydb import TinyDB
>>> from tinydb.storages import JSONStorage
>>> from tinydb.middlewares import CachingMiddleware
>>> db = TinyDB('/path/to/db.json', storage=CachingMiddleware(JSONStorage))

chrisjsewell avatar Feb 17 '20 08:02 chrisjsewell

scrapbook contains (in-memory only) classes to represent a collection of notebooks (Scrapbook) and a single notebook (Notebook).

Of note is that these have methods for returning notebook/cell execution metrics (such as time taken), which they presumably store during notebook execution.

They also provide methods to access 'scraps', which are outputs stored with name identifiers (see ExecutableBookProject/myst_parser#46)
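
The 'scraps' idea above can be illustrated with a minimal stdlib-only sketch (this is not scrapbook's actual storage format or API, just the concept): named values are attached to a cell's outputs so they can be retrieved later by identifier rather than by cell position.

```python
# Illustrative sketch only: the media type and function names here are
# invented, not scrapbook's real API.
def glue(cell, name, value):
    """Attach a named, JSON-serialisable value to a cell's outputs."""
    cell.setdefault("outputs", []).append({
        "output_type": "display_data",
        "data": {"application/x-scrap+json": {"name": name, "data": value}},
        "metadata": {},
    })

def read_scraps(notebook):
    """Collect all named scraps from every cell in a notebook dict."""
    scraps = {}
    for cell in notebook.get("cells", []):
        for output in cell.get("outputs", []):
            scrap = output.get("data", {}).get("application/x-scrap+json")
            if scrap:
                scraps[scrap["name"]] = scrap["data"]
    return scraps

cell = {"cell_type": "code", "source": "compute()", "outputs": []}
glue(cell, "accuracy", 0.93)
nb = {"cells": [cell]}
print(read_scraps(nb))  # {'accuracy': 0.93}
```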

chrisjsewell avatar Feb 17 '20 08:02 chrisjsewell

This is the link to the caching currently implemented by @mmcky and @AakashGfude: https://github.com/QuantEcon/sphinxcontrib-jupyter/blob/b5d9b2e77fdc571c4c718e67847020625d096d6d/sphinxcontrib/jupyter/builders/jupyter_code.py#L119

chrisjsewell avatar Feb 19 '20 11:02 chrisjsewell

Another thought I had is to look at git itself, e.g. via GitPython. I could conceive of the cache being its own small repository: when you add or update a notebook you 'stage' it, then on execution you take all the 'staged' notebooks, run them, and commit back the final notebooks.
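
The stage/execute/commit flow described above could be sketched like this, using an in-memory dict keyed by content hashes in place of a real git repository (all names here are illustrative, not an existing API):

```python
import hashlib

class NotebookCache:
    """Hypothetical sketch of a stage -> execute -> commit notebook cache."""

    def __init__(self):
        self.staged = {}      # uri -> source notebook text
        self.committed = {}   # source hash -> executed notebook text

    def stage(self, uri, source):
        """Stage a new or updated notebook for execution."""
        self.staged[uri] = source

    def execute_staged(self, run):
        """Run every staged notebook, committing each result under its source hash."""
        for uri, source in list(self.staged.items()):
            key = hashlib.sha256(source.encode()).hexdigest()
            self.committed[key] = run(source)
            del self.staged[uri]

    def lookup(self, source):
        """Return the previously executed notebook for this exact source, if any."""
        key = hashlib.sha256(source.encode()).hexdigest()
        return self.committed.get(key)

cache = NotebookCache()
cache.stage("nb1.ipynb", "print(1 + 1)")
cache.execute_staged(run=lambda src: src + "  # executed")
print(cache.lookup("print(1 + 1)"))  # print(1 + 1)  # executed
```

A real git-backed version would get history and diffing for free, at the cost of managing a repository on disk.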

chrisjsewell avatar Feb 19 '20 11:02 chrisjsewell

  • rossant/ipycache (last commit 2016), SmartDataInnovationLab/ipython-cache (last commit 2018) are both examples of cell level magics that pickle the outputs of cells for later use.
  • mkery/Verdant (last commit Oct 24, 2019) is a JupyterLab extension that automatically records the 'history' of Jupyter notebook cells, and stores them in a .ipyhistory JSON file. Note, the code is all written in TypeScript.
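
The pickle-based pattern those cell magics use can be sketched in a few lines of stdlib Python: cache a cell's result under a hash of its source, and reload it instead of re-executing (function names here are illustrative, not the actual ipycache API):

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

def cached_run(cell_source, execute, cache_dir):
    """Return the cached result for this cell source, executing only on a miss."""
    key = hashlib.sha256(cell_source.encode()).hexdigest()
    path = Path(cache_dir) / f"{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())   # cache hit: skip execution
    result = execute(cell_source)                # cache miss: run and store
    path.write_bytes(pickle.dumps(result))
    return result

with tempfile.TemporaryDirectory() as d:
    calls = []
    run = lambda src: (calls.append(src), eval(src))[1]
    print(cached_run("2 ** 10", run, d))  # 1024 (executed)
    print(cached_run("2 ** 10", run, d))  # 1024 (loaded from cache)
    print(len(calls))  # 1
```

The obvious caveat, which applies to the real magics too, is that only the pickled value is invalidated by source changes; external state (files, globals) is not tracked.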

chrisjsewell avatar Feb 19 '20 12:02 chrisjsewell

> Another thought I had is to look at git itself, e.g. via GitPython. I could conceive of the cache being its own small repository: when you add or update a notebook you 'stage' it, then on execution you take all the 'staged' notebooks, run them, and commit back the final notebooks.

I think this is the kind of thing that some more bespoke notebook UIs do. E.g. I believe that Gigantum.IO (a proprietary cloud interface for notebooks) commits notebooks to a git repository on-the-fly, and then gives you the option to go back in history if needed. I don't believe they do any execution caching, just content caching.

choldgraf avatar Feb 19 '20 15:02 choldgraf

Thank you for creating this helpful resource!

As I am on the search myself, here is another pointer (which I still need to explore):

dask.cache and cachey
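
For context, the idea behind cachey (and Dask's opportunistic cache built on it) is to keep values whose recomputation is expensive relative to their storage size. A rough stdlib sketch of that scoring rule, with the formula and eviction policy heavily simplified, might look like:

```python
import sys

class CostCache:
    """Simplified sketch of cost-aware caching: score = compute time / size,
    evicting the cheapest-to-recompute entries first. Not cachey's real API."""

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.data = {}    # key -> value
        self.scores = {}  # key -> compute_time / nbytes

    def put(self, key, value, compute_time):
        nbytes = sys.getsizeof(value)
        self.data[key] = value
        self.scores[key] = compute_time / max(nbytes, 1)
        # Evict lowest-scoring entries while over the byte budget.
        while sum(sys.getsizeof(v) for v in self.data.values()) > self.limit:
            worst = min(self.scores, key=self.scores.get)
            del self.data[worst], self.scores[worst]

    def get(self, key):
        return self.data.get(key)

cache = CostCache(limit_bytes=1500)
cache.put("expensive", "x" * 1000, compute_time=5.0)   # slow to compute: kept
cache.put("cheap", "y" * 1000, compute_time=0.01)      # fast to recompute: evicted
print(cache.get("expensive") is not None)  # True
print(cache.get("cheap"))                  # None
```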

eldad-a avatar May 04 '20 08:05 eldad-a