jupyter-cache
Review notebook caching and execution packages
A place to discover and list other tools that provide some form of notebook caching, execution, or storage abstraction:
- Scrapbook (metadata tagging for python objects and cell outputs)
- Bookstore (storage layer on S3 for notebooks)
- Zarr (chunked storage interface https://zarr.readthedocs.io/en/stable/)
- tinydb is a well-used, lightweight package with a simple JSON database API. Different storage classes can be used, which can also be wrapped in middleware to customise their behaviour (note the `TinyDB` import, which the original snippet omitted):
>>> from tinydb import TinyDB
>>> from tinydb.storages import JSONStorage
>>> from tinydb.middlewares import CachingMiddleware
>>> db = TinyDB('/path/to/db.json', storage=CachingMiddleware(JSONStorage))
scrapbook contains (in-memory only) classes to represent a collection of notebooks (`Scrapbook`) and a single notebook (`Notebook`).
Of note is that these have methods for returning notebook/cell execution metrics (such as time taken), which they presumably store during notebook execution.
They also provide methods to access 'scraps': outputs stored with name identifiers (see ExecutableBookProject/myst_parser#46)
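As a rough, stdlib-only illustration of the scraps idea — named outputs persisted in cell metadata and recovered later. The function names here are my own, not scrapbook's actual API:

```python
import json

def glue_scrap(cell, name, value):
    """Record a named, JSON-serialisable output in the cell's metadata."""
    cell.setdefault("metadata", {}).setdefault("scraps", {})[name] = value

def read_scraps(notebook):
    """Collect all named scraps from an nbformat-style notebook dict."""
    scraps = {}
    for cell in notebook.get("cells", []):
        scraps.update(cell.get("metadata", {}).get("scraps", {}))
    return scraps

nb = {"cells": [{"cell_type": "code", "source": "x = 1 + 1", "metadata": {}}]}
glue_scrap(nb["cells"][0], "result", 2)
print(json.dumps(read_scraps(nb)))  # {"result": 2}
```

Because scraps ride along in the notebook's own JSON, they survive round-trips through storage layers like the ones listed above.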
This is the link to the caching currently implemented by @mmcky and @AakashGfude: https://github.com/QuantEcon/sphinxcontrib-jupyter/blob/b5d9b2e77fdc571c4c718e67847020625d096d6d/sphinxcontrib/jupyter/builders/jupyter_code.py#L119
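For comparison, here is a minimal sketch of the general approach such builders take — hash the code content and skip re-execution when it is unchanged. This is my own stdlib-only illustration, not the sphinxcontrib-jupyter implementation:

```python
import hashlib

def source_hash(cells):
    """Hash the concatenated source of a notebook's code cells."""
    joined = "\n".join(cells).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()

cache = {}  # source hash -> executed outputs

def execute(cells, runner):
    """Run the notebook only if its source hash is not already cached."""
    key = source_hash(cells)
    if key not in cache:
        cache[key] = [runner(cell) for cell in cells]
    return cache[key]

outputs = execute(["1 + 1", "2 * 3"], runner=eval)
print(outputs)  # [2, 6]
```

A second call with identical cells returns the cached outputs without invoking the runner again; any edit to a cell changes the hash and forces a fresh run.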
- rossant/ipycache (last commit 2016) and SmartDataInnovationLab/ipython-cache (last commit 2018) are both examples of cell-level magics that pickle the outputs of cells for later use.
- mkery/Verdant (last commit Oct 24, 2019) is a JupyterLab extension that automatically records the 'history' of Jupyter notebook cells and stores them in a .ipyhistory JSON file. Note, the code is all written in TypeScript.
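The core pickling idea behind magics like ipycache can be sketched with the stdlib alone (the function name here is illustrative, not the actual ipycache API):

```python
import os
import pickle
import tempfile

def cached_eval(path, expression):
    """Evaluate an expression, pickling its result; reuse the pickle on later calls."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = eval(expression)  # stands in for executing a notebook cell
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

cache_file = os.path.join(tempfile.mkdtemp(), "cell0.pkl")
first = cached_eval(cache_file, "sum(range(10))")   # computed
second = cached_eval(cache_file, "sum(range(10))")  # loaded from disk
print(first, second)  # 45 45
```

The obvious limitation, which those projects share, is that the pickle is keyed by file path rather than by cell content, so stale results must be invalidated manually.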
Another thought I had is to look at git itself, e.g. via GitPython. I could conceive of the cache being its own small repository: when you add a new notebook or update one, you 'stage' it; then on execution you get all the 'staged' notebooks, run them, and commit back the final notebooks.
I think this is the kind of thing that some more bespoke notebook UIs do. E.g., I believe that Gigantum.IO (a proprietary cloud interface for notebooks) commits notebooks to a git repository on-the-fly, and then gives you the option to go back in history if needed. I don't believe they do any execution caching, just content caching.
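A stdlib-only sketch of that stage/execute/commit flow, with plain dicts standing in for a real repository (in practice GitPython's `Repo.index.add` and `Repo.index.commit` would play these roles):

```python
import hashlib

class NotebookCache:
    """Toy git-like cache: stage notebook sources, execute them, then 'commit' results."""

    def __init__(self):
        self.staged = {}   # name -> source
        self.commits = []  # list of {name: (source_hash, outputs)} snapshots

    def stage(self, name, source):
        """Register a new or updated notebook for execution."""
        self.staged[name] = source

    def execute_and_commit(self, runner):
        """Run all staged notebooks and record the results as a commit."""
        snapshot = {}
        for name, source in self.staged.items():
            digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
            snapshot[name] = (digest, runner(source))
        self.commits.append(snapshot)
        self.staged.clear()
        return snapshot

cache = NotebookCache()
cache.stage("intro.ipynb", "1 + 1")
snap = cache.execute_and_commit(runner=eval)
print(snap["intro.ipynb"][1])  # 2
```

With a real git backend you would additionally get history and diffing of the committed notebooks for free.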
Thank you for creating this helpful resource!
As I am on the search myself, here is another pointer (which I still need to explore):
dask.cache and cachey
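As I understand it, cachey's distinguishing idea is weighing what to keep by how expensive an entry is to recompute, rather than pure recency. A much-simplified stdlib sketch of that idea (not cachey's actual API):

```python
class CostAwareCache:
    """Keep at most `capacity` entries, evicting the one cheapest to recompute."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}   # key -> value
        self.costs = {}  # key -> recompute cost (e.g. seconds)

    def put(self, key, value, cost):
        if len(self.data) >= self.capacity and key not in self.data:
            cheapest = min(self.costs, key=self.costs.get)
            if cost <= self.costs[cheapest]:
                return  # not worth caching
            del self.data[cheapest], self.costs[cheapest]
        self.data[key] = value
        self.costs[key] = cost

    def get(self, key, default=None):
        return self.data.get(key, default)

c = CostAwareCache(capacity=2)
c.put("cheap", 1, cost=0.1)
c.put("pricey", 2, cost=10.0)
c.put("medium", 3, cost=1.0)  # evicts "cheap", the cheapest to recompute
print(c.get("pricey"), c.get("cheap"))  # 2 None
```

For notebook execution this seems like a good fit, since cells vary enormously in how long they take to re-run.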