Size reduction approaches

Open agriyakhetarpal opened this issue 10 months ago • 2 comments

Currently, we generate notebooks for the TryExamples directive, and the JupyterLite, NotebookLite, and Voici directives can take a path to a notebook.

While we can disable the JupyterLite source maps to reduce build size (a technique used downstream in https://github.com/scikit-learn/scikit-learn/pull/26246, https://github.com/numpy/numpy/pull/26745, and https://github.com/sympy/sympy/pull/27419), I've opened this issue as an open-ended item to see if jupyterlite-sphinx as a companion to JupyterLite can reduce its footprint by reducing the size of the notebooks that are eventually copied into the JupyterLite folder.

  • Use https://github.com/arve0/ipynbcompress? It has not been maintained for the last two years; maybe we can fork it or take over maintenance via PEP 541?
  • Use optipng (if available) or projects with Python bindings to it, or pillow as an optional dependency, to reduce the size of images in the docs: https://sphinx-gallery.github.io/stable/gen_modules/sphinx_gallery.utils.optipng.html
  • Use nbstripout, if enabled via a global config option, to clear all outputs (except the jupyterlite_sphinx_strip tag) and kernel metadata from all notebooks (maybe not for the TryExamples notebooks, but this would be useful for long-form notebooks; the scikit-learn docs via Sphinx-Gallery already seem to do this: https://sphinx-gallery.github.io/stable/auto_examples/plot_9_multi_image_separate.html).
    • We don't currently do this for Markdown notebooks that were added in #221. Since they don't contain the outputs of the cells in their contents, they don't have any outputs upon conversion to IPyNB either.
    • However, conventional IPyNB files can indeed contain outputs, which we can explore stripping.
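
To make the output-stripping idea concrete, here is a minimal sketch using only the standard library. It assumes notebooks follow the nbformat 4 JSON layout, where code cells carry `outputs` and `execution_count` keys. The helper names `strip_outputs` and `json_size` are hypothetical, not part of jupyterlite-sphinx or nbstripout. The sketch leaves kernel metadata alone and does not handle the jupyterlite_sphinx_strip tag.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs cleared."""
    # Round-tripping through JSON is a cheap deep copy for JSON-only data.
    stripped = json.loads(json.dumps(notebook))
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def json_size(notebook: dict) -> int:
    """Size in bytes of the notebook when serialised as UTF-8 JSON."""
    return len(json.dumps(notebook).encode("utf-8"))


# Tiny synthetic notebook, for illustration only (not taken from any real docs build).
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": ["print('hello')"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hello\n"]}
            ],
        },
    ],
}

slim = strip_outputs(example)
print(json_size(example), "->", json_size(slim), "bytes")
```

In practice one would read each `.ipynb` with `json.load`, apply the helper, and write the result back before the files are copied into the JupyterLite folder.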

Reductions in sizes will be helpful for:

  • projects deploying documentation via GitHub Pages or on other static webpage hosts
  • reducing bandwidth usage for readers of said documentation

agriyakhetarpal avatar Jan 09 '25 15:01 agriyakhetarpal

Okay, so, as an experiment, I tried reducing the size of the notebooks generated by the TryExamples directive. My thinking was that these notebooks are far more numerous than the ones connected to the NotebookLite/JupyterLite directive(s), since they can quickly run into the thousands depending on how many docstrings exist in the entire documentation source for a package. Sadly, the results are not helpful. :(

By removing the outputs from the UUID-based notebooks from numpy/numpy#26745, the size of 1496 total notebooks was reduced from 4.5 MiB to 2.5 MiB, and a similar test for SymPy with 2519 notebooks/examples revealed a reduction from 6.8 MiB to 3.9 MiB. Measured against NumPy's total docs size without JupyterLite's source maps (~138 MiB), this is a paltry improvement of just 1.44% and is not really worth incorporating. Even with global docstring examples enabled for Matplotlib, which also uses Sphinx-Gallery and has a lot of images in its notebook outputs, I didn't see much of a reduction (about 20 MiB, bringing the total down to 557 MiB).
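
For anyone wanting to check the arithmetic, here is a small sketch that recomputes the savings from the sizes quoted above. It assumes the 1.44% figure is the NumPy saving divided by the ~138 MiB total docs size. The variable names and the `percent` helper are made up for this illustration.

```python
def percent(part: float, whole: float) -> float:
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Sizes in MiB, as quoted in this comment.
numpy_before, numpy_after = 4.5, 2.5
sympy_before, sympy_after = 6.8, 3.9
numpy_docs_total = 138.0  # NumPy docs without JupyterLite's source maps

numpy_saved = numpy_before - numpy_after
sympy_saved = sympy_before - sympy_after

# Saving relative to the notebooks themselves.
numpy_notebook_cut = percent(numpy_saved, numpy_before)
sympy_notebook_cut = percent(sympy_saved, sympy_before)

# Saving relative to the whole NumPy docs build.
numpy_total_cut = percent(numpy_saved, numpy_docs_total)

print(f"NumPy notebooks: {numpy_notebook_cut:.1f}% smaller")
print(f"SymPy notebooks: {sympy_notebook_cut:.1f}% smaller")
print(f"NumPy docs overall: {numpy_total_cut:.2f}% smaller")
```

The notebooks themselves shrink by a bit under half, but because they are such a small share of the build, the overall saving stays below 2%.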

I'll leave this issue open in case there's something I am missing in this aim to reduce the build size from jupyterlite-sphinx's side that anyone else can point out. Otherwise, we can close and try to find optimisation options in JupyterLite itself.

agriyakhetarpal avatar Jan 10 '25 00:01 agriyakhetarpal

It is also likely that the blobs in git (and on the wire) are actually gzipped, so the actual gains are lower.
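
A rough way to see this effect, as a sketch only: compare raw and gzipped sizes of a notebook with and without outputs. The notebook below is synthetic and deliberately repetitive, so the numbers say nothing about NumPy or SymPy. The helper names `make_notebook` and `sizes` are hypothetical.

```python
import gzip
import json


def make_notebook(with_outputs: bool, n_cells: int = 50) -> dict:
    """Build a synthetic nbformat-4 style notebook (illustrative data only)."""
    cells = []
    for i in range(n_cells):
        outputs = []
        if with_outputs:
            text = [f"result line {i}\n"] * 20
            outputs = [{"output_type": "stream", "name": "stdout", "text": text}]
        cells.append(
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": i + 1 if with_outputs else None,
                "source": [f"print({i})"],
                "outputs": outputs,
            }
        )
    return {"nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": cells}


def sizes(notebook: dict) -> tuple:
    """Return (raw_bytes, gzipped_bytes) for the JSON-serialised notebook."""
    raw = json.dumps(notebook).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


raw_with, gz_with = sizes(make_notebook(with_outputs=True))
raw_without, gz_without = sizes(make_notebook(with_outputs=False))

raw_saved = raw_with - raw_without
gz_saved = gz_with - gz_without

print(f"raw:     {raw_with} -> {raw_without} bytes (saved {raw_saved})")
print(f"gzipped: {gz_with} -> {gz_without} bytes (saved {gz_saved})")
```

Real notebook outputs (especially already-compressed PNG images) compress far less well than this repetitive text, so the gap between raw and gzipped savings would differ on actual docs builds.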

Carreau avatar Jan 13 '25 09:01 Carreau