jupyter-cache
How should/would the cache be used remotely?
Originally posted by @choldgraf in https://github.com/ExecutableBookProject/jupyter-cache/pull/6#issuecomment-590100257
Maybe a use-case to consider here.
A team has a really big book, it takes 2 hours to complete. An author forks the book, clones it locally, edits one page. They want to contribute the page back. A few questions:
- Do they need to run the entire 2-hour build process locally before seeing what the page looks like? --> this seems like it could be handled by making the cache's execution step configurable to run only on specific files
- When they make a PR, does the entire book need to re-build top to bottom on the CI/CD job? --> here the cache could probably be stored as a build artifact in a CI/CD job, independent of the git repository
- Is there any way for a "master cache" to be bundled with the book?
- If so, then is that a pattern we want to encourage?
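The "only re-execute specific files" idea above comes down to keying the cache on notebook *content* rather than on paths or timestamps. A minimal sketch of that logic, with hypothetical helper names (`notebook_hash`, `needs_execution`) and a deliberately simplified hashing policy — jupyter-cache's actual implementation normalizes more than this:

```python
import hashlib
import json

def notebook_hash(path):
    """Hash only the code-cell sources, so pure prose edits don't
    invalidate the cache (one possible hashing policy)."""
    with open(path) as f:
        nb = json.load(f)
    code = "\n".join(
        "".join(cell["source"])
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
    )
    return hashlib.sha256(code.encode()).hexdigest()

def needs_execution(path, cache):
    """`cache` maps notebook path -> hash recorded at last execution.
    Only notebooks whose hash has changed need the expensive re-run."""
    return cache.get(path) != notebook_hash(path)
```

Under a scheme like this, an author who edits one page out of a 2-hour book only pays for re-executing that one page, whether the cache lives locally, in a CI artifact, or on a shared server.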
 
I could see a benefit of committing the cache, in the sense that then git would keep track of changes to the cache and diffs to the pages would propagate through github, clones, etc. However, I worry about a few things:
- The cache would probably become gigantic for non-trivial projects, unless it could be incrementally-updated and have some kind of "shallow clone" behavior.
- It would require sub-moduling a book repository, so I think it would only work for fairly advanced power users.
- The cache diffs themselves would be binary (I think?) so they wouldn't make any sense in github which would make it hard to know what has changed in the cache.
Our Python lectures take around 1.5 hours to build from scratch, so this is our scenario.
For 99% of our PRs, we just make the edits in RST, generate the ipynb for that one page and then run it manually to see if it looks OK. This is fine for most edits, which typically adjust language or tweak code.
If we're concerned about how this looks in the PDF, say, we generate that one page locally. Sometimes RAs will include an image in the PR to show that the PDF looks fine.
These are imperfect systems but they work OK for the most part. So my vote would be for us to favor simplicity, at least initially, by not committing the cache. (Plus, I'm a reasonably sophisticated user, but submodules still confuse me. My instinct is to fear and distrust them.)
@jstac you can never trust two things: politicians, and sub-modules.
I wonder if one potential way to address this would involve meeting another use-case: building single-page documents. If we make the CLI easy for building the HTML or PDF of one page and letting users quickly preview what it looks like, the same machinery could be re-used for people that only want to build a single page and not an entire book...
Yep, that seems like a good idea. Two birds with one stone, etc. And the single-page use case is certainly important.
Such tools are available in jupinx for reviewing edits to QE lectures. I suppose cross references involving other pages won't work. But, for 99% of cases, it's perfectly fine.
Glad you mentioned this @jstac. It will be really important to support rendering of single pages for usability. We currently do this using an environment variable FILES= and passing that through to Sphinx. I agree the CLI tool needs to cater to this and make it easier :-)
An approach I was playing around with for the jupyter book CLI was to use jupyter-book page: https://jupyterbook.org/features/page.html
perhaps we could use the same pattern, but also allow for PDF output with a kwarg or something?
As discussed with @mmcky, jupinx currently uses a static cache, housed in the Sphinx _build folder on an Amazon server. The build is persisted for all execution triggers (cron-jobbed every hour), which run a 'git pull' then sphinx-build. For this use case, the (just merged) hash implementation of jupyter-cache should work fine.
@mmcky also noted that their current (Sphinx-based) cache implementation doesn't work on Travis CI, presumably because the cache is compressed/un-compressed, changing the file mtimes that Sphinx uses to determine re-builds (matching against a dictionary stored in the pickled environment object). This wouldn't be an issue for jupyter-cache, since it is hash based.
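To illustrate why the hash-based approach survives CI round-trips where the mtime-based one doesn't, here is a small self-contained demonstration (the file names and `file_hash` helper are illustrative, not jupyter-cache's API):

```python
import hashlib
import os
import shutil
import tempfile
import time

def file_hash(path):
    """Content hash: stable across copies and compression round-trips."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Write a stand-in "notebook" file.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "page.ipynb")
with open(src, "w") as f:
    f.write('{"cells": []}')

# Simulate a CI cache restore: identical bytes, but a fresh mtime.
dst = os.path.join(workdir, "restored.ipynb")
shutil.copyfile(src, dst)
os.utime(dst, (time.time() + 100, time.time() + 100))

# An mtime comparison (Sphinx's strategy) thinks the file changed;
# a hash comparison (jupyter-cache's strategy) knows it did not.
assert os.path.getmtime(src) != os.path.getmtime(dst)
assert file_hash(src) == file_hash(dst)
```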
It would also be interesting to think how it might work with GitHub actions, CircleCI and ReadTheDocs builds.
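For GitHub Actions specifically, one pattern might be to persist the cache directory between runs with the `actions/cache` action, keyed on the notebook sources. A hedged sketch (the `_build/.jupyter_cache` path is an assumption about where the cache lives in a given project):

```yaml
# Restore/save the execution cache between CI runs, so only
# notebooks whose sources changed are re-executed.
- uses: actions/cache@v4
  with:
    path: _build/.jupyter_cache
    key: notebook-cache-${{ hashFiles('**/*.ipynb', '**/*.md') }}
    restore-keys: |
      notebook-cache-
```

Because jupyter-cache's invalidation is hash based, the compression/decompression that the cache action performs wouldn't cause spurious re-builds the way an mtime-based scheme does.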
Just a note to self, in case this issue is encountered (SQLite on NFS): jupyter/notebook#1782
Another related note: for jupyter book I was starting to collect a repository with several CI/CD patterns that could be used to deploy books: https://github.com/choldgraf/jupyter-book-deploy-demo
I think it'd be helpful if we replicated that repository for the new build system, ideally with multiple levels of complexity that users may want (e.g. vanilla build w/o execute then host online, execute and build, and execute+cache and build).