Simplify the project build step to avoid errors
Building an example for the site can prove difficult, even if the project environment is locked. This happens because, at the project build step, some extra dependencies are installed in the project's default environment in order to run `nbsite`:

https://github.com/pyviz-topics/examples/blob/17a17be1a1b159095be55801202741e049a780e8/.github/workflows/build.yml#L49-L53

In a way, this defeats the original purpose of locking the environment, since it leads to changes in the default environment locked by the project creator. However, I think there may be a way to avoid having to install extra dependencies in the default environment. An evaluated project contains:
- a thumbnails folder, generated by `nbsite` if not already present
- the executed/evaluated notebooks, generated by `nbsite`
- an archive of the project, built with the `doit archive_project` command, which does not require extra dependencies
- RST files, generated by `nbsite`
As far as I understand it, the real purpose of running `nbsite` during the build step is mostly to generate the evaluated notebooks and the thumbnails; these are really the tricky bits (they are then pushed during the project build step to the evaluated branch, and later fetched during the docs build step).

The obvious alternative for thumbnail generation is to generate them manually, which I think would be an improvement anyway since the automatically generated thumbnails don't always look nice, require extra dependencies (selenium) and can be a little heavy. If not provided by the project creator, a default thumbnail could be added by a simple script, as sketched below.
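A minimal sketch of such a script, assuming a `thumbnails/` folder next to the notebooks and a hypothetical placeholder image shipped with the repo (the `assets/default.png` path is just an illustration):

```python
# add_default_thumbnails.py -- hypothetical helper, not part of the repo.
# For every notebook in a project directory, copy a placeholder image into
# thumbnails/ when no thumbnail was provided by the project creator.
import shutil
import sys
from pathlib import Path

DEFAULT_THUMBNAIL = Path("assets/default.png")  # assumed placeholder image

def add_default_thumbnails(project_dir):
    project = Path(project_dir)
    thumbs = project / "thumbnails"
    thumbs.mkdir(exist_ok=True)
    for nb in project.glob("*.ipynb"):
        target = thumbs / f"{nb.stem}.png"
        if not target.exists():
            shutil.copy(DEFAULT_THUMBNAIL, target)

if __name__ == "__main__":
    add_default_thumbnails(sys.argv[1])
```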
As for evaluating the project notebooks, `nbsite` generates them by using `nbconvert`, which happens to be a dependency of the `notebook` package. The `notebook` package is a dependency of each project, so each project always has a version of `nbconvert` installed. `nbconvert` can execute and save a notebook with the command `jupyter nbconvert --execute --to notebook nb.ipynb`. `nbsite` uses `nbconvert` programmatically to execute a notebook, however note that it subclasses `ExecutePreprocessor` to customize a couple of things. If these changes are still required to execute the notebooks of examples, then examples should have a small `evaluate_notebooks.py` script that imports `nbconvert` to execute and export a notebook (it isn't that much code).
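A minimal sketch of what such a script could look like, using only `nbformat` and `nbconvert`'s stock `ExecutePreprocessor` (any customizations from `nbsite`'s subclass would still have to be ported); the timeout and kernel name below are just assumed defaults:

```python
# evaluate_notebooks.py -- sketch: execute every notebook in a project
# directory in place, using nbconvert's ExecutePreprocessor.
import argparse
from pathlib import Path

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

def evaluate(nb_path, timeout=3600):
    nb = nbformat.read(str(nb_path), as_version=4)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name="python3")
    # Execute in the notebook's own directory so relative data paths resolve.
    ep.preprocess(nb, {"metadata": {"path": str(nb_path.parent)}})
    nbformat.write(nb, str(nb_path))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--directory", required=True)
    args = parser.parse_args()
    for nb_path in sorted(Path(args.directory).glob("*.ipynb")):
        evaluate(nb_path)
```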
As such, the build step could be turned into the following, and would no longer require any extra dependency to be installed into the locked environment to run the notebooks.
```bash
anaconda-project prepare --directory $DIR
source activate $DIR/envs/default
python scripts/evaluate_notebooks.py --directory $DIR
source deactivate
conda install nbsite
doit build_project --name $DIR
```
Note that the last `doit build_project` step still actually runs `nbsite`, except that this time the notebooks have already been evaluated and saved in the `/doc` folder, so `nbsite` won't re-evaluate them (reminder to self: might need to deactivate the auto-execution of notebooks by myst-nb).
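If that turns out to be needed, myst-nb's automatic execution can be switched off from the Sphinx `conf.py`; a minimal sketch, assuming the docs build goes through Sphinx/myst-nb (the option name depends on the myst-nb version):

```python
# conf.py (docs build) -- sketch: disable myst-nb's automatic notebook
# execution so that the pre-evaluated notebooks are rendered as-is.

# recent myst-nb releases
nb_execution_mode = "off"

# older myst-nb releases used this name instead:
# jupyter_execute_notebooks = "off"
```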
I wonder whether that last step could be simplified further by not running `nbsite` at all in the project build step. I'm not sure to what extent the RST files are required to be saved in the evaluated branch; they are generated automatically by `nbsite` when a gallery is defined, so they would also be created in the docs build step.
If this all works, the documentation should indicate that new projects must have `nbconvert` added to their dependencies (who knows if it's always going to remain a dependency of `notebook`).
Thanks for outlining exactly the sort of (improved) approach I've had in mind for a while now! The question then is who will implement it :-)
I love seeing a good plan like this, because then I can just imagine it's already been implemented and I can be happy. :-)
> The question then is who will implement it :-)
The one who opened the issue! The first step in my mind, though, is to update `nbsite` to be able to use `nbconvert >= 6`; right now it's pinned to `nbconvert <6`. In that process I will see if subclassing `ExecutePreprocessor` is still required.
The main thing we needed there is the functionality of the `SkipOutput` and `NotebookSlice` preprocessors, though I think these can all be applied to an evaluated notebook without needing an execution context.
I hope I am right in thinking that only thumbnailing needed to be run in the live environment...
> I hope I am right in thinking that only thumbnailing needed to be run in the live environment...
I also hope so.
As for the preprocessors, they can be applied to the evaluated notebook.
If I remember right, there is also some preamble Python code to execute first (e.g. to run imports without displaying them). The same effect can be achieved by adding a fixed cell to all notebooks and slicing it out afterwards, as in the sketch below.
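A minimal sketch of that post-processing step, assuming the shared preamble cell is marked with a hypothetical `preamble` cell tag; since the notebook has already been evaluated, no kernel or execution context is needed:

```python
# strip_preamble.py -- sketch: drop the tagged preamble cell from an
# already-evaluated notebook before it is rendered by the site.
import sys

import nbformat

def strip_preamble(nb_path, tag="preamble"):
    nb = nbformat.read(nb_path, as_version=4)
    nb.cells = [
        cell for cell in nb.cells
        if tag not in cell.get("metadata", {}).get("tags", [])
    ]
    nbformat.write(nb, nb_path)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        strip_preamble(path)
```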
Contributors currently have the possibility to push evaluated notebooks directly to the `evaluated` branch if their notebook(s) require datasets so large that they would take too long to download during the build, or there wouldn't be enough storage on the CI system to save them. This is well described in this section of the documentation.
There's currently no way for a project to declare that it has notebooks that should not be evaluated in the build process. There are two website building options offered to the contributors:
- `orphans`: to list notebooks that should be rendered but not linked from the main page
- `skip`: to list notebooks that should be skipped entirely when building the website
These two options are described in the documentation. In practice `orphans` doesn't appear to be used at all, and `skip` is used once, in iex_trading:

https://github.com/pyviz-topics/examples/blob/10772f10bc5e8dfd7c3bd8a894d372135176b39a/iex_trading/anaconda-project.yml#L14-L15
(1) I'd suggest adding a new option called `skip_evaluation_project_build` (naming suggestions welcome!) that would list the notebooks that should not be evaluated as part of the project build. I specifically target the project build because the repo has a GH Action that runs the project tests on project changes, which apparently uses smaller datasets stored in the `test_data` folder instead of the full datasets. This is also described in the documentation.
(2) I'd also suggest that the evaluated notebooks should be pushed by contributors to their PR branch instead of directly to the `evaluated` branch as explained in the documentation.
Enforcing (1) and (2) would make it a lot easier to update a notebook whose evaluation should be skipped. Currently the notebook either has to be re-executed locally, or changed (e.g. a typo fix) and then committed to `evaluated`. With (1) and (2), if an already-evaluated notebook is modified by a contributor, during a project build it would simply be copied as-is to the `evaluated` branch, along the lines of the sketch below.
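A minimal sketch of how the project build could honour such an option; the key name `skip_evaluation_project_build` and its placement at the root of `anaconda-project.yml` are purely hypothetical at this point:

```python
# Sketch: during the project build, skip evaluation for the notebooks listed
# under the hypothetical skip_evaluation_project_build option and copy the
# contributor-provided, already-evaluated versions instead.
import shutil
from pathlib import Path

import yaml  # pyyaml

def notebooks_to_skip(project_dir):
    spec = yaml.safe_load((project_dir / "anaconda-project.yml").read_text())
    return set(spec.get("skip_evaluation_project_build", []))

def build_notebooks(project_dir, out_dir):
    project, out = Path(project_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    skipped = notebooks_to_skip(project)
    for nb in project.glob("*.ipynb"):
        if nb.name in skipped:
            shutil.copy(nb, out / nb.name)  # already evaluated by the contributor
        else:
            ...  # run the normal nbconvert-based evaluation here
```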
Another suggestion that came to me while having a look at all the anaconda projects: it's difficult to differentiate between the keys that are for `anaconda-project` itself and those that are used to build `examples`. I would rather have them all nested under a single key, e.g. `examples_info` or `examples_config`. I assume `user_fields` would still have to be set at the root level, even if it's not 100% clear to me how it's used and by which tool.
I've gathered information about the data source of each project, to try to get a better picture of how that works and identify the notebooks whose evaluation should be skipped in the project build process.
The data source column contains:
- `x`: the project doesn't require any external data
- `project`: `anaconda-project` takes care of downloading the data
- `notebook/`: data obtained when the notebook is executed
  - `intake`: from an Intake catalog (local or online)
  - `download`: data downloaded in some way, e.g. with `pd.read_csv(external_url)`
  - `repo`: from the repo directly
- `indications`: only indications on how to download the data are provided
The *has small test data* column indicates whether the project has an accompanying folder in `test_data/`. The *note* column is just additional information that I thought was worth recording, like the absence of lock files.
project | data source | has small test data | note |
---|---|---|---|
attractors | x | ||
bay_trimesh | project | yes | |
boids | x | ||
carbon_flux | notebook/intake | ||
census | project | yes | |
datashader_dashboard | notebook/intake | yes | very large data in there |
euler | x | ||
exoplanets | project | yes | no lock file |
gapminders | notebook/download | | with `pd.read_csv(external_url)` |
genetic_algorithm | x | ||
gerrymandering | project | yes | |
glaciers | notebook/repo | ||
goldbach_comet | x | ||
gull_tracking | project | yes | evaluated notebook committed on master |
heat_and_trees | notebook/intake | yes | |
hipster_dynamics | x | ||
iex_trading | project | yes | |
landsat | project | yes | |
landsat_clustering | notebook/intake | | no lock file |
landsat_classification | notebook/intake | | catalog not local, fetched from S3/earth-data |
lsystems | x | ||
ml_annotators | x | ||
network_packets | project | yes | |
nyc_buildings | project | yes | |
nyc_taxi | project | yes | |
opensky | project | yes | accompanying script to download the data (cron job that ran for 4 days) |
osm | indications | yes | why not have the S3 download links in the project? |
particle_swarms | x | ||
penguin_crossfilter | notebook/repo | ||
portfolio_optimizer | notebook/repo | ||
seattle_lidar | notebook/intake | | no lock file |
ship_traffic | project + notebook/repo | yes | |
square_limit | x | ||
sri_model | x | ||
uk_researchers | project | yes | |
voila_gpx_viewer | x | ||
walker_lake | notebook/intake | yes |
Now, this little detective work has not yet allowed me to determine exactly all the notebooks that should not be evaluated. The only obvious ones at this stage are those of the `osm` project, since the project doesn't provide any way to download the data automatically. I'm not sure it's the only one; I'll need to search a little bit more.
For information, of the files we store on S3/datashader-data, these are the ones larger than 1 GB:
All of these proposals sound great to me; I like the direction this is going!
Couple of notes after talking with Jean-Luc:

- Suggested naming the new option `skip_doc_evaluation`
- Suggested enforcing the presence of a lock file in a project for a build to complete (from me: a YAML linter wouldn't hurt in addition to that); a possible check is sketched below
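A minimal sketch of such a check, assuming the standard `anaconda-project-lock.yml` file name and that project directories are the top-level folders containing an `anaconda-project.yml`:

```python
# check_lock_files.py -- sketch: fail the build if any project lacks a lock file.
import sys
from pathlib import Path

def missing_lock_files(root="."):
    missing = []
    for spec in Path(root).glob("*/anaconda-project.yml"):
        if not (spec.parent / "anaconda-project-lock.yml").exists():
            missing.append(spec.parent)
    return missing

if __name__ == "__main__":
    missing = missing_lock_files()
    for project in missing:
        print(f"Missing lock file in {project}")
    sys.exit(1 if missing else 0)
```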
Yes, and I think storing evaluated notebooks on the master branch (only for the projects that need it) is not a bad idea either (using the `skip_doc_evaluation` key to drive CI copying the notebook over to the evaluated branch).
This is implemented.