matplotlib
[MNT]: Try improving doc build speed by using PyStemmer
Summary
See the note in https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-html_search_language
Proposed fix
I propose measuring what impact the use of PyStemmer would have on our docs for
- a build from scratch
- an incremental build without any changes to the previous build.
As an order-of-magnitude expectation, I know that it speeds up the sphinx-doc build by ~10%.
Good first issue - notes for new contributors
This issue is suited to new contributors because it does not require understanding of the Matplotlib internals. To get started, please see our contributing guide.
We do not assign issues. Check the Development section in the sidebar for linked pull requests (PRs). If there are none, feel free to start working on it. If there is an open PR, please collaborate on the work by reviewing it rather than duplicating it in a competing PR.
If something is unclear, please reach out on any of our communication channels.
Hello, has this been taken? May I complete this as my first issue please?
what would be the difficulty level of this issue?
The difficulty is easy. What you need to do:
- Set up the development environment: https://matplotlib.org/devdocs/devel/development_setup.html
- Build the docs https://matplotlib.org/devdocs/devel/document.html and measure the build time
We would like to have build time durations for
- full build (clean any sphinx output before)
- incremental build, i.e. a build after a full build without any changes
First measure the two variants as is, then install pystemmer and redo the measurements.
I suggest building each of the four variants 2-3 times to get an estimate of the variation in build times. This will make it possible to judge whether an observed change in build time with/without pystemmer is significant or just noise.
Essentially, the result should be a table of build times like this (numbers made up):
| | default | with pystemmer |
|---|---|---|
| full build | 10:36 10:45 10:20 | 10:15 9:37 9:48 |
| incremental build | 1:34 1:36 1:29 | 1:23 1:29 1:22 |
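The measurement procedure described above could be scripted along these lines. This is only a sketch under assumptions: it presumes the docs are built with `make html` from a `doc` directory and that a `make clean` target removes previous Sphinx output; the helper names `time_command`, `format_mmss`, and `measure_builds` are made up for illustration.

```python
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run cmd and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True)
    return time.perf_counter() - start


def format_mmss(seconds):
    """Format a duration in seconds as m:ss."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"


def measure_builds(doc_dir="doc", repeats=3):
    """Time a full and an incremental doc build, `repeats` times each."""
    results = {"full build": [], "incremental build": []}
    for _ in range(repeats):
        # Assumption: `make clean` wipes the previous Sphinx output.
        subprocess.run(["make", "clean"], cwd=doc_dir, check=True)
        # First build after cleaning is the full build.
        results["full build"].append(time_command(["make", "html"], cwd=doc_dir))
        # Building again without any changes is the incremental build.
        results["incremental build"].append(
            time_command(["make", "html"], cwd=doc_dir)
        )
    return results
```

Calling `measure_builds()` once before and once after installing pystemmer, and printing each duration with `format_mmss`, would give the four cells of the table.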
Hello, I can try to run this test in the next couple of days, and I will provide the results as soon as possible.
For my computer, the time measurements before and after using PyStemmer are as follows:
| | Default | Using PyStemmer |
|---|---|---|
| Full Build | 7.52, 7.44, 8.10 | 7.01, 6.54, 6.52 |
| Incremental Build | 6.14, 5.52, 5.57 | 5.40, 5.32, 5.28 |
On average, the full build time without PyStemmer is 7.55, while with PyStemmer it is 6.56, resulting in a speed improvement of approximately 13%. The average incremental build time without PyStemmer is 6.01, and with PyStemmer it is 5.33, showing a speed improvement of around 8%. There may be some measurement errors. Should I open a PR for these results to change the documentation or something similar? @timhoffm
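For anyone wanting to double-check the averages, here is a small sketch of the arithmetic. It assumes the values are written as minutes.seconds (so `7.52` means 7 min 52 s), which is an interpretation rather than something stated explicitly, though it is consistent with the reported averages. The function names are illustrative.

```python
def to_seconds(value):
    """Convert a 'minutes.seconds' string such as '7.52' to seconds."""
    minutes, seconds = value.split(".")
    return int(minutes) * 60 + int(seconds)


def mean_seconds(values):
    """Mean duration in seconds of several 'minutes.seconds' strings."""
    return sum(to_seconds(v) for v in values) / len(values)


def speedup_percent(default, with_pystemmer):
    """Relative reduction in mean build time, in percent."""
    base = mean_seconds(default)
    return 100 * (base - mean_seconds(with_pystemmer)) / base


# Values copied from the table; kept as strings so "8.10" keeps its zero.
measurements = {
    "full build": (["7.52", "7.44", "8.10"], ["7.01", "6.54", "6.52"]),
    "incremental build": (["6.14", "5.52", "5.57"], ["5.40", "5.32", "5.28"]),
}

for label, (default, with_pystemmer) in measurements.items():
    print(f"{label}: {speedup_percent(default, with_pystemmer):.1f}% faster")
```

Under that assumption the results round to about 13% for the full build and 8% for the incremental build.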
Thanks @Anson0028, your measurements confirm the ~10% speedup assumption. I believe it's worth adding PyStemmer to our CI setup. Do you want to open a PR for that?
How do I add PyStemmer to the CI setup? I want to do it, but I don't know how. Sorry to waste your time on this simple question.
Add it to https://github.com/matplotlib/matplotlib/blob/main/requirements/doc/doc-requirements.txt
I think it's also reasonable to add it to https://github.com/matplotlib/matplotlib/blob/main/environment.yml
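After installing from either file, a quick check like the following can confirm that PyStemmer is actually present in the environment used for the doc build. It is a minimal sketch; the helper name `has_module` is illustrative, and it relies on PyStemmer being importable as the module `Stemmer`.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# PyStemmer is installed as an importable module named "Stemmer".
if has_module("Stemmer"):
    print("PyStemmer is available")
else:
    print("PyStemmer not found in this environment")
```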
Hmmm, we still build a search index? Would it be faster to suppress building the index?
I don't know whether one can prevent building a search index. Do you know how? This would at least be interesting for measuring how much time is left for PyStemmer.
Also, I often use search in the CI doc build to navigate to changed docs of the PR, so I would be somewhat reluctant to deactivate that.
I took a quick look and didn't see a way to deactivate it. It'd be great if it could be turned off and you could use an old index if needed, but I don't really know how the index works, so maybe that is not possible.
I just opened a PR to add PyStemmer to the environment file. Please let me know if there are any errors.
I have got the environment installed, but running the command "make html" generates the error "No module named 'matplotlib.colorizer'". I checked the local directory and saw that this file actually exists there. Just wondering why this problem has occurred. @timhoffm
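One possible cause, not confirmed in this thread, is that Python imports a different matplotlib installation than the source checkout. A diagnostic sketch such as the following prints where each module would be loaded from; the helper name `locate` is illustrative.

```python
import importlib.util


def locate(module_name):
    """Return the file a module would be loaded from, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "matplotlib") is itself missing.
        return None
    return spec.origin if spec is not None else None


for name in ("matplotlib", "matplotlib.colorizer"):
    print(name, "->", locate(name))
```

If the path printed for `matplotlib` points outside the local checkout, the build is probably using another installed copy.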