[MNT]: Try improving doc build speed by using PyStemmer

Open timhoffm opened this issue 1 year ago • 4 comments

Summary

See the note in https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-html_search_language

Proposed fix

I propose to measure what impact the use of PyStemmer would have on our docs for

  1. build from scratch
  2. incremental build without any changes to the previous build.

As an order-of-magnitude expectation, I know that it speeds up the sphinx-doc build by ~10%.

timhoffm avatar Jul 13 '24 23:07 timhoffm

Good first issue - notes for new contributors

This issue is suited to new contributors because it does not require understanding of the Matplotlib internals. To get started, please see our contributing guide.

We do not assign issues. Check the Development section in the sidebar for linked pull requests (PRs). If there are none, feel free to start working on it. If there is an open PR, please collaborate on the work by reviewing it rather than duplicating it in a competing PR.

If something is unclear, please reach out on any of our communication channels.

github-actions[bot] avatar Sep 13 '24 08:09 github-actions[bot]

Hello, has this been taken? May I complete this as my first issue please?

u7228810 avatar Oct 10 '24 08:10 u7228810

What would be the difficulty level of this issue?

u7228810 avatar Oct 10 '24 08:10 u7228810

The difficulty is easy. What you need to do:

  1. Set up the development environment: https://matplotlib.org/devdocs/devel/development_setup.html
  2. Build the docs (https://matplotlib.org/devdocs/devel/document.html) and measure the build time

We would like to have build time durations for

  • full build (clean any sphinx output before)
  • incremental build, i.e. build after a full build without any changes

First measure the two variants as is, then install pystemmer and redo the measurements.

I suggest building all four variants 2-3 times to get an estimate of the variation in build times. This will make it possible to judge whether an observed change in build time with/without pystemmer is significant or just noise.

Essentially, the result should be a table of build times like this (numbers made up):

| | default | with pystemmer |
|---|---|---|
| full build | 10:36, 10:45, 10:20 | 10:15, 9:37, 9:48 |
| incremental build | 1:34, 1:36, 1:29 | 1:23, 1:29, 1:22 |
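
The measurement procedure described above could be scripted roughly as follows. This is only a sketch: the helper names (`time_command`, `measure_builds`, `format_duration`) are made up for illustration, and the actual clean and build commands depend on your setup and are passed in as arguments.

```python
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run cmd and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True)
    return time.perf_counter() - start


def measure_builds(clean_cmd, build_cmd, repeats=3, cwd=None):
    """Time `repeats` full builds, each followed by one incremental build."""
    results = {"full": [], "incremental": []}
    for _ in range(repeats):
        # Remove previous output so the next build starts from scratch.
        subprocess.run(clean_cmd, cwd=cwd, check=True)
        results["full"].append(time_command(build_cmd, cwd))
        # Build again without any changes since the full build.
        results["incremental"].append(time_command(build_cmd, cwd))
    return results


def format_duration(seconds):
    """Format a duration in seconds as m:ss."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"
```

A call might look like `measure_builds(["make", "clean"], ["make", "html"], cwd="doc")`, assuming the docs are built with `make html` from a `doc` directory and that `make clean` removes the Sphinx output. Run it once without pystemmer and once with it installed.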

timhoffm avatar Oct 10 '24 09:10 timhoffm

Hello, I can try to run this test in the next couple of days, and I will provide the results as soon as possible.

Leoforever123 avatar Oct 22 '24 15:10 Leoforever123

On my computer, the time measurements before and after using PyStemmer are as follows:

| | Default | Using PyStemmer |
|---|---|---|
| Full Build | 7.52, 7.44, 8.10 | 7.01, 6.54, 6.52 |
| Incremental Build | 6.14, 5.52, 5.57 | 5.40, 5.32, 5.28 |

On average, the full build time without PyStemmer is 7.55, while with PyStemmer it is 6.56, resulting in a speed improvement of approximately 13%. The average incremental build time without PyStemmer is 6.01, and with PyStemmer it is 5.33, showing a speed improvement of around 8%. There may be some measurement errors. Should I open a PR for these results to change the documentation or something similar? @timhoffm
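
The averages and percentages above can be re-derived with a few lines of Python. This sketch assumes the times are written as minutes.seconds (so 7.52 means 7 min 52 s); that reading is not stated in the post, but it is the one under which the quoted averages work out. The function names are illustrative only.

```python
def to_seconds(value):
    """Convert an 'm.ss' string such as '7.52' to seconds (assumed notation)."""
    minutes, seconds = value.split(".")
    return int(minutes) * 60 + int(seconds)


def mean_seconds(values):
    """Mean duration in seconds of a list of 'm.ss' strings."""
    return sum(to_seconds(v) for v in values) / len(values)


def speedup_percent(baseline, candidate):
    """Relative reduction in mean build time, in percent."""
    base = mean_seconds(baseline)
    return 100 * (base - mean_seconds(candidate)) / base


# Times are kept as strings so that trailing zeros ('8.10') are not lost.
full_default = ["7.52", "7.44", "8.10"]
full_pystemmer = ["7.01", "6.54", "6.52"]
incr_default = ["6.14", "5.52", "5.57"]
incr_pystemmer = ["5.40", "5.32", "5.28"]

print(f"full build: {speedup_percent(full_default, full_pystemmer):.1f}% faster")
print(f"incremental build: {speedup_percent(incr_default, incr_pystemmer):.1f}% faster")
```

Under that assumption the script reproduces the figures quoted above: roughly 13% for the full build and 8% for the incremental build.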

Anson0028 avatar Oct 22 '24 20:10 Anson0028

Thanks @Anson0028, your measurements confirm the ~10% speedup assumption. I believe it's worth adding PyStemmer to our CI setup. Do you want to open a PR for that?

timhoffm avatar Oct 24 '24 07:10 timhoffm

How do I add PyStemmer to the CI setup? I would like to do it, but I don't know how. Sorry to take up your time with such a simple question.

Anson0028 avatar Oct 24 '24 08:10 Anson0028

Add it to https://github.com/matplotlib/matplotlib/blob/main/requirements/doc/doc-requirements.txt

I think it's also reasonable to add it to https://github.com/matplotlib/matplotlib/blob/main/environment.yml

timhoffm avatar Oct 24 '24 10:10 timhoffm
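
After adding the package to those files and recreating the environment, a quick check like the following can confirm that PyStemmer is actually importable where the docs are built. It relies on PyStemmer being imported under the module name `Stemmer`. The function name `pystemmer_available` is made up for this sketch.

```python
import importlib.util


def pystemmer_available():
    """Return True if PyStemmer's `Stemmer` module can be found."""
    return importlib.util.find_spec("Stemmer") is not None


print("PyStemmer installed:", pystemmer_available())
```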

Hmmm, we still build a search index? Would it be faster to suppress building the index?

jklymak avatar Oct 24 '24 14:10 jklymak

I don't know whether one can prevent building a search index. Do you know how? This would at least be interesting for measuring how much time is left for PyStemmer.

Also, I often use search in the CI doc build to navigate to changed docs of the PR, so I would be somewhat reluctant to deactivate that.

timhoffm avatar Oct 24 '24 17:10 timhoffm

I took a quick look and didn't see a way to deactivate it. It'd be great if it could be turned off, and you could use an old index if needed, but I don't really know how the index works, so maybe that is not possible.

jklymak avatar Oct 24 '24 18:10 jklymak

I just opened a PR to add PyStemmer to the environment file. Please let me know if there are any errors.

Anson0028 avatar Oct 25 '24 03:10 Anson0028

I have got the environment installed, but running the command "make html" generates the error "No module named 'matplotlib.colorizer'". I checked the local directory and see that this file actually exists there. Just wondering why this problem has occurred. @timhoffm

Chengyue-Fei avatar Oct 25 '24 11:10 Chengyue-Fei