sphinx
Optimize writing step of parallel builds
Is your feature request related to a problem? Please describe.
During the writing step of the build process, even when using -j auto, some time-consuming operations are performed serially. It might be possible to parallelize them and further improve performance.
Describe the solution you'd like
Last night I went down a rabbit hole and tried to optimize _write_parallel:
https://github.com/sphinx-doc/sphinx/blob/8c4865c30d5fa847d727fea16519d7afce627932/sphinx/builders/__init__.py#L571-L607
I tried to naively apply the following changes:
diff --git a/sphinx/builders/__init__.py b/sphinx/builders/__init__.py
index 2aede5c24..d27df5fdd 100644
--- a/sphinx/builders/__init__.py
+++ b/sphinx/builders/__init__.py
@@ -569,9 +569,11 @@ class Builder:
             self.write_doc(docname, doctree)
 
     def _write_parallel(self, docnames: Sequence[str], nproc: int) -> None:
-        def write_process(docs: List[Tuple[str, nodes.document]]) -> None:
-            self.app.phase = BuildPhase.WRITING
-            for docname, doctree in docs:
+        def write_process(docs: List[str]) -> None:
+            for docname in docs:
+                doctree = self.env.get_and_resolve_doctree(docname, self)
+                self.app.phase = BuildPhase.WRITING
+                self.write_doc_serialized(docname, doctree)
                 self.write_doc(docname, doctree)
 
         # warm up caches/compile templates using the first document
@@ -595,12 +597,7 @@ class Builder:
 
         self.app.phase = BuildPhase.RESOLVING
         for chunk in chunks:
-            arg = []
-            for docname in chunk:
-                doctree = self.env.get_and_resolve_doctree(docname, self)
-                self.write_doc_serialized(docname, doctree)
-                arg.append((docname, doctree))
-            tasks.add_task(write_process, arg, on_chunk_done)
+            tasks.add_task(write_process, chunk, on_chunk_done)
 
         # make sure all threads have finished
         tasks.join()
This resulted in a build time reduction of ~25%.
Without patch, using Sphinx v4.5.0 to build the python/cpython docs:
real 2m13,386s
user 4m35,046s
sys 0m7,205s
With patch:
real 1m36,804s
user 4m42,314s
sys 0m5,909s
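For reference, the reduction implied by the two "real" timings above can be worked out with a short script. This is only a sketch: the helper name parse_real is made up for illustration, and it assumes the bash time format with a comma (or dot) as the decimal separator, as in the output above.

```python
import re


def parse_real(line: str) -> float:
    """Convert a bash `time` line such as 'real 2m13,386s' to seconds.

    Hypothetical helper; accepts ',' or '.' as the decimal separator.
    """
    match = re.fullmatch(r"\w+\s+(\d+)m(\d+)[.,](\d+)s", line.strip())
    if match is None:
        raise ValueError(f"unrecognised time line: {line!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


before = parse_real("real 2m13,386s")  # without patch
after = parse_real("real 1m36,804s")   # with patch

reduction = (before - after) / before  # fraction of wall-clock time saved
speedup = before / after

print(f"wall-clock: {before:.3f}s -> {after:.3f}s")
print(f"reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
```

The wall-clock reduction comes out at roughly 27%, in line with the ~25% quoted above, while the user CPU time is almost unchanged (4m35s vs 4m42s): the same amount of work is done, it is just spread over the worker processes more evenly.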
Despite the performance improvements, this solution is wrong for (at least) a few reasons:
- while it mostly works, some things (e.g. images) break;
- write_doc_serialized was explicitly created for executing code that can't be parallelized, so it shouldn't be called in a process executed in parallel.
However, write_doc_serialized was added ~10 years ago in 5cd0841e5f041f3ef03840fafac425654a48b40d ("fix parallel build globals problems"). At the time the parallelization also relied on threading, whereas now it seems to be entirely based on multiprocessing. In builders/__init__.py it's defined as an empty method to be overridden, and the HTML builder overrides it with:
https://github.com/sphinx-doc/sphinx/blob/8c4865c30d5fa847d727fea16519d7afce627932/sphinx/builders/html/__init__.py#L673-L678
In addition, I noticed that _read_parallel seems to do some post-processing/merging in
https://github.com/sphinx-doc/sphinx/blob/8c4865c30d5fa847d727fea16519d7afce627932/sphinx/builders/__init__.py#L467-L475
At this point, my lack of knowledge of Sphinx internals prevented me from digging deeper, but I was left wondering:
1. if the fix added in 5cd0841e5f041f3ef03840fafac425654a48b40d is still relevant after moving away from threading, or if it can be revisited/reverted (at least partially);
2. if the issue fixed by the above commit can be addressed with some post-processing similar to the one used in _read_parallel;
3. if any of the operations currently executed serially by write_doc_serialized can be parallelized.
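The second question above (back-merging, as _read_parallel seems to do) could look roughly like the following: each worker collects the state it would otherwise have written to shared builder attributes and returns it, and the main process merges the partial results once a chunk is done. This is a toy sketch, not Sphinx code: write_chunk, merge_results and the images/index dictionaries are hypothetical stand-ins, and the "workers" are called serially here to keep the example deterministic.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical per-chunk result: image path -> output name, plus a tiny
# word -> docnames search index. Real builders track far more state.
ChunkResult = Tuple[Dict[str, str], Dict[str, Set[str]]]


def write_chunk(docs: Dict[str, str], docnames: List[str]) -> ChunkResult:
    """Stand-in for a worker: process documents and return local state."""
    images: Dict[str, str] = {}
    index: Dict[str, Set[str]] = {}
    for docname in docnames:
        for word in docs[docname].split():
            if word.endswith(".png"):
                # Record the image instead of mutating a shared builder.
                images[word] = word.rsplit("/", 1)[-1]
            else:
                index.setdefault(word.lower(), set()).add(docname)
    return images, index


def merge_results(main: ChunkResult, partial: ChunkResult) -> None:
    """Stand-in for the back-merge step run in the main process."""
    main_images, main_index = main
    part_images, part_index = partial
    main_images.update(part_images)
    for word, names in part_index.items():
        main_index.setdefault(word, set()).update(names)


docs = {
    "intro": "Welcome img/logo.png",
    "usage": "Welcome back img/screen.png",
}
state: ChunkResult = ({}, {})
for chunk in (["intro"], ["usage"]):  # one chunk per would-be worker
    merge_results(state, write_chunk(docs, chunk))

images, index = state
print(sorted(images))
print(sorted(index["welcome"]))
```

The point of the design is that nothing a worker does relies on mutating state owned by the main process; whatever it learns travels back as a return value and is merged serially.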
Perhaps someone more familiar with the Sphinx internals can take a look and confirm whether there is room for improvement or not?
Addressing this (especially 2. above) might also help fix the following issue (cc @tk0miya, since you worked on this code):
- #4459
Thank you for the detailed write-up Ezio, this is very useful.
> some things (e.g. images) break
What do you mean by this?
I didn't dig too much into this, but directives like .. image:: seem unable to correctly find/display the right image. Duplicate labels might break too as suggested by the issue linked above.
I tried an implementation in https://github.com/sphinx-doc/sphinx/pull/11746.
For the parts that need coordination/back-merging to the main process, such as image handling and the search index builder, a builder callback function merge_builder_post_transform was added.
It fixes the broken images and the search index for the HTML builder and the hyperlinks for the CheckExternalLinksBuilder.
@ezio-melotti If you want to try it out, install Sphinx from my PR branch and add enable_parallel_post_transform = True to your conf.py. I'd be interested in your feedback.
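For anyone else trying this, the opt-in would look like the fragment below. Note that enable_parallel_post_transform is only provided by the branch from #11746 as described above; nothing in this thread suggests it exists in a released Sphinx version.

```python
# conf.py (fragment)

# Opt in to the parallel post-transform/write step.
# Assumption: only the branch from PR #11746 recognises this option.
enable_parallel_post_transform = True
```

The build itself still has to be run in parallel (e.g. with -j auto) for the setting to make a difference.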
I tried your branch (from #11746) with the documentation of the python/cpython repo and it's 2-3 times faster: from 40s to 14s (!). This is with -j auto on a 16-core CPU. From a cursory inspection of the output, everything looks ok, including images (I haven't checked labels).
With another repo (the CPython Developer's guide), I saw no significant changes (2.9s with and without your changes), until I noticed that the Makefile didn't include -j auto. After adding it, it went from 1.4s (without your changes) to 0.9s (with your changes).
Let me know if you want me to perform any other test, and thanks for working on this!
Steps used to test:
$ cd Doc
$ make venv
$ time make html
...
real 0m39.435s
user 1m20.681s
sys 0m3.226s
$ make clean
rm -rf ./venv
rm -rf build/*
$ make venv
$ source venv/bin/activate
$ pip install -e ../sphinx # with your branch checked out
$ deactivate
$ time make html
...
real 0m39.170s
user 1m21.934s
sys 0m3.098s
$ vim conf.py # added enable_parallel_post_transform = True
$ rm -rf build/*
$ time make html
...
real 0m14.527s
user 1m29.697s
sys 0m3.619s
I also get nearly 2x with the CPython docs on macOS M2 with 10 cores. Here, the last number is the waiting time for the user, so from ~45s to ~24s:
With unpinned Sphinx in requirements.txt:
make html 79.05s user 3.65s system 184% cpu 44.875 total
With the PR branch (git+https://github.com/useblocks/sphinx.git@mh-parallel-post-transform-parallel-write-doc-serial in requirements.txt) and enable_parallel_post_transform = True in conf.py:
make html 89.40s user 4.27s system 396% cpu 23.641 total
And with the PR branch with no enable_parallel_post_transform set:
make html 80.15s user 3.96s system 186% cpu 45.209 total
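The cpu percentages in the three lines above are consistent with (user + system) / total, so they can be read as the average number of busy cores. A small sketch that recomputes them and the speedup relative to the unpinned run; the Run record type is hypothetical and the numbers are copied from the output above.

```python
from typing import NamedTuple


class Run(NamedTuple):
    label: str
    user: float    # CPU seconds spent in user mode
    system: float  # CPU seconds spent in kernel mode
    total: float   # wall-clock seconds


runs = [
    Run("unpinned Sphinx", 79.05, 3.65, 44.875),
    Run("PR branch, option enabled", 89.40, 4.27, 23.641),
    Run("PR branch, option unset", 80.15, 3.96, 45.209),
]

baseline = runs[0].total
for run in runs:
    # CPU utilisation is total CPU time divided by wall-clock time.
    cpu_percent = 100 * (run.user + run.system) / run.total
    speedup = baseline / run.total
    print(f"{run.label}: {cpu_percent:.0f}% cpu, {speedup:.2f}x vs unpinned")
```

Read this way, fewer than two cores are busy on average without the option and about four with it, for a speedup of roughly 1.9x; the total CPU time goes up slightly, so the gain comes from better parallelism rather than from doing less work.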