sphinx Optimize writing step of parallel builds

Is your feature request related to a problem? Please describe.

During the writing step of the build process, even when using -j auto, some time-consuming operations are performed serially. It might be possible to parallelize them and further improve performance.

Describe the solution you'd like

Last night I went down a rabbit hole and tried to optimize _write_parallel: https://github.com/sphinx-doc/sphinx/blob/8c4865c30d5fa847d727fea16519d7afce627932/sphinx/builders/__init__.py#L571-L607

I tried to naively apply the following changes:

diff --git a/sphinx/builders/__init__.py b/sphinx/builders/__init__.py
index 2aede5c24..d27df5fdd 100644
--- a/sphinx/builders/__init__.py
+++ b/sphinx/builders/__init__.py
@@ -569,9 +569,11 @@ class Builder:
                 self.write_doc(docname, doctree)
 
     def _write_parallel(self, docnames: Sequence[str], nproc: int) -> None:
-        def write_process(docs: List[Tuple[str, nodes.document]]) -> None:
-            self.app.phase = BuildPhase.WRITING
-            for docname, doctree in docs:
+        def write_process(docs: List[str]) -> None:
+            for docname in docs:
+                doctree = self.env.get_and_resolve_doctree(docname, self)
+                self.app.phase = BuildPhase.WRITING
+                self.write_doc_serialized(docname, doctree)
                 self.write_doc(docname, doctree)
 
         # warm up caches/compile templates using the first document
@@ -595,12 +597,7 @@ class Builder:
 
         self.app.phase = BuildPhase.RESOLVING
         for chunk in chunks:
-            arg = []
-            for docname in chunk:
-                doctree = self.env.get_and_resolve_doctree(docname, self)
-                self.write_doc_serialized(docname, doctree)
-                arg.append((docname, doctree))
-            tasks.add_task(write_process, arg, on_chunk_done)
+            tasks.add_task(write_process, chunk, on_chunk_done)
 
         # make sure all threads have finished
         tasks.join()

This resulted in a build time reduction of ~25%.

Without patch, using Sphinx v4.5.0 to build the python/cpython docs:

real    2m13,386s
user    4m35,046s
sys     0m7,205s

With patch:

real    1m36,804s
user    4m42,314s
sys     0m5,909s
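As a quick sanity check of the ~25% figure, the wall-clock (real) times above can be compared with a short script. This is only an illustrative sketch; the helper name parse_time is made up for this example.

```python
import re


def parse_time(text: str) -> float:
    """Convert a `time` value such as '2m13,386s' to seconds.

    Accepts either ',' or '.' as the decimal separator.
    """
    match = re.fullmatch(r"(\d+)m(\d+)[.,](\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


before = parse_time("2m13,386s")  # real time without the patch
after = parse_time("1m36,804s")   # real time with the patch

reduction = (before - after) / before
speedup = before / after
print(f"reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
# prints "reduction: 27.4%, speedup: 1.38x"
```

The user time barely changes (4m35s vs 4m42s), which is consistent with the same amount of work being spread over more processes rather than less work being done.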

Despite the performance improvements, this solution is wrong for (at least) a few reasons:

- while it mostly works, some things (e.g. images) break;
- write_doc_serialized was explicitly created for executing code that can't be parallelized, so it shouldn't be called in a process executed in parallel.

However, write_doc_serialized was added ~10 years ago in 5cd0841e5f041f3ef03840fafac425654a48b40d ("fix parallel build globals problems"). At the time, parallelization also relied on threading, whereas now it seems to be based entirely on multiprocessing. In builders/__init__.py it is defined as an empty method to be overridden, and the HTML builder overrides it with: https://github.com/sphinx-doc/sphinx/blob/8c4865c30d5fa847d727fea16519d7afce627932/sphinx/builders/html/__init__.py#L673-L678
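To illustrate why the threading/multiprocessing distinction matters here, below is a minimal simulation. It assumes that a worker process sees only a private copy of the builder, so anything it records is invisible to the parent. ToyBuilder and run_in_simulated_worker are hypothetical names, not Sphinx APIs, and copy.deepcopy merely stands in for the memory isolation of a real child process.

```python
import copy
from typing import Dict, List


class ToyBuilder:
    """Hypothetical stand-in for a builder; not Sphinx's Builder class."""

    def __init__(self) -> None:
        self.images: Dict[str, str] = {}

    def collect_image(self, docname: str) -> None:
        # Record an image that would later need copying to the output dir.
        self.images[f"{docname}.png"] = f"_images/{docname}.png"


def run_in_simulated_worker(builder: ToyBuilder, docnames: List[str]) -> ToyBuilder:
    # A child process works on its own copy of the parent's memory;
    # deepcopy imitates that isolation without starting a real process.
    worker_copy = copy.deepcopy(builder)
    for docname in docnames:
        worker_copy.collect_image(docname)
    return worker_copy


parent = ToyBuilder()
worker = run_in_simulated_worker(parent, ["intro", "usage"])
print(len(worker.images), len(parent.images))  # prints "2 0"
```

With threads, the worker would have mutated the shared builder directly; with processes, the parent's state stays empty unless the results are sent back explicitly. That would be one plausible explanation for the broken images.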

In addition, I noticed that _read_parallel seems to do some post-processing/merging in https://github.com/sphinx-doc/sphinx/blob/8c4865c30d5fa847d727fea16519d7afce627932/sphinx/builders/__init__.py#L467-L475

At this point, my lack of knowledge of Sphinx internals prevented me from digging deeper, but I was left wondering:

1. if the fix added in 5cd0841e5f041f3ef03840fafac425654a48b40d is still relevant after moving away from threading, or if it can be revisited/reverted (at least partially);
2. if the issue fixed by the above commit can be addressed with some post-processing similar to the one used in _read_parallel;
3. if any of the operations currently executed serially by write_doc_serialized can be parallelized.
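The post-processing idea in the second question could be sketched as follows: each worker returns the state it accumulated, and the main process merges it once the chunk is done. This is a rough sketch under that assumption only. write_chunk and merge_images are hypothetical names, and the chunks are processed inline here rather than through Sphinx's parallel task machinery.

```python
from typing import Dict, List


def write_chunk(docnames: List[str]) -> Dict[str, str]:
    """Hypothetical worker: writes documents, returns the images it found."""
    return {f"{name}.png": f"_images/{name}.png" for name in docnames}


def merge_images(merged: Dict[str, str], partial: Dict[str, str]) -> None:
    """Hypothetical merge step, run in the main process per finished chunk."""
    for source, target in partial.items():
        existing = merged.setdefault(source, target)
        if existing != target:
            # Two workers disagreeing on a target needs explicit handling.
            raise ValueError(f"conflicting targets for {source}")


chunks = [["intro", "usage"], ["api"]]
images: Dict[str, str] = {}
for chunk in chunks:
    # In a real build the worker call would run in a separate process and
    # the merge would happen in the "chunk done" callback.
    merge_images(images, write_chunk(chunk))

print(sorted(images))  # prints "['api.png', 'intro.png', 'usage.png']"
```

Whether this is enough depends on how much of the serial step's state can be expressed as a mergeable return value, which is exactly the open question.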

Perhaps someone more familiar with the Sphinx internals can take a look and confirm whether there is room for improvement or not?

Addressing this (especially 2. above) might also help fix the following issue (cc @tk0miya, since you worked on this code):

#4459

Aug 19 '22 16:08 ezio-melotti

Thank you for the detailed write-up Ezio, this is very useful.

some things (e.g. images) break

What do you mean by this?

A

Aug 28 '22 19:08 AA-Turner

I didn't dig too much into this, but directives like .. image:: seem unable to correctly find/display the right image. Duplicate labels might break too as suggested by the issue linked above.

Aug 30 '22 21:08 ezio-melotti

I tried an implementation in https://github.com/sphinx-doc/sphinx/pull/11746. For the parts that need coordination/back-merging into the main process, such as image handling and the search index builder, a builder callback function merge_builder_post_transform was added. It fixes the broken images and the search index for the HTML builder, and the hyperlinks for the CheckExternalLinksBuilder. @ezio-melotti If you want to try it out, just install Sphinx from my PR branch and add enable_parallel_post_transform = True to your conf.py. I'd be interested in your feedback.
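For reference, opting in would look roughly like this in conf.py. The option name is taken from the comment above and is an assumption tied to that PR branch; it is not known to exist in other Sphinx versions. The build still needs -j auto (or an explicit job count) on the command line to run in parallel at all.

```python
# conf.py (fragment)
# Assumption: this option is only recognised by the branch from PR #11746.
enable_parallel_post_transform = True
```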

Nov 15 '23 20:11 ubmarco

I tried your branch (from #11746) with the documentation of the python/cpython repo and it's 2-3 times faster: from 40s to 14s (!). This is with -j auto on a 16-core CPU. From a cursory inspection of the output, everything looks OK, including images (I haven't checked labels).

With another repo (the CPython Developer's guide), I saw no significant changes (2.9s with and without your changes), until I noticed that the Makefile didn't include -j auto. After adding it, it went from 1.4s (without your changes) to 0.9s (with your changes).

Let me know if you want me to perform any other test, and thanks for working on this!

Steps used to test:

$ cd Doc
$ make venv
$ time make html
...
real    0m39.435s
user    1m20.681s
sys     0m3.226s
$ make clean
rm -rf ./venv
rm -rf build/*
$ make venv
$ source venv/bin/activate
$ pip install -e ../sphinx  # with your branch checked out
$ deactivate
$ time make html
...
real    0m39.170s
user    1m21.934s
sys     0m3.098s
$ vim conf.py  # added enable_parallel_post_transform = True
$ rm -rf build/*
$ time make html
...
real    0m14.527s
user    1m29.697s
sys     0m3.619s

Nov 15 '23 23:11 ezio-melotti

I also get nearly 2x with the CPython docs on macOS M2 with 10 cores. Here, the last number is the waiting time for the user, so from ~45s to ~24s:

With unpinned Sphinx in requirements.txt:

make html  79.05s user 3.65s system 184% cpu 44.875 total

With the PR branch (git+https://github.com/useblocks/sphinx.git@mh-parallel-post-transform-parallel-write-doc-serial in requirements.txt) and enable_parallel_post_transform = True in conf.py:

make html  89.40s user 4.27s system 396% cpu 23.641 total

And with the PR branch with no enable_parallel_post_transform set:

make html  80.15s user 3.96s system 186% cpu 45.209 total
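The cpu column in these lines is (user + system) / total, so it shows how many cores were kept busy on average. A small sketch to recompute it and the wall-clock speedup from the three runs above; the function name cpu_percent and the dictionary keys are only illustrative.

```python
def cpu_percent(user: float, system: float, total: float) -> int:
    """Average CPU utilisation as a percentage of one core."""
    return round(100 * (user + system) / total)


# (user, system, total) in seconds, copied from the three runs above.
runs = {
    "unpinned Sphinx": (79.05, 3.65, 44.875),
    "PR branch, option enabled": (89.40, 4.27, 23.641),
    "PR branch, option not set": (80.15, 3.96, 45.209),
}

for label, (user, system, total) in runs.items():
    print(f"{label}: {cpu_percent(user, system, total)}% cpu")

baseline_total = runs["unpinned Sphinx"][2]
enabled_total = runs["PR branch, option enabled"][2]
speedup = baseline_total / enabled_total
print(f"speedup with the option enabled: {speedup:.2f}x")
```

Utilisation roughly doubles (about 1.8 cores to about 4 cores) while total CPU time grows only slightly, which matches the "nearly 2x" wall-clock improvement.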

Nov 16 '23 17:11 hugovk
