sphinx icon indicating copy to clipboard operation
sphinx copied to clipboard

Parallel post transform including write_doc_serialized

Open ubmarco opened this issue 1 year ago • 20 comments

Subject: Enable parallel post transformations including doctree-resolved and write_doc_serialized.

This is an extended version of PR https://github.com/sphinx-doc/sphinx/pull/11733. It has basically the same purpose. Sorry for opening a second PR, however I need to run the CI and also want to ask for maintainers' opinion on the work. Maybe there is a show stopper that I do not see yet.

Feature or Bugfix

  • Feature

Purpose

The current parallel writing approach is quite slow for compute intensive extensions in doctree-resolved. The event is part of the post transformation call chain. This PR shifts that workload to the parallel subprocesses. To correctly deal with updated doctrees in post-transform, also the step write_doc_serialized runs in parallel.

Detail

For extensions like Sphinx-Needs this speeds up the build time by at least a factor of 2 for big projects. The new implementation also leads to more processes running in parallel as the serial post transformation of chunks in the main process blocks their parallel execution, so a bottleneck gets resolved. All processes defined by --jobs (minus 1, for the main process) now start at once when writing parallel.

The main difference between https://github.com/sphinx-doc/sphinx/pull/11733 and this PR is how write_doc_serialized is called in case all extensions support parallel post transforming. Here is a summary for master branch, the other PR and this PR:

  • master

    1. main process: post-transform including doctree-resolved
    2. main process: write_doc_serialized
    3. sub processes: write_doc
  • https://github.com/sphinx-doc/sphinx/pull/11733

    1. main process: write_doc_serialized
    2. sub process: post-transform including doctree-resolved
    3. sub process: write_doc

    After my investigation, I now think this approach is broken as it changes the logic. write_doc_serialized does 2 things for the html builder: image handling and building the search index. Both need information generated in post-transform.

  • this PR

    1. sub process: post-transform including doctree-resolved
    2. sub process: write_doc_serialized
    3. sub processes: write_doc
    4. sub process: return relevant information from write_doc_serialized
    5. main process: merge back relevant information from write_doc_serialized

    I think this is the correct approach. The function name write_doc_serialized is not optimal anymore, as it may run in parallel. On the long run I think Sphinx should change the contract with extensions so all of them must support parallel reading and writing.

All Sphinx test cases are green on my local machine. When activating parallel builds for all test cases, only 2 fail:

FAILED tests/test_domain_std.py::test_productionlist - AssertionError: assert 1 == 2
FAILED tests/test_ext_inheritance_diagram.py::test_inheritance_diagram - KeyError: 'basic_diagram'

But they also fail on master, so I think the problems existed before. I also ran all test cases of a new Sphinx-Needs branch that supports the latest Sphinx version against this PR. All are successful, also when run in parallel mode.

Performance In terms of performance, this PR improves build times for bigger Sphinx projects that have expensive doctree-resolved functions significantly by at least a factor of 2. For projects that have cheap doctree-resolved functions, the performance maybe a bit lower than master branch, due to back-merging of artifacts of write_doc_serialized. I measured a runtime increase of 2% for a 1000 pages project.

Memory I haven't done a larger analysis besides my local XFCE task manager. Due to the fact that more processes run in parallel at a certain point in time, the memory consumption with the same core count can be higher than for the master branch.

Risk evaluation The new parallel post-transformation is activated using a new global configuration variable enable_parallel_post_transform which is False by default. The feature is also marked as experimental. Only projects/builds that know all extensions support parallel post transformation for all extensions should activate it. As the behavior for enable_parallel_post_transform=False is not changed, the risk of this PR is low.

Conclusion As a conclusion, I think this PR is a candidate for merging as it behaves correctly in terms of application logic while https://github.com/sphinx-doc/sphinx/pull/11733 should be closed unmerged after confirmation of my assessment.

Any feedback is very welcome!

ubmarco avatar Nov 03 '23 22:11 ubmarco

Quick comment (still busy I am):

  • I would prefer to have this feature disabled by default and possibly enabled in Sphinx 9.x (have at least 1 major version for upgrading).
  • With the feature disabled by default, the PR could be smaller. And there would be less risks (I clearly don't want to risk breaking projects with the latest Sphinx version).

I have no time to contribute to Sphinx until February so I can only comment things here and there.

I think it's a good approach though. However we need to carefully review what happens and mark the feature as experimental imo.

picnixz avatar Nov 07 '23 08:11 picnixz

I like the PR and also understand the concerns of @picnixz.

For me, Sphinx should get a configuration variable, like parallel_post_transform. If this is True the parallel execution of this PR gets activated. By default it is currently False, with the plan to set it to True with version 9.x

With Sphinx 8.x maybe a Deprecation warning can be shown, that with version 9.x an extension needs to provide parallel_post_transform_safe. If this is not the case, it will be set to True for the specific version (from 9.x on.)

But I would keep the already provided changes for the internal extensions, as @ubmarco has already tested all of them and he knows that setting parallel_post_transform_safe to True is safe for them. Please keep in mind that 54 of the 61 files just got this line as change 'parallel_post_transform_safe': True,. So the PR touches 2-3 files with more complex stuff only.

danwos avatar Nov 07 '23 12:11 danwos

For me, Sphinx should get a configuration variable, like parallel_post_transform. If this is True the parallel execution of this PR gets activated. By default it is currently False, with the plan to set it to True with version 9.x

That's a good idea.

has already tested all of them and he knows that setting parallel_post_transform_safe to True is safe for them.

While tests show indeed that the built-in extensions are safe, I don't think we have a 100% coverage so there might be corner cases where we could be surprised. Don't get me wrong here, I think it's fine if we enable this for the default extensions. However, I cannot assure you that we won't break any existing projects that we are not aware of (especially those that automatically update to the latest version of their dependencies).

With the configuration value, we would also reduce the lines that this PR touches (although it's not much).

picnixz avatar Nov 07 '23 14:11 picnixz

Relevant issues: https://github.com/sphinx-doc/sphinx/issues/10779 and https://github.com/sphinx-doc/sphinx/issues/11448

ubmarco avatar Nov 14 '23 13:11 ubmarco

@picnixz I refactored the PR as discussed.

  • There is a new global configuration parameter enable_parallel_post_transform which is False by default
  • The touched files went down
  • Docs were added with the new feature marked as experimental
  • CI is still happy

The only thing missing is a test case from my point of view.

ubmarco avatar Nov 16 '23 11:11 ubmarco

Please ensure that this does not make race-conditions worse - see issue #2946

bmwiedemann avatar Dec 10 '23 14:12 bmwiedemann

#2946 is for me not fulfilling the definition of "race -conditions" because Sphinx gets called twice from external, with the same config of doctree-location. Even MS Word and others have the same problem if the tools get called twice for the same file. Instance "A" is normally not informed by instance "B" about a documentation change. And "A" would overwrite the results from "B", in most cases. For me, the ticket asks more for a database behavior, so to support concurrent data-changes. Nothing that Sphinx needs to provide (imho). Also, the workaround is simple, without losing performance: Let the first builder use -j auto to get the build done on 100% CPU usage. Then invoke the second builder, which will reuse the doctress from the first run and nearly calculates the new output format only.

Internal race-conditions should not be a thing in Sphinx, as long as the code dealing with one file is called serial. What is still the case here. Only merging data back from concurrent doctree-handlers to the main ENV object is something you need to care about and avoid "race-conditions", but this is also handled by Sphinx and some extra small voodoo in this PR.

However, in my understanding, this PR does not make it worse, as it is already. :)

danwos avatar Dec 10 '23 19:12 danwos

There are also issues with sphinx-build -j auto (e.g. in qemu, kernel-docs)

bmwiedemann avatar Dec 10 '23 19:12 bmwiedemann

I refactored the PR as discussed.

Thanks for that. Now I'm only available for a few days before going offline until mid February. Today I cleaned up what I needed to do but there are a lot of ongoing issues that we need to address (those related to test failures for instance).

Also, on the organisational side we are doing something internally so we won't be doing anything until everything is arranged and done (probably early January). As such, you'll need to wait again (sorry for that).

picnixz avatar Dec 24 '23 19:12 picnixz

@ubmarco I don't seem to be able to push to this branch, would you be able to check please?

https://github.com/AA-Turner/sphinx/tree/mh-parallel-post-transform-parallel-write-doc-serial

A

AA-Turner avatar Jan 14 '24 05:01 AA-Turner

@ubmarco I don't seem to be able to push to this branch, would you be able to check please?

https://github.com/AA-Turner/sphinx/tree/mh-parallel-post-transform-parallel-write-doc-serial

A

I think the problem is that I created the fork on our organization and not on my user. According to the docs allowing edits for maintainers only works for forks in personal accounts. Will keep that in mind for the future. I rebased my branch on current master (prefer this for cleaner history) and added your style commit.

ubmarco avatar Jan 15 '24 14:01 ubmarco

I've added a new commit (https://github.com/AA-Turner/sphinx/commit/14899132a113be09c166acde7440a1ef4c69888a) in https://github.com/AA-Turner/sphinx/tree/mh-parallel-post-transform-parallel-write-doc-serial.

I'm hesitant about adding new public API (merge_builder_post_transform / post_transform_merge_attr), as we'll need to support it for the long term. But I don't have any proposals on an alternate design at the moment, so it might be the best option.

A

AA-Turner avatar Jan 18 '24 04:01 AA-Turner

I integrated your Refactoring commit.

ubmarco avatar Jan 19 '24 16:01 ubmarco

I've tried the sphinx version of this PR in large projects (3000 - 30.000 rst files), and it really speeds up things. I'd be happy to see this in the next sphinx release!

arwedus avatar Mar 05 '24 08:03 arwedus

@arwedus Could you also give me the runtimes when

  • this PR + #11939 are enabled,
  • this PR is enabled,
  • #11939 is enabled,
  • none of the two PRs are enabled?

I don't have a huge project in production to test against and thus I'd like some feedback from projects as yours.

picnixz avatar Mar 05 '24 17:03 picnixz

Oh my, these latest changes are not supposed to be pushed to this PR, it's for my local testing. Sorry, will revert them.

ubmarco avatar Mar 15 '24 11:03 ubmarco

I reverted the unintended changes of my last commit and rebased the PR on latest master.

ubmarco avatar Mar 15 '24 16:03 ubmarco

@arwedus

Unfortunately, I can't test this right now, because if I install sphinx with

pip install git+https://github.com/useblocks/sphinx.git@69a77c7

I get this error by sphinx:

sphinx.errors.SphinxWarning: Failed to convert typing.Any to a set or tuple

I've tried with verbose but the log is not super useful:

  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/sphinx/config.py", line 514, in _validate_valid_types
    valid_types = tuple(valid_types)
  File "/usr/lib/python3.10/typing.py", line 312, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3.10/typing.py", line 403, in __getitem__
    return self._getitem(self, parameters)
  File "/usr/lib/python3.10/typing.py", line 425, in Any
    raise TypeError(f"{self} is not subscriptable")
TypeError: typing.Any is not subscriptable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/sphinx/cmd/build.py", line 337, in build_main
    app = Sphinx(args.sourcedir, args.confdir, args.outputdir,
  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/sphinx/application.py", line 229, in __init__
    self.setup_extension(extension)
  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/sphinx/application.py", line 402, in setup_extension
    self.registry.load_extension(self, extname)
  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/sphinx/registry.py", line 466, in load_extension
    metadata = setup(app)
  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/myst_parser/__init__.py", line 12, in setup
    setup_sphinx(app, load_parser=True)
  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/myst_parser/sphinx_ext/main.py", line 59, in setup_sphinx
    app.add_config_value(f"myst_{name}", default, "env", types=Any)
  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/sphinx/application.py", line 541, in add_config_value
    self.config.add(name, default, rebuild, types)
  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/sphinx/config.py", line 434, in add
    valid_types = _validate_valid_types(types)
  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/sphinx/config.py", line 516, in _validate_valid_types
    logger.warning(__('Failed to convert %r to a set or tuple'), valid_types)
  File "/usr/lib/python3.10/logging/__init__.py", line 1847, in warning
    self.log(WARNING, msg, *args, **kwargs)
  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/sphinx/util/logging.py", line 131, in log
    super().log(level, msg, *args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1879, in log
    self.logger.log(level, msg, *args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1547, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 964, in handle
    rv = self.filter(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 821, in filter
    result = f.filter(record)
  File "/home/.../.sphinx-venv/lib/python3.10/site-packages/sphinx/util/logging.py", line 432, in filter
    raise exc
sphinx.errors.SphinxWarning: Failed to convert typing.Any to a set or tuple

@chrisjsewell might be interested in this, because it seems to be caused by myst-parser 2.0. Breathe is also an extension I had to deactivate. But if I deactivate myst-parser, I don't have a very large project to test any more 😆

arwedus avatar Mar 23 '24 21:03 arwedus

@picnixz : With my patches to both myst-parser and breathe, I managed to run the build on my rather big and complicated sphinx project.

It stands at 721 source files, contains lots of sphinx-needs, intersphinx links, plantUML, dynamic functions, sphinx-autoapi and a dozen other extensions.

Build times for clean build (deleted build director):

sphinx 7.3, parallel_post_transform = true, python 3.10:

real 4m53,530s user 21m12,846s sys 1m15,931s

sphinx 7.2.6, python 3.10:

real 7m54,836s user 14m34,006s sys 1m20,964s

I tested on a beefy machine, where any improvement in parallelism should bring down the wall time: 36-core Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz, 128 GB RAM.

Consecutive runs gave similar results. The build with sphinx 7.2.6 threw some unexplainable errors, so the numbers for user time might be a bit off. The results clearly show improved parallelism. I wonder about the increase in total CPU time ("user"), might be due to issues with the sphinx 7.2.6 build.

arwedus avatar Mar 24 '24 21:03 arwedus

@arwedus Thank you very much for the timings. If possible, could you also time your project without parallel_post_transform (i.e., HEAD vs 7.2.6 release) for an incremental build time (an perhaps give me the numbers for a fresh build as well). I would like to know the direct consequences of https://github.com/sphinx-doc/sphinx/pull/11939 (if possible, could you report them on https://github.com/sphinx-doc/sphinx/pull/11939 instead of this PR?)

picnixz avatar Mar 25 '24 08:03 picnixz