Restructuring multiprocessing
Is your feature request related to a problem? Please describe.
The current implementation of the multiprocessing architecture has some problems, especially with bigger projects (500 to 25,000 pages):
- The memory consumption can be extremely high and may need 32 GB or more
- The created processes do not run in parallel for part of the build, which may cost ~30% of build time
I would like to work on this, but need some feedback about concepts and ideas, as the needed changes will affect some parts of the architecture.
From some comments and older issues it looks like the multiprocessing part was mainly created to deal with IO waiting times and to make the writing more efficient. Parallelizing the computation itself does not seem to have been one of the main goals.
The reasons for the above problems are:
1. Processes are created too often
Sphinx calculates chunks based on the number of documents and the number of cores (`-j X`).
Each chunk gets its own process, which is started once a prior process has finished and been terminated (simply put).
The number of chunks is calculated in a way that, for bigger projects, you get many more chunks than you have configured via `-j`.
This also means that several new processes are "forked" over time, and each of them gets a copy of the growing ENV of the main process. Process creation costs time.
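To illustrate why the chunk count exceeds `-j`, here is a rough sketch of the chunking heuristic as I understand it from `sphinx.util.parallel.make_chunks`; the exact constants and code may differ between Sphinx versions, so treat this as an approximation:

```python
from math import sqrt

def make_chunks(arguments, nproc, maxbatch=10):
    nargs = len(arguments)
    chunksize = nargs // nproc
    if chunksize >= maxbatch:
        # big projects: prefer smaller batches, which leads to many chunks
        chunksize = int(sqrt(nargs / nproc * maxbatch))
    chunksize = max(chunksize, 1)
    nchunks = -(-nargs // chunksize)  # ceiling division
    return [arguments[i * chunksize:(i + 1) * chunksize] for i in range(nchunks)]

docs = [f"doc{i}" for i in range(5000)]
print(len(make_chunks(docs, 8)))  # 64 chunks for 5000 docs and -j 8, not 8
```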
Solution idea: The number of chunks should match `-j X`, so that we create long-running processes only once; a sketch of such a setup is shown below. (Drawback: the implemented log collection needs an update, as collecting the logs only when a process is done means the user sees no output for several minutes.)
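A minimal sketch of what I have in mind, assuming exactly one long-running process per `-j` slot and log output streamed back through a queue (all names here are hypothetical, not existing Sphinx APIs):

```python
import multiprocessing as mp

def worker(docnames, result_queue):
    for docname in docnames:
        # the actual per-document writing would happen here
        result_queue.put((docname, f"log output for {docname}"))
    result_queue.put(None)  # sentinel: this worker is done

def write_parallel(all_docs, nproc):
    queue = mp.Queue()
    chunks = [all_docs[i::nproc] for i in range(nproc)]  # exactly nproc chunks
    procs = [mp.Process(target=worker, args=(chunk, queue)) for chunk in chunks]
    for p in procs:
        p.start()
    finished = 0
    while finished < nproc:
        item = queue.get()
        if item is None:
            finished += 1
        else:
            docname, log = item
            print(log)  # forward logs to the user as they arrive
    for p in procs:
        p.join()

if __name__ == "__main__":
    write_parallel([f"doc{i}" for i in range(100)], nproc=4)
```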
2. Processes are "forked", not "spawned"
Each forked process gets a copy of the main process, which contains the complete ENV including all document information from already calculated docs.
This means that if the main process is using 4 GB of RAM and you are working on an 8-core system (`-j 8`), Sphinx will create 7 parallel processes and all of them get a copy of the 4 GB => 4 GB + 7 * 4 GB = 32 GB of free RAM needed.
The only workaround is to reduce the number of cores, e.g. `-j 4`, but that may cost you up to 50% of build performance.
Solution idea: Less process creation, and if possible "spawning" instead of "forking". "Spawning" does not make a memory copy of the main process, so data needs to be passed to the child processes via a pipe. This would be a huge conceptual change. I also found a PR where "forking" was added for Mac OS X support.
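A small sketch of the spawn-based direction, assuming data is passed explicitly to the children instead of being inherited via fork (hypothetical names, only to show the trade-off):

```python
import multiprocessing as mp

def write_chunk(docnames, settings):
    # with "spawn", the child only sees what was explicitly passed (pickled) to it
    return [f"{docnames[0]}..{docnames[-1]} written with {settings['builder']}"]

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # instead of the default "fork" on Linux
    settings = {"builder": "html"}  # data must be picklable
    chunks = [[f"doc{i}", f"doc{i + 1}"] for i in range(0, 8, 2)]
    with ctx.Pool(processes=4) as pool:
        results = pool.starmap(write_chunk, [(chunk, settings) for chunk in chunks])
    print(results)
```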
3. Serialized vs Parallel tasks
Before Sphinx starts the parallel tasks, it performs serial tasks in the main process. The serial calculation for a document often takes longer than the parallel calculation part for that document.
Also, a process only gets started after Sphinx has done all the serial calculations for a number of docs (a chunk). So if you have 7 chunks, the process for the last chunk is only started once the serial calculation for all of the other chunks in the main process is done, which may take 20 minutes in bigger projects. As a result the parallel processes are not started at the same time: some processes are running, some are waiting to be started, and some are already done with their tasks.
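A stripped-down sketch of that pattern (illustrative only, not the actual Sphinx code): the serial part for each chunk runs in the main process before that chunk's process is started, so the last process starts long after the first one.

```python
import multiprocessing as mp
import time

def serial_part(chunk):
    time.sleep(0.5)  # stands in for the serial work done in the main process
    return [f"resolved {doc}" for doc in chunk]

def parallel_part(resolved):
    time.sleep(0.1)  # stands in for the actual writing

if __name__ == "__main__":
    chunks = [[f"doc{i}"] for i in range(7)]
    procs = []
    for chunk in chunks:
        resolved = serial_part(chunk)  # serial, blocks the main process
        p = mp.Process(target=parallel_part, args=(resolved,))
        p.start()                      # the 7th process starts ~3s after the 1st
        procs.append(p)
    for p in procs:
        p.join()
```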
Solution idea: Recheck whether certain tasks can be moved into the parallel execution phase, to keep the amount of serialized build time as small as possible.
For a given `-j X`, Sphinx should create X parallel processes. Currently it creates X-1 to save one core for the main process, but the main process has nothing to do after starting the last process, so one core is left unused.
Additional context
I have already tested a solution for problem 1 by creating only as many chunks as configured via `-j X`.
This already reduces the memory consumption by ~20-30%, as the processes are forked at the beginning, when the ENV (and therefore the copied memory) is still quite small.
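The core of that tested change is just the partitioning; a minimal sketch (not the final patch) that splits the documents into exactly `nproc` chunks:

```python
def make_nproc_chunks(docnames, nproc):
    # split into exactly nproc chunks of (almost) equal size
    size, rest = divmod(len(docnames), nproc)
    chunks, start = [], 0
    for i in range(nproc):
        end = start + size + (1 if i < rest else 0)
        chunks.append(docnames[start:end])
        start = end
    return chunks

print([len(c) for c in make_nproc_chunks(list(range(5000)), 8)])  # 8 chunks, ~625 docs each
```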
My background
I'm a maintainer of some Sphinx extensions (like Sphinx-Needs) and provide support for some bigger company-internal Sphinx projects in the automotive industry. A lot of API docs get created and different extensions are used, so project size and complexity are often really huge. Scalability of Sphinx is a big topic, to keep the build time short for local builds on developer machines. Therefore I have already done a lot of analysis and enhancement work for Sphinx extensions and also created Sphinx-Performance for easier memory and runtime comparisons of different Sphinx setups.
Question
Does a core developer think that working on multiprocessing makes sense? I would like to get some thoughts before working on a PR that may not get merged later :)
And sorry for the long text, it is not an easy topic :)