mold concurrency issues with ninja
Mold works great when compiling a single target binary, but when we do a complete repo-wide build (ninja with no target), things start to grind to a halt toward the end of the build when concurrent linking is occurring. It seems like it might be best to turn off concurrent linking when using ninja, as ninja is already trying to use all available cores for its own parallelism. I have tried disabling mold concurrency with -Wl,--no-threads, but I don't think that is correct. Do you have any guidance for ninja+mold?
Thank you for raising an important point. mold is highly parallelized, and that performance characteristic is different from what an ordinary build system expects. Eventually we need to improve ninja so that it adjusts its expectations about the linker's CPU usage.
For now, did you try ninja's -l N option? That option tells ninja not to spawn new processes if the load average is above N.
For the record, I'll leave some random thoughts here on what the problem is and what we should do:
- ninja tries to spawn N processes if the machine has N cores, so that subprocesses don't have to compete with each other for CPU resources. If a subprocess is not single-threaded but itself spawns N threads, that assumption becomes wrong. However, competing just for CPU shouldn't hurt throughput -- ignoring the small overhead of task switching, the total amount of time the build system needs from start to finish should remain the same.
- If subprocesses compete for memory, that's a real problem. They will try to kick each other out of RAM, which greatly reduces throughput. So there's no point in spawning more subprocesses once CPU usage is saturated.
- I believe ninja's scheduling algorithm currently assumes that each subprocess is single-threaded. We need to tune it for multi-threaded linkers.
- We can also add a new command line option to mold so that it waits at the beginning of the main function until the load average drops below N. But that doesn't feel like the right thing to do, as this is more of a build system problem than a linker problem.
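For illustration, that load-average gate could look roughly like the sketch below. This is purely hypothetical (mold has no such option); `wait_for_load_below` is an invented name, and `getloadavg` is the BSD/glibc call for reading the load average.

```cpp
#include <cstdlib>  // getloadavg (glibc/BSD)
#include <unistd.h> // sleep

// Hypothetical gate: block at startup until the 1-minute load
// average drops below `limit`, then let the link proceed.
static void wait_for_load_below(double limit) {
  for (;;) {
    double avg[1];
    // Proceed if the load average can't be read, or it is low enough.
    if (getloadavg(avg, 1) != 1 || avg[0] < limit)
      return;
    sleep(1); // re-check once per second
  }
}
```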
This might be of interest: https://ninja-build.org/manual.html#ref_pool
The usage example there shows how to limit the number of parallel jobs when linking, because linkers themselves are already parallel (and memory-hungry).
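For concreteness, a pool in a build.ninja file looks something like this (adapted from the linked manual section; `link_pool` and the rule name are illustrative):

```
pool link_pool
  depth = 4

rule link
  command = $cxx $in -o $out
  pool = link_pool
```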
CMake already supports assigning link commands to a pool: https://cmake.org/cmake/help/latest/prop_tgt/JOB_POOL_LINK.html and https://cmake.org/cmake/help/latest/prop_gbl/JOB_POOLS.html
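Wired together, that might look like the snippet below (the pool name `link_jobs` and the depth of 2 are illustrative; `JOB_POOLS` and `CMAKE_JOB_POOL_LINK` are the documented knobs):

```cmake
# Allow at most 2 link jobs to run concurrently under the Ninja generator.
set_property(GLOBAL PROPERTY JOB_POOLS link_jobs=2)
set(CMAKE_JOB_POOL_LINK link_jobs) # initializes JOB_POOL_LINK for all targets
```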
A fairly common thing to do is to support the make jobserver protocol. It is simply a semaphore implemented with a unix pipe:
https://www.gnu.org/software/make/manual/html_node/Job-Slots.html
Unfortunately, ninja does not support this, but other tools, such as Rust's cargo, do (and GNU make, of course).
There was a PR to implement the jobserver protocol in ninja, but it has not been merged. See https://github.com/ninja-build/ninja/pull/1140 and https://github.com/ninja-build/ninja/issues/1139
It's also worth mentioning that gcc/ld support the make jobserver for LTO.
I think one of the problems with the jobserver is that a process that requests a new job slot can block indefinitely until the request is satisfied. Requesting a job slot is just writing a byte to a fifo, so if mold wants to reserve 16 jobs for itself, for example, it blocks until 16 bytes are fully written to the fifo. I think that will leave lots of resources unused during compilation.
Minor note: IIRC, requesting a job is reading a byte; you return the job by writing it back again.
I guess the idea would be to adjust your thread pool dynamically when the jobserver allows you to get a token, not to wait for all the tokens to be available at once.
@glandium Can you do that? I mean, is there any way to know how many bytes you can write to a fifo without writing to it?
@rui314 Unless -j is more than PIPE_BUF, I don't see how writing can block. Besides, to get a job you read, so you can use select or poll to continuously request job slots and return them with write when you are done. I guess to do that you still need to make the read fd non-blocking.
https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html#POSIX-Jobserver
@andrewchambers I think you are right (and I confused writing with reading again and again...)
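With the direction settled above (read a byte to acquire a slot, write it back to release), a minimal sketch of the classic pipe-fd protocol might look like this. The fd numbers come from MAKEFLAGS (e.g. --jobserver-auth=3,4); parsing and full error handling are omitted:

```cpp
#include <unistd.h>

// Acquire a job slot: blocks until another job returns a token.
// A cooperative tool can instead set O_NONBLOCK on the read fd and
// poll(), as suggested above, so it never stalls waiting for tokens.
char acquire_token(int read_fd) {
  char token;
  while (read(read_fd, &token, 1) != 1) {
    // retry; a real implementation should check errno for EINTR vs. failure
  }
  return token;
}

// Release the slot by writing the same byte back to the pipe.
void release_token(int write_fd, char token) {
  while (write(write_fd, &token, 1) != 1) {
  }
}
```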
Let me think about whether we should support the jobserver protocol or not. It needs to be considered as part of the big picture, and I'm not confident the jobserver can solve this entirely, as CPU is not the only resource that limits compilation speed; memory is sometimes the stricter constraint.
@rui314 I wonder: does mold need all threads to start at the same time because they communicate with each other? Otherwise, I guess each thread could wait for a token from the make jobserver pipe separately.
In my experience, the only thing that has worked really well for us was back when we used the make jobserver for everything in our big build. We have a top-level Makefile orchestrating the whole build/packaging for different platforms etc. Then we introduced CMake and ninja, which do not cooperate with make. Since then we have had parallel ninja processes running on the same machine, called from the top-level Makefile. We try to share the CPU threads between the ninja jobs and not overcommit too much, but it does not work that well (both loading too little and too much, depending on ccache performance for a particular commit/platform). However, the LTO linking jobs actually help even things out, because ld.bfd creates a temporary makefile that makes the LTO linking obey the jobserver properly.
Also, in my experience the make load-average flag is superior to the ninja load-average flag. Make tries to estimate the load average when handling jobs, but with ninja the CPU load graph looked like a comb with spikes, as there is a delay in calculating the load average. With ccache the compilation jobs can be very quick, and I guess the kernel cannot keep up. This may have changed nowadays, I don't know.
We might be able to add threads to the thread pool when a new slot becomes available via the jobserver pipe, but I don't think that would make things a lot better. Let's say we have four linker processes. The worst scenario is that the four processes run simultaneously and each takes 4x longer to finish than it would have on an idle machine. While they run, they consume 4x the memory, which is the biggest problem.
So, what if we change the linker so that each process gradually adds more threads from the thread pool? Since the work is, after all, CPU-constrained, the total execution time wouldn't change much (besides the overhead of frequent context switches), and the total memory consumption would be the same too.
What we really want is to run fewer linker processes simultaneously. If the linker itself scales to many cores, there's not much point in running several at once in the first place. If we run the linkers one by one, it still takes 4x the time in total, but peak memory usage stays flat instead of 4x.
So, as you can see, the problem is that the make jobserver's concept doesn't fit mold very well. The jobserver assumes that each process is single-threaded and optimizes CPU usage within that problem space; mold breaks that assumption. That's the fundamental problem, I think.
@rui314 If I understand you correctly I agree in most cases. But I think our scenario is slightly different.
We have no control over how many linker processes are running at the same time, as we have several ninja instances for different platforms running simultaneously, orchestrated from a parent Makefile. They also share the CPU with a number of packaging scripts, test run scripts, 3pp builds, and other things. In these scripts we try to use Makefiles to parallelize wherever it is needed, and that works very well for keeping CPU usage under control as long as everything uses make. Sometimes we also have multiple of these builds on the same machine, which is where the load-average flag comes into play.
To make ninja pools work properly, I guess you will always need a single ninja instance? This can be quite hard to accomplish for bigger build systems with a lot of moving parts. In our case we can have call chains like "Makefile -> ninja * 5 -> test run script -> Makefile -> test suite * 40" or "Makefile -> ninja * 7 -> g++ -> ld.bfd -> generated Makefile -> linker thread * 30", for example. The make jobserver really helps keep CPU (and indirectly memory) under control during the different build phases.
My point is, unless I have misunderstood ninja pools, the make jobserver is vastly superior to pools for our use case. It just works with recursive makefiles mixed with scripts. There are ninja forks with make jobserver support, and we will probably want to go there; people are forking ninja just to get this feature.
The method of dynamically allocating threads when you can read a token could fit the make model. In practice, I guess that is how it works in the bfd/gold case, with the generated temporary LTO makefiles that will not start a new LTO-linking "thread-process" unless make can get a token first. In both cases the token is held until that linking thread is done, I guess.
Invoking ninja from make seems more like an anti-pattern to me.
Agreed. Ninja files are usually generated by CMake (or some other build file generator), so you may be able to generate Makefiles and use make throughout your project.
I agree that make calling ninja can be seen as an anti-pattern, and it would be nice not to have to tune threading for ninja. But in real life it is often not that simple to change...
When migrating to CMake we started using Makefiles but we had to switch to ninja as this would almost halve the build time.
The reason is a combination of ccache and how CMake-generated Makefiles would call the compiler twice: first to get dependencies and then to compile. This would cause a sysload as high as 4/5 of the CPU due to the excessive file handling when you have a good ccache hit rate (mostly only preprocessing). The generated ninja files would only call the compiler once and get the dependencies as a side effect. (It is possible that CMake can generate smarter (GNU) Makefiles nowadays, but since then we have started to use the nice ninja dependency tree for queries and such, and we are kind of stuck.)
There are other scenarios where we don't have control over the build system. For example calling 3pp build systems and legacy parts. It would be a lot of work trying to squeeze everything into a single CMake-generated ninja file and then tune pools. We haven't really had any big RAM issues so we haven't had any reason to restrict linking, but we have 384G on our build machines.
I'm just trying to argue that the make jobserver is, for us, the de facto standard that makes it possible to maintain a big, highly parallelized build system with lots of moving parts, some of which are out of our control. The make jobserver is one of the first things I look for in new tools that use multithreading. For a linker, I also look at whether it supports GCC LTO.
For our scenario it looks like mold would fit very nicely! We rely heavily on ccache, so linking is most often the build bottleneck. But this is also the reason why we would not want to restrict linking too much: when ccache has a good hit rate, we want to use the whole machine for linking. For example, during a build of the test platform we link ~1500 shared libraries and ~2200 executables (excluding 3pps), so linking performance is very important :)
I would like to point out that GNU make has recently added a promising named-pipe jobserver mode. Please see the following ninja issue comment for more: https://github.com/ninja-build/ninja/issues/1139#issuecomment-1223785608.
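For reference, that mode is advertised through MAKEFLAGS as --jobserver-auth=fifo:PATH (GNU make 4.4+). A rough detection sketch, with simplified parsing and no real error handling:

```cpp
#include <fcntl.h>
#include <cstdlib>
#include <cstring>
#include <string>

// Returns an fd for the jobserver fifo, or -1 if none is advertised.
int open_jobserver_fifo() {
  const char *flags = std::getenv("MAKEFLAGS");
  if (!flags)
    return -1;
  const char *p = std::strstr(flags, "--jobserver-auth=fifo:");
  if (!p)
    return -1;
  std::string path(p + std::strlen("--jobserver-auth=fifo:"));
  path = path.substr(0, path.find(' ')); // the flag value ends at a space
  // O_NONBLOCK lets the linker poll for tokens instead of stalling.
  return open(path.c_str(), O_RDWR | O_NONBLOCK);
}
```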
I've just implemented an experimental patch, but it has an unexpected implication: mold can run GCC in LTO mode, which then can't read any jobserver tokens, and we end up with a serial phase. That is, mold needs to actively return tokens while waiting for LTO :/
@marxin: I'm not sure I follow. Do you mean that jobserver tokens will be allocated during phases when mold does not need/use the threads?
Quickly looking at the patch, it looks like it grabs as many threads as possible from the job token server. I guess this approach can starve other processes in the build. I would hope that you could cap the threading somehow, preferably also by reading "-j" and "-l" from $MAKEFLAGS.
But to mimic make, I guess each started thread should grab a token with a blocking read from the token pipe, in practice hanging until a token is available. It should then put the token back when the thread exits. (This is without knowing how threading works in mold; I guess it could lock everything up if threads don't start until they get a token.) It could be even more fine-grained if a thread reads a token when it actually needs to do some processing and writes it back when done. But maybe the threading in mold is not some kind of message loop, so this model may not fit at all.
> @marxin: I'm not sure I follow. Do you mean that jobserver tokens will be allocated during phases when mold does not need/use the threads?
Yes, at least in my prototype implementation.
> Quickly looking at the patch, it looks like it grabs as many threads as possible from the job token server. I guess this approach can starve other processes in the build. I would hope that you could cap the threading somehow, preferably also by reading "-j" and "-l" from $MAKEFLAGS.
Yes, my implementation is naive and uses a greedy approach :/ Note that -j from MAKEFLAGS would also be set to e.g. 16, which is a maximum value.
> But to mimic make, I guess each started thread should grab a token with a blocking read from the token pipe, in practice hanging until a token is available.
It depends. It's likely better not to wait: start processing in a single thread, and ask for another token once there's an opportunity for more parallelism.
> It should then put the token back when the thread exits. (This is without knowing how threading works in mold; I guess it could lock everything up if threads don't start until they get a token.) It could be even more fine-grained if a thread reads a token when it actually needs to do some processing and writes it back when done. But maybe the threading in mold is not some kind of message loop, so this model may not fit at all.
Yep, it should likely return tokens when they are not needed. But I think the token manipulation should be integrated into the underlying TBB library, which knows more about thread spawning and can dynamically grow or shrink the pool during a mold run.
Anyway, it's not as trivial as I thought.
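As a starting point, one could at least cap TBB's parallelism at the number of tokens held for the duration of one link. A minimal sketch, assuming `acquired_tokens` came from jobserver reads as sketched earlier (the truly dynamic grow/shrink @marxin describes is the hard part and is not shown):

```cpp
#include <tbb/global_control.h>

// Cap oneTBB worker parallelism at the jobserver budget for one link.
// The +1 accounts for the implicit slot a child process already owns
// under the make jobserver protocol.
void link_with_token_budget(int acquired_tokens) {
  tbb::global_control limit(
      tbb::global_control::max_allowed_parallelism,
      acquired_tokens + 1);
  // ... run the parallel link passes while `limit` is in scope ...
}
```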
If you plan to take this greedy approach, you could probably write a helper program that simply takes jobserver slots and then executes another command (mold in this case) with the right flag for however many slots it was able to grab.
@andrewchambers: I agree, such a program could be useful in itself. However, in order to make it really useful for efficiently limiting total load during a build while not starving other jobs, a more dynamic approach is needed. But that could prove to be a bit too complex, and maybe it does not fit the mold threading model at all.
(We had an attempt to use ninja pools for linking for another case where we only have one ninja instance. It was hard to tune and was always slower than the "free threading" approach. The linking jobs vary a lot in time/size so it feels clunky to tune as well.)
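A rough sketch of such a wrapper, under several assumptions: `open_jobserver_fifo` is the helper from the earlier sketch, the fifo fd is non-blocking, and `--thread-count` is the flag used to tell mold how many threads to run (check your mold version's options before relying on that):

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <string>
#include <vector>

int open_jobserver_fifo(); // from the earlier MAKEFLAGS sketch

int main(int argc, char **argv) {
  if (argc < 2)
    return 2; // usage: wrapper <command> [args...]
  int fd = open_jobserver_fifo();
  int tokens = 0;
  char byte = '+';
  // Greedily drain whatever tokens are immediately available;
  // the non-blocking read stops at EAGAIN when the pipe is empty.
  while (fd >= 0 && read(fd, &byte, 1) == 1)
    tokens++;

  // Run e.g. "mold --thread-count=N ..." with N = tokens + our own slot.
  std::string threads = "--thread-count=" + std::to_string(tokens + 1);
  std::vector<char *> args;
  for (int i = 1; i < argc; i++)
    args.push_back(argv[i]);
  args.push_back(const_cast<char *>(threads.c_str()));
  args.push_back(nullptr);

  pid_t pid = fork();
  if (pid == 0) {
    execvp(args[0], args.data());
    _exit(127); // exec failed
  }
  int status = 0;
  waitpid(pid, &status, 0);
  for (; tokens > 0; tokens--) // give the slots back to make
    write(fd, &byte, 1);
  return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}
```

Invoked as, say, `wrapper mold -o foo foo.o`, each link would then size its thread pool to whatever slots happened to be free, at the cost of the starvation problem discussed above.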
I'm glad to see someone else has already suggested the GNU jobserver here. We run into this problem in Gentoo because our package manager is source-based. To get a predictable level of CPU usage, users generally want to configure the number of simultaneous CPU-intensive jobs that can be run at any one time, regardless of the nature of those jobs (compile, link, whatever). Without coordination, you wind up in a situation where every tool wants to fire off N jobs, where N is something like the number of CPUs on the system. But if one of those N jobs launches another N jobs that themselves launch N jobs...
The GNU jobserver is the closest thing to a standard way to coordinate between the different processes, and hopefully ninja will finally adopt it soon.
I'm thinking of something easier to implement and understand. That is, what if we just allow only one instance of mold at a time? There's not much point in running multiple mold instances on a single machine simultaneously. So, if mold is invoked while another mold instance is running, we can just make it wait until the first process finishes.
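One conceivable mechanism for that (purely illustrative; mold does not do this) is an advisory lock on a well-known path. Note that a /tmp lock file only serializes processes that share a filesystem, which is exactly the kind of objection raised in the next reply:

```cpp
#include <fcntl.h>
#include <sys/file.h>

// Hypothetical single-instance gate: block until no other mold in this
// mount namespace holds the lock.
void wait_for_exclusive_slot() {
  int fd = open("/tmp/.mold-lock", O_CREAT | O_RDWR, 0600);
  if (fd >= 0)
    flock(fd, LOCK_EX); // released automatically when the process exits
}
```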
How do you suppose that is to work when containers and multiple users are involved? That is, what mechanism will be used to measure "uniqueness"? And why is mold vs. mold the only concern? Are there no other tools that look at nproc to determine their task count? I think interacting with a jobserver that can handle -l (load average) flags, and more than one process threatening an NxN problem, is more robust than mold trying to solve problems in a little sandbox while ignoring the state of the rest of the system.
> I'm thinking of something easier to implement and understand. That is, what if we just allow only one instance of mold at a time? There's not much point in running multiple mold instances on a single machine simultaneously. So, if mold is invoked while another mold instance is running, we can just make it wait until the first process finishes.
I consider this more a matter of predictability than anything else, and even one instance of mold is unpredictable at the moment.
If I have 8 cores, I might (for example) configure my package manager to launch at most two simultaneous jobs via MAKEOPTS, and then expect to be able to run two instances of the package manager without using more than half of my CPUs. That will be more or less true so long as the compiler, linker, and other build tools are single-threaded (or use the jobserver). But if I use ld.mold instead of ld.bfd, both package manager instances will eventually try to link something and mold will use up all of the available CPUs. Having one mold process stall while the other runs would do nothing to change that. What we'd really like is for each mold process to use either one or two threads, depending on how many other jobs are running.
That may not be the optimal use of my resources, but it matches most closely my intent when I set e.g. MAKEOPTS="-j2" before starting a build.
I can of course force mold to use only one thread, but that's overly pessimistic: if at any point linking is all that's happening, I want mold to be able to use as many threads as there are job slots available.
Note that there are also things that mold doesn't (and cannot) know. For example, maybe there's one really expensive link that holds up some large part of the build graph. If it stalls waiting for, say, dozens of test programs to link, you're really hampering the wall-clock time of the build. I think any mold-specific solution would also need to consider starvation and unfairness; even then, it cannot know which link is "more important" to finish first for the task at hand. The jobserver should also be much less contentious, since any job slots mold does get won't be "fighting" with other work the build tool thinks the machine still has room for.
Keep in mind that there's no point in running multiple mold instances simultaneously unless your machine has a really large number of cores, like 64 or 128. If you share 16 cores between two mold instances, they will take 8 cores each, and each of them will take 2x longer than it would have with 16 cores. The peak memory usage is 2x because you are running two processes simultaneously; if you run the two processes serially instead, you halve the peak memory usage. In general, multi-threaded, scalable programs don't need to be invoked in parallel with others to shorten overall latency.