std.Thread.Pool: process tree cooperation
Problem statement:
- Parent process creates a `std.Thread.Pool` of size `number_of_logical_cores`.
- Child process creates a `std.Thread.Pool` of size `number_of_logical_cores`.
- Now there are `2 * number_of_logical_cores` active threads, causing cache thrashing, which harms performance.
On POSIX systems, std.Thread.Pool should integrate by default with the POSIX jobserver protocol, meaning that if the jobserver environment variable is detected, the pool should coordinate through it so that the entire process tree shares a single budget of concurrently running threads.
On macOS, maybe we should use libdispatch instead. It's bundled with libSystem so it's always available and accomplishes the same goal.
I'm not sure what to do on Windows.
This is primarily a standard library enhancement, however the build system and compiler both make heavy use of std.Thread.Pool so they would observe behavior changes.
In particular, running zig build as a child process from make would start cooperating and not create too many threads. Similarly, running the zig compiler from the zig build system would do the same. The reverse is also true: running make as a child process from the zig build system would start cooperating. And then there are other third party tools that have standardized on the POSIX jobserver protocol, such as cargo.
There is one concern however, which is that the protocol leaves room for "thread leaks" to occur if child processes crash. I'm not sure the best way to mitigate this. The problem happens when a child process has obtained a thread token from the pipe, and then crashes before writing the token back to the pipe. In that case the thread pool permanently has one fewer thread active than before, which is suboptimal, and would cause a deadlock if it happened a number of times exceeding the thread pool size.
Related:
- #12101
Alternative POSIX strategy based on advisory record locks:
Root process opens a new file with shm_open and writes N bytes to the file, where N is the thread pool size. This file descriptor is passed down the process tree. Each byte in the file represents an available logical core. Each thread in each process's thread pool holds an advisory record lock (`fcntl(F_SETLKW)`) on the corresponding byte while working.
Advisory record locks are automatically released by the kernel when a process dies.
This unfortunately would mean that Zig could not play nicely with other applications such as make and cargo. But I think it's worth it. The whole point of the Zig project is to improve the status quo; otherwise we could all just keep using C. The make jobserver protocol is fundamentally broken, so Zig will step up and create a better protocol that has similarly low overhead but not this glaring problem. Then it will be up to make and cargo to upgrade to the better standard.
As for the strategy I outlined above, I have not evaluated its efficacy yet. I'll report back with some performance stats when I do.
Alternative POSIX strategy based on UNIX domain sockets:
Root process listens on a unix domain socket. That fd is passed down the process tree. To get a thread token a child process connects to the socket. To return the token, disconnect. The root process only accepts N concurrent connections and stops accepting when it's maxed out.
When a child process dies, the OS causes the connection to end, allowing the root process to accept a new connection.
Two upsides compared to the advisory record lock idea:
- Operating systems are likely not optimized for very large numbers of threads waiting to lock different bytes in the same file, which that strategy depends on.
- The advisory record lock strategy requires each process to have a thread pool, while the make jobserver protocol allows for lazy thread spawning. This strategy allows for lazy thread spawning as well.
Alternative pipes proposal (also incompatible with jobserver protocol):
This protocol overcomes the issue of the make jobserver protocol where it's impossible for the server to tell when a child is taking a job (i.e., reading one byte). We overcome this by having the child notify the server before taking a job, so the server knows it needs to write one byte for a waiting child on its private pipe.
pros:
- portability
- reliably detect when child dies
cons:
- filesystem activity for named-pipes has to be managed carefully
- more complex than unix domain socket alternative
PIPE OVERVIEW:
- 1 global pipe shared between the server and all children; this is the mux-pipe, where the server is the reader and the children are writers
- 1 unique named pipe between server and client; that is, each server-client pair has a private pipe, opened in a coordinated fashion by client and server, where the server is the writer and the child is the reader
IPC setup
- server opens `tmpdir`, keeping the handle open and allowing inheritance
- server creates the `mux-pipe`; the parent is the reader, the child(ren) are writers
- child creates the `job-pipe`, a named fifo, using the inherited `tmpdir` and a randomized name, maybe a pattern like "[RANDOM].[PID]" where the random part is the important part and the pid is just for semi-reliable knowledge
- `job-pipe` is write for the parent, read for the child
- the `job-pipe` name/basename is the child "key", which:
  - is unique amongst all children in the tree
  - is used as the basename path component of the named fifo
  - is used as the first parameter for `mux-pipe` messaging
job flow
- child: send `mux-pipe` message `{ child, checkout }`, where the first param is the child-key
- server: recv `mux-pipe` message `{ child, checkout }`; open (and remember the handle of) the named fifo for that child-key
- server: write 1 byte to the `job-pipe`, granting the job; the child will not proceed until this is done
- child: read the 1 byte on the `job-pipe` before starting the job
- child: do the job
- child: send `mux-pipe` message `{ child, return }`
- server: recv `mux-pipe` message `{ child, return }`
- server: poll the `job-pipe` for other-end closure, and free outstanding checkouts that haven't been returned
cleanup
- child unlinks the named pipe when done; a race is ok
- server unlinks the named pipe on other-end pipe closure; a race is ok
IPC and POSIX jobserver interaction for coordinating threaded work seem inefficient. Could it all be avoided by making zig commands (i.e. build-exe, test, etc.) normal functions accepting string args, then running them as tasks in a global Thread.Pool that they (+ build_runner.zig/test_runner.zig) can reach into?
After some discussion and design work with Cargo folks and the rest of the Zig team, I've put together a proposed protocol called the "Robust Jobserver" which aims to solve the "crashing child process" problem of the GNU Jobserver. As written, it supports Windows and most(?) POSIX systems.
Draft: https://codeberg.org/mlugg/robust-jobserver/src/branch/main/spec.md
I'd be interested to hear if anyone has thoughts or concerns regarding that specification. Keep in mind that the goal is that this system be useful not only to the Zig project, but also to other compilers, build systems, etc (really, anything which needs to coordinate CPU-bound work across a process tree). If there aren't any serious concerns, I intend to work on supporting it in the Zig compiler and build system at some point soon.