Support and eventually upgrade to an alternate jobserver protocol standard that does not leak job tokens, in collaboration with Zig
Problem
The GNU make jobserver protocol leaks job tokens if a process dies while holding one. For example:
- Parent process creates the pipe and puts 2 tokens into the pipe, then spawns children A and B.
- Child Process A reads a token and starts doing work, but then crashes before putting the token back.
- Child Process B has 100 tasks to do, but only proceeds to do them using 1 logical core, leaving an entire core idle because its token has been "leaked" by Child Process A.
It could get even worse - if another leak occurred in this scenario, the entire process tree would deadlock.
The fundamental problem is that the protocol relies on a process manually releasing resources before it dies, rather than relying on kernel-managed resources that are automatically released when the process dies.
Proposed Solution
Let's work together on a standard protocol that does not have this flaw, and then pressure GNU make, Ninja, and other third party projects to support it in future versions.
Here is a fully specified proposal that has a proof-of-concept already implemented for POSIX and Windows:
https://codeberg.org/mlugg/robust-jobserver/src/branch/main/spec.md
The idea would be to deprecate the legacy GNU make jobserver protocol and try to standardize various open source projects on the better protocol. In the beginning, Cargo could merely support it when the parent process implements this new standard, and offer a flag for choosing it when Cargo is the root process in the tree. Later on, if the new standard achieves widespread adoption, the default could change.
Notes
If there is interest, I will follow up here with performance data points and other observations as I evaluate this strategy in the Zig project.
Happy hacking!
cc @NobodyXu you might be interested.
Interesting proposal but I have a few questions:
- how would you find a byte to lock?
- how would you block a process until a token is available?
- how would you implement this on Windows?
Sorry if I sound discouraging; I am just reading your proposal and thinking about how to implement it.
- how would you find a byte to lock?
Each process that wants to participate spawns a thread pool of size N where N is the byte size of the file. The threads are numbered sequentially from 0..N, each corresponding to a byte in the file.
- how would you block a process until a token is available?
Lock:
```c
struct flock options = {
    .l_type = F_WRLCK,       /* exclusive (write) lock */
    .l_whence = SEEK_SET,
    .l_start = thread_index, /* the byte owned by this worker thread */
    .l_len = 1,
};
fcntl(fd, F_SETLKW, &options); /* F_SETLKW blocks until the byte is available */
```
Unlock:
```c
struct flock options = {
    .l_type = F_UNLCK,
    .l_whence = SEEK_SET,
    .l_start = thread_index,
    .l_len = 1,
};
fcntl(fd, F_SETLKW, &options); /* releasing never blocks, so this returns immediately */
```
- how would you implement this on Windows?
Out of scope. A completely different strategy probably.
This scheme seems suspiciously quadratic in memory usage, with a high constant factor. From https://unix.stackexchange.com/questions/708104/what-is-the-minimum-amount-of-memory-required-for-starting-a-process-on-a-linux, each thread costs at least one 10 KB task_struct, plus at least 4 KiB of stack, so 128 jobs * 128 threads/job * ~16 KiB/thread = ~256 MiB. Even if memory is cheap, that's way too much to spend on jobserver synchronization.
Also, the kernel is probably not optimized for very large numbers of threads waiting to lock different bytes in the same file. I wouldn't be surprised if there's a linked list of all waiters on each file, because it would use less memory than a tree in normal usage.
Hmm, I'm not following your math. Is this example a computer that has 128 logical cores? In this case it would be 128 threads per process. So, 16 KiB/thread * 128 threads = 2 MiB per process. In a more typical case of 32 logical cores it would be 16 KiB/thread * 32 threads = 512 KiB per process.
Edit: I think I understand now - your 128 jobs number is the number of processes alive in the tree at the same time. In this case the math checks out.
I think I reach a different conclusion than you, however. I think 128 simultaneously active processes would realistically only occur on a system that has a high number of logical cores in the first place. A system with 128 logical cores will have a high amount of memory to match, making the 256 MiB number a tiny fraction of the whole. Meanwhile, a machine with fewer cores and less RAM will need less RAM and will probably spawn fewer simultaneous processes.
Real world examples, using your numbers for simultaneous process count:
- zig's 80-core aarch64 CI server (251 GiB RAM): 80 simultaneous processes * 80 threads/process * 16 KiB/thread = 100 MiB (0.04%)
- my AMD Ryzen 9 desktop (32 logical cores, 62 GiB RAM): 32 simultaneous processes * 32 threads/process * 16 KiB/thread = 16 MiB (0.03%)
- a computer with 4 cores and 2 GiB RAM: 4 simultaneous processes * 4 threads/process * 16 KiB/thread = 256 KiB (0.01%)
Either way, it's a good point to note: the status quo allows spawning threads lazily after receiving a token, whereas the proposed protocol requires a preallocated thread pool.
Your other point is important as well. I'll do some experiments and post the results.
Also, the kernel is probably not optimized for very large numbers of threads waiting to lock different bytes in the same file. I wouldn't be surprised if there's a linked list of all waiters on each file, because it would use less memory than a tree in normal usage.
It appears to me this hunch is correct. In the Linux source code, for example, in fs/locks.c, posix_lock_inode, I can see that it uses list_for_each_entry to iterate over locks, and based on a cursory glance it seems like that list is per-file.
It's disappointing. I haven't found any other primitives that might satisfy this use case. 24 million lines of C code in this source tree, 1.3 million commits, 23 years of heavy development, and yet we as users can't even batch work across a process tree correctly.
Alright, never mind, there was a solution right in front of my face. Here's an alternate scheme:
Root process listens on a unix domain socket. The address is passed down the process tree via environment variable. To get a thread token a child process connects to the socket. To return the token, disconnect. The root process only accepts N concurrent connections and stops accepting when it's maxed out.
When a child process dies, the OS causes the connection to end, allowing the root process to accept a new connection.
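For illustration, here is a minimal sketch of what the root-process side could look like under this scheme (socket path and token count supplied by the caller; error handling, cleanup, and shutdown omitted, so treat it as a sketch rather than a reference implementation):
```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hand out at most `num_tokens` tokens, where holding a token == holding an
 * accepted connection. A client that exits (or crashes) drops its connection,
 * which automatically returns the token. */
void serve_tokens(const char *path, int num_tokens) {
    int listener = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, num_tokens);

    struct pollfd fds[1 + num_tokens]; /* [0] = listener, [1..] = held tokens */
    fds[0] = (struct pollfd){ .fd = listener };
    int held = 0;

    for (;;) {
        /* Only accept new connections while tokens remain available. */
        fds[0].events = (held < num_tokens) ? POLLIN : 0;
        for (int i = 0; i < held; i++)
            fds[1 + i].events = 0; /* POLLHUP/POLLERR are always reported */
        poll(fds, 1 + held, -1);

        /* A disconnected client (clean exit or crash) returns its token. */
        for (int i = held - 1; i >= 0; i--) {
            if (fds[1 + i].revents & (POLLHUP | POLLERR)) {
                close(fds[1 + i].fd);
                fds[1 + i] = fds[held]; /* swap-remove the slot */
                held--;
            }
        }
        /* Grant a token by accepting a connection, up to the cap. */
        if (held < num_tokens && (fds[0].revents & POLLIN)) {
            int conn = accept(listener, NULL, NULL);
            if (conn >= 0)
                fds[1 + held++] = (struct pollfd){ .fd = conn };
        }
    }
}
```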
Yeah, the unix socket sounds plausible; it's also cross-platform and doable on Windows and other OSes.
It adds a bit more work to the "root process", whereas the make jobserver doesn't need a server, but I think this is a reasonable design; in practice you always need a "root process" to keep it alive anyway.
The socket connection can be either blocking or non-blocking, so it's usable in both sync and async code.
Compared to the jobserver, a client can't query the number of available tokens, though I guess that could be implemented as an extension by sending a message to the server.
Also, I think the address of the unix socket would need to be passed as an environment variable?
E.g. JOBSERVER=unix-socket:...
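For example, a client could locate the server with something like the following sketch (the JOBSERVER name, the unix-socket: prefix, and the example path are just the convention suggested above, not an established one):
```c
#include <stdlib.h>
#include <string.h>

/* Returns the socket path from e.g. JOBSERVER=unix-socket:/tmp/jobserver.sock,
 * or NULL if the variable is unset or uses an unrecognized scheme. */
const char *jobserver_socket_path(void) {
    const char *value = getenv("JOBSERVER");
    const char *prefix = "unix-socket:";
    if (value && strncmp(value, prefix, strlen(prefix)) == 0)
        return value + strlen(prefix);
    return NULL;
}
```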
Compared to the jobserver, a client can't query the number of available tokens, though I guess that could be implemented as an extension by sending a message to the server.
What's the actual use case for such a feature? I'm not claiming there isn't one, I just don't immediately see it. AFAICT, any process spawning jobs can just use a thread pool capped at the number of logical cores; perhaps there will be extra threads blocked at all times, but that shouldn't be a problem.
Also, I think the address of the unix socket would need to be passed as an environment variable? E.g.
JOBSERVER=unix-socket:...
Yep, agreed.
What's the actual use case for such a feature?
Hmmm, I can't think of one, so yeah, it probably isn't that important.
Another interesting aspect of the new jobserver protocol would be dealing with the implicit global token.
Suppose that you spawn a new task after getting a token from the jobserver protocol; that task has an implicit token for its main thread.
If that task goes on to create more threads, which could happen in multi-threaded rustc, or in a build script using cc which spawns compiler processes, then when spawning new threads it needs to take that implicit token into consideration.
Otherwise it may never be able to spawn a new thread/process, if the jobserver only allows one thread/process to execute at a time (a rare but possible configuration).
At the very least it could block for a long time. cc uses an atomic to record the implicit global token:
https://github.com/rust-lang/cc-rs/blob/76c377ae360ccd75a00d38b484c0b08e5a429462/src/parallel/job_token.rs#L102
I wonder if there's a better design that can take care of this scenario without having to handle it manually.
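As a rough sketch of that bookkeeping (the acquire_token()/release_token() calls below stand in for whatever the jobserver client exposes; this mirrors the idea behind the linked cc-rs code, not its actual API):
```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for the real jobserver client calls (hypothetical). */
bool acquire_token(void);
void release_token(void);

/* The process implicitly owns one token, granted by whoever spawned it;
 * that token is available for the first unit of work we start. */
static atomic_bool implicit_token_free = true;

/* Use the implicit token if it's unclaimed, otherwise ask the jobserver. */
bool acquire_any_token(bool *got_implicit) {
    bool expected = true;
    if (atomic_compare_exchange_strong(&implicit_token_free, &expected, false)) {
        *got_implicit = true;
        return true;
    }
    *got_implicit = false;
    return acquire_token();
}

void release_any_token(bool got_implicit) {
    if (got_implicit)
        atomic_store(&implicit_token_free, true);
    else
        release_token();
}
```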
Suppose that you spawn a new task after getting a token from the jobserver protocol; that task has an implicit token for its main thread.
When the parent spawns its child, could the child directly inherit the FD for the connection acquired by the parent? Then, the parent could close their FD after the fork, and the child now "owns" that socket connection, and it will be automatically released when the child exits. We still need logic for this (to ensure the child inherits the FD and the parent immediately closes it), but I think having it in child spawning is mildly better than requiring every child to handle this.
EDIT: oh, the child would also need to be able to release this FD when it wants to; that's kind of your point. The FD index could be passed in an env var.
Yeah I think that could work.
Having the fd in the child would also have another benefit:
The job token would be released whenever the child exits, instead of when the process that acquired the token has reaped the child or has itself exited.
The use of a unix socket does seem like a cleaner design compared to a (named) pipe.
When the parent spawns its child, could the child directly inherit the FD for the connection acquired by the parent?
That re-introduces the problem which make's new named-pipe jobserver protocol solves and which the old anonymous-pipe protocol suffered from: you are forced to initialize the jobserver client before any code runs that could open file descriptors. Otherwise you might misinterpret a locally opened file descriptor as an inherited one originating from the jobserver, which violates IO safety. This is also what necessitates jobserver::Client::from_env being unsafe.
Hmmm yeah that's true.
I haven't been able to actually get this scheme to work in practice, on Linux: https://gist.github.com/andrewrk/d8014959638ff67ff87a183b639304e2
connect() on an un-accepted connection does not block (?)
poll(POLLOUT) on a connected, un-accepted connection does not block (??)
@andrewrk wrote:
Alright, never mind, there was a solution right in front of my face. Here's an alternate scheme:
Root process listens on a unix domain socket. The address is passed down the process tree via environment variable. To get a thread token a child process connects to the socket. To return the token, disconnect. The root process only accepts N concurrent connections and stops accepting when it's maxed out.
When a child process dies, the OS causes the connection to end, allowing the root process to accept a new connection.
This seems like a great solution that's reasonably portable; it'd work on any UNIX, and it'd work on modern Windows.
@andrewrk wrote:
I haven't been able to actually get this scheme to work in practice, on Linux: https://gist.github.com/andrewrk/d8014959638ff67ff87a183b639304e2
connect() on an un-accepted connection does not block (?)
poll(POLLOUT) on a connected, un-accepted connection does not block (??)
It'd be an extra syscall, but one thing that should work portably on any UNIX system would be to connect and then read one byte, and have the jobserver always write a single byte to each socket after accepting it.
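A sketch of the client-side acquire under that variation (path discovery and error reporting omitted); the returned fd is the token, and closing it, or dying, releases it:
```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Acquire one token: connect, then block until the server's one-byte grant arrives. */
int acquire_token_fd(const char *socket_path) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, socket_path, sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    char grant;
    if (read(fd, &grant, 1) != 1) { /* blocks until the server accepts us and writes */
        close(fd);
        return -1;
    }
    return fd;
}
```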
If that turns out to be the only way to get this to work, then another possible protocol, which might be faster for multi-job processes but marginally slower for single-job processes, would be for clients of this protocol to first write a byte to the connected socket for each job it wants to start, and then read a byte per job. This would allow connecting a socket once and then requesting multiple jobs from it, rather than connecting a socket per job. For a multi-job process, that reduces the overhead to a write and read per job (both of which may be possible to batch), and it doesn't require the overhead of updating the fd table repeatedly. But for a single-job process, this would mean one more syscall (a write) on top of what's now already a socket/connect/read.
It's unfortunate that this protocol already requires 3 syscalls to start each job, and another to close it (if not closing it via process exit). This isn't substantially worse than the 2+1 of the current jobserver protocol (which requires open and read and a subsequent write) or the 1+1 of the old one (inherit fd, do a read and subsequent write), but it's not ideal. However, I don't see any obvious way to do substantially better, even without taking portability into account, while preserving the property of automatically releasing on close.
I am observing that when one thread blocks on connect() and another thread then calls shutdown() on the same sockfd, the first thread continues to block on the connect syscall: https://gist.github.com/andrewrk/bf46cff35d34692558563f75bc8d0e62
Any ideas on how to solve this? The problem is on deinitialization; the main thread needs a way to wake up the worker threads that are blocking on connect().
@andrewrk Based on the earlier experiment, I had thought connect never blocked in the first place, which is why we need the read.
Investigating further, it looks like connect will eventually block, depending on the backlog size specified in listen. (Looking at the zig PR thread, it looks like you found something similar.)
For instance, if you have a backlog of 2, and your listener accepts only one connection and loops forever (never accepting other connections), the first connect will succeed (being accepted), the next 3 (?!) will successfully connect and block in read, and after that, additional connect attempts will block.
Likewise, with a backlog of 1, the first connection succeeds (being accepted), the next 2 (?!) will successfully connect and block in read, and after that, additional connect attempts will block.
This behavior seems weird enough that I'd be shocked if it doesn't vary by platform, so I don't think counting on it is a good idea. I'm not sure it's viable to count on setting the backlog to a large number (in any case you shouldn't set it to more than SOMAXCONN), nor should you count on any particular backlog behavior from the jobserver (since many implementations are unlikely to change the default). I think you're going to have to allow for either connect or read blocking.
From what I can tell, the behavior of non-blocking connect is also both finicky and non-standard here. On Linux at least, non-blocking UNIX sockets just fail connect with EAGAIN rather than returning EINPROGRESS, so I don't think you can poll for completion of connect. This seems like a substantial deficiency. I haven't found any reasonable way to do a non-blocking connect to a UNIX socket on Linux; there are various unreasonable ways that require jumping through substantial hoops (e.g. requiring io_uring, or spawning a thread/process).
You could use the sockopt SO_SNDTIMEO, which is standardized by POSIX, but I'm not sure all platforms apply that timeout to connect and not just send/write. In any case, this would cause you to have to repeatedly wake up when the timeout expires and re-invoke connect, which seems wasteful, and it'd incur a delay before you can kill worker threads. That might be acceptable, though, since you expect most blocks to occur on read (for which poll works) rather than connect. But in any case, you'd have to verify that it applies to connect on all targets.
I don't think you can portably send a signal to a specific thread (rather than a process), leaving aside all the other reasons not to want to use signals. pthread_cancel similarly seems like a bad idea.
I expect the behavior of closing the fd (or race-free equivalent like dup2ing /dev/null over the fd) would be finicky and non-standard. From reading the Linux code, I think that if connect has reached the point of waiting to connect, it's just going to keep waiting and not care about its socket having been closed.
You could move the blocking from the worker threads themselves to a separate thread that handles making the connections, which would allow you to terminate the worker threads (since they could wait on something like a channel), but that just moves the problem to the connection thread.
If you want something to be able to block in either connect or read, but you also want to be able to bail out and end the worker pool by some means other than your process ending, I don't have anything else obvious to suggest; non-blocking connect doesn't work on UNIX sockets, and blocking connect can't easily be terminated.
@andrewrk A correction: it looks like signals might be viable after all. https://www.man7.org/linux/man-pages/man3/pthread_kill.3.html exists. You could install a signal handler, invoke pthread_kill, the handler will run on that thread, the handler can set a flag, and after the handler returns, connect should return with EINTR and the code that invoked connect can check the flag.
It'd be mildly painful for something general-purpose like Zig's worker pool; you'd probably want to let the user pass in a signal number that it's OK for you to use (with a reasonable default), but it should work.
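A sketch of that mechanism, using SIGUSR1 as a placeholder for whatever signal the caller allows; here the main thread sets the flag and then pokes the worker, whose blocking connect() returns EINTR:
```c
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/socket.h>

static atomic_bool shutting_down = false;

static void wakeup_handler(int sig) {
    (void)sig; /* exists only so blocking syscalls get interrupted with EINTR */
}

static void install_wakeup_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = wakeup_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* deliberately no SA_RESTART, so connect() is not restarted */
    sigaction(SIGUSR1, &sa, NULL);
}

/* Worker side: retry on EINTR unless we were told to shut down. */
static int interruptible_connect(int fd, const struct sockaddr *addr, socklen_t len) {
    for (;;) {
        if (connect(fd, addr, len) == 0)
            return 0;
        if (errno != EINTR || atomic_load(&shutting_down))
            return -1;
    }
}

/* Main thread side: wake a worker that may be blocked in connect(). */
static void wake_worker(pthread_t worker) {
    atomic_store(&shutting_down, true);
    pthread_kill(worker, SIGUSR1);
}
```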
Update: I'm still interested in this but have put the project on ice for the time being. If y'all end up looking into it and figuring out something that works well, Zig will follow Cargo's lead. For now though, I was unable to come up with something satisfactory with the available primitives.
I spent some time looking at this yesterday, and I came up with a few schemes which seem hopeful.
System V semaphores
What we're actually trying to implement here is just a counting semaphore with process-termination safety. It turns out that an old System V semaphore primitive (not to be confused with POSIX semaphores!) solves this perfectly. When you increment or decrement the value of one of these semaphores (with semop), you can specify the flag SEM_UNDO to signal to the kernel that this modification should be reverted when the process exits.
So in our case, we would specify SEM_UNDO on both the decrement (acquiring a token) and the increment (releasing a token), and they neatly cancel out so that everything Just Works---when this process exits it's as if it never modified the semaphore. Note that this isn't doing something inefficient in the kernel like storing a list of all these modifications; it's just incrementing/decrementing a per-process "adjustment" value which will get added into the semaphore upon process termination.
The calls in the server would look like this:
```c
#include <sys/ipc.h>
#include <sys/sem.h>

/* Despite the name `IPC_PRIVATE`, the return value here is a System V IPC handle
 * (not an FD!) which can be safely passed straight to child processes. */
const int sem_id = semget(IPC_PRIVATE, 1, 0777);
semctl(sem_id, 0, SETVAL, num_tokens);
/* Then pass `sem_id` to children somehow (probably via an env var). */
```
And in a worker:
```c
/* Somehow acquire `sem_id` as exposed by the server (e.g. read an env var) */
const int sem_id = ...;

/* To acquire a token: */
{
    struct sembuf op_buf = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
    semop(sem_id, &op_buf, 1);
}

do_some_work(); /* we have a token, use it */

/* To release a token: */
{
    struct sembuf op_buf = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };
    semop(sem_id, &op_buf, 1);
}
```
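One cleanup detail worth adding (my note, not part of the snippet above): a System V semaphore set persists in the kernel until it is explicitly removed, so the server should delete it once the whole build finishes, roughly:
```c
/* In the server, once all work is done and no children remain: */
semctl(sem_id, 0, IPC_RMID);
```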
I've tested this concept with a simple implementation in Zig, and it does seem to work nicely. At least on Linux, each semop is a single syscall, so acquiring and releasing a token are one syscall each. The kernel implementation (here) looks totally reasonable, so I have no performance concerns.
It seems to me that where available, this is by far the best option we've seen yet. The only slight concern I have is support for this primitive on different platforms; POSIX semaphores (which do not have the process termination safety properties we need) have largely replaced System V semaphores in common usage, so it's possible some implementations don't support the latter properly or have buggy implementations. But perhaps I'm being too pessimistic, and they still work fine on many OSes; this needs investigating.
AF_UNIX + SOCK_DGRAM + SCM_RIGHTS + pipe2
I came up with this idea first, before realising that System V semaphores were viable. The scheme I'm about to describe seems clearly worse than the one above, so it's probably only worth thinking about if System V semaphores turn out to be problematic.
The idea is to model each token as a pipe, sending the write-end to a client through a Unix domain socket, and using the fact that the server can detect when the pipe is broken (e.g. as a POLLERR from a poll call).
Server logic:
- Open an `AF_UNIX` `SOCK_DGRAM` socket somewhere (presumably we pass the path down to children via an env var).
  - The reason we're using a datagram socket is purely to optimize syscalls: this will allow clients to receive data without needing to `connect`.
- For `0..num_tokens`, call `pipe2` to get pairs of FDs.
- Using `sendmsg` with `SCM_RIGHTS`, send the write-end of each pipe to the socket.
- Close the write-ends of all pipes.
- `poll` on the read-ends of all pipes.
- Clients release tokens by closing the write-end of a pipe. This causes the server to see `POLLERR` from the corresponding read-end, because the write-end of the pipe is no longer open in any process.
  - When the server sees this, it must create a new pipe for the token with `pipe2` and send that pipe's write end back over the socket to let another client work.

Corresponding client logic:
- Determine the path to the server's socket (probably from an env var).
- To acquire a token, call `recvmsg` on the socket. You will get a datagram with an attached FD; this represents your token.
- To release a token, close the attached FD. (Of course, if the process terminates unexpectedly, the FD will automatically close.)
On the client end, this is pretty simple: 1 syscall to acquire a token, and 1 syscall to release it. The server isn't looking so great, though; per acquire-release pair, the server does `pipe2`, `sendmsg` (transferring the write end of the pipe), `close` (on the write end of the pipe), `poll` (to wait for release), and `close` (on the read end of the broken pipe). That's pretty bad. I suppose it would at least be fairly easy to use batching mechanisms like io_uring on the server here. Like I said, this seems pretty terrible compared to the System V semaphore scheme.
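For concreteness, a sketch of the client-side acquire (receiving the FD via recvmsg/SCM_RIGHTS; error handling omitted, and how the client obtains `sock_fd` is left out since it depends on the details above):
```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Receive one token: a datagram whose ancillary data carries a pipe write-end FD.
 * Returns that FD; close it (or exit) to release the token. */
int recv_token_fd(int sock_fd) {
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl.buf,
        .msg_controllen = sizeof(ctrl.buf),
    };
    if (recvmsg(sock_fd, &msg, 0) < 0)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    int token_fd;
    memcpy(&token_fd, CMSG_DATA(cmsg), sizeof(token_fd));
    return token_fd;
}
```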
Based on the System V semaphore idea above, and some collaboration with other Zig team members on a design for Windows, I've put together a concrete proposed protocol which I've called the "Robust Jobserver". You can find it here: https://codeberg.org/mlugg/robust-jobserver/src/branch/main/spec.md
I'd be interested to hear if anyone has thoughts or concerns regarding the specification as written. If not, I intend to work on supporting it in the Zig compiler and build system at some point soon.
Regarding the Windows implementation using a named pipe: how do we know the exact name for the jobserver?
Do we have to guess?
I don't completely understand the question (sounds like something may have been lost in translation?). To be clear: the server picks any pipe name it likes, and communicates it to clients through the ROBUST_JOBSERVER environment variable. If that variable's value is win32pipe:foo, then the name of the pipe is \\.\foo.
But there could be multiple clients, and on Windows you'd have one named pipe per token, so for each client, what is the name of the named pipe?
It can't be a fixed name, right?
Windows has a concept called named pipe "instances", which is basically a bunch of pipes with the same name. What Windows calls a named pipe is actually much closer to what us POSIX-minded folk would consider a local socket: it's (potentially) bidirectional, and supports arbitrarily many connections (through this "instances" concept), with a "server" accepting connections from "clients". So, yes, all clients connect to the same pipe name. For more information you can read Microsoft's documentation here: https://learn.microsoft.com/en-us/windows/win32/ipc/named-pipes
For the record, we already (before I wrote up the spec) successfully wrote and tested PoC implementations for both of the listed communication methods. The win32pipe method has been tested on Windows, and the sysvsem method has been tested on Linux and macOS.
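For readers unfamiliar with the API, connecting to one instance of a named pipe from the client side looks roughly like the following (this only illustrates the Win32 mechanism, not necessarily the exact handshake the spec uses):
```c
#include <windows.h>

/* Connect to a free instance of the named pipe, waiting if all instances are busy. */
HANDLE connect_to_pipe(const char *pipe_name) {
    for (;;) {
        HANDLE h = CreateFileA(pipe_name, GENERIC_READ | GENERIC_WRITE,
                               0, NULL, OPEN_EXISTING, 0, NULL);
        if (h != INVALID_HANDLE_VALUE)
            return h; /* connected to a free instance */
        if (GetLastError() != ERROR_PIPE_BUSY)
            return INVALID_HANDLE_VALUE; /* some other error */
        /* All instances are currently taken: wait for one, then retry. */
        WaitNamedPipeA(pipe_name, NMPWAIT_WAIT_FOREVER);
    }
}
```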
Thanks for the explanation!
I'm not really familiar with Windows, so I was confused by this; it'd be great if the doc explained that for people not familiar with Windows.
Just curious: is it possible to avoid having one thread for each token on Windows? Is it possible to have something "async"?
Yes; one thread per token is easiest to explain, which is why the spec describes it, but our PoC implementation used just one server thread with asynchronous pipe operations and Asynchronous Procedure Calls.
That looks pretty nice; looking forward to a Rust implementation in rust-lang/jobserver-rs.