build: Terminating a build process does not terminate child processes
build.zig:
const std = @import("std");

pub fn build(b: *std.Build) void {
    const build_test = b.addTest(.{ .root_source_file = .{ .path = "repro.zig" } });
    build_test.linkLibC();
    const run_test = b.addRunArtifact(build_test);
    b.step("test", "Run the test").dependOn(&run_test.step);
}
repro.zig:
const std = @import("std");
const builtin = @import("builtin");

extern "c" fn kill(pid: std.c.pid_t, sig: c_int) c_int;
extern "c" fn getppid() std.c.pid_t;

test "loop" {
    _ = kill(getppid(), std.c.SIG.TERM);
    while (true) asm volatile ("");
}
$ ps -o pid,ppid,cmd
PID PPID CMD
51734 51721 /bin/bash --posix
58662 51734 ps -o pid,ppid,cmd
$ zig build test
error: the following build command crashed:
./zig-cache/o/04a82e29edda016788b7663566160476/build zig . ./zig-cache ~/.cache/zig --seed 0x97dad68b test
$ ps -o pid,ppid,cmd
PID PPID CMD
51734 51721 /bin/bash --posix
58768 1 ./zig-cache/o/78b4ec0e583741aa4edd00a8b661577f/test --listen=-
58771 51734 ps -o pid,ppid,cmd
Terminating the build process should not leave orphaned child processes running.
This bug has seemingly resulted, over the period from Oct 15 to Dec 21, in 56 currently active orphan processes on the aarch64-macos CI, using up valuable CPU time and causing CI runs to slow down over time.
Quickfix: Check why SIGHUP is not sent to the child once the parent dies; the default disposition of SIGHUP is abnormal termination. Unless setpgrp is called, the child should have the same process group as the parent and thus be signaled with SIGHUP (on POSIX).
A proper solution is ~~not~~ fun and this is an unfixed POSIX design SHENANIGAN/flaw. Take a look at the wild zoo of hacks at https://stackoverflow.com/questions/284325/how-to-make-child-process-die-after-parent-exits/17589555#17589555.
Some more constraints:
- should handle setpgrp in the child process
- should likely also handle
  _ = kill(getppid(), std.c.SIG.KILL);
- no per-OS (Linux/Mac/Windows) solutions, or at least a plan for byos
- should not rely on SIGPIPE, as systemd/other init systems might disable it [this would be the most simple POSIX solution], see https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html
- should not rely on single-OS (complex) security extensions or complex implementation in general (unless it would be portable)
- should not rely on a specific pid 1 implementation to handle tracking processes to remain portable
- should not handle malicious behavior (sandboxing) to limit scope
This leaves as viable options:
- the open-pipe trick (EOF signals that the parent has died), but this becomes unviable in grandparent-to-grandchild situations if the parent of the grandchild dies
- polling whether the parent process becomes pid 1 (POSIX only)
- Windows has a dedicated "Job API", see https://stackoverflow.com/questions/53208/how-do-i-automatically-destroy-child-processes-in-windows, which looks clean enough to use as a Windows-only solution
- an async file descriptor to wait for the parent process to finish (non-blocking waitpid())
- Linux has a prctl call (PR_SET_PDEATHSIG), plus more complex APIs
- Mac relies on kqueue, see https://stackoverflow.com/a/6484903
I have no personal preference, except that the Mac kqueue solution looks a bit too complex for my taste and that the POSIX/Unix process API (+ signaling) is very badly designed; see also https://news.ycombinator.com/item?id=35264487.
To be clear, in this case you are not worried about arbitrary sub-processes forked by CI, right? Just Zig-spawned builder/tester processes that are all running Zig code (and are all written to be compatible with this goal)? (Handling the case where you want to kill child processes running arbitrary, non-Zig binaries seems much harder.) In that case the open-pipe trick seems reasonable to me. You would need to repeat it from parent to child, and then from child to grandchild (I don't think the grandchild should get its grandparent's pipe). It's not clear to me why you marked this one as "unviable".
Assuming the child processes are generally of the form .../test --listen, an alternative might be to give child processes a timeout. E.g., after some time with no new commands sent, the child gives up (perhaps with some verbose logging). Especially if the parent is robust enough to handle child processes disappearing anyway.
Yet another approach might be to just reboot the container/vm/environment hosting the CI runs once or twice a day to clean out the cruft.
So far I've only seen https://jmmv.dev/2019/11/wait-for-process-group-darwin.html on macOS, which feels like an ugly solution and still does not handle orphaned descendants after SIGKILLing the parents.
The only alternative known to me is to give each worker process a hard timeout that also SIGKILLs its descendants, which is not ideal for debugging. To have something left to debug, including handling those cases, one would instead need soft timeouts and a signal handler installed for the timeout signal, which crashes while safely tearing down descendant processes.
Please correct me if there is a more sane way to reliably handle process groups on macOS etc.