zig icon indicating copy to clipboard operation
zig copied to clipboard

build: Terminating a build process does not terminate child processes

Open jacobly0 opened this issue 6 months ago • 2 comments

build.zig:

const std = @import("std");

pub fn build(b: *std.Build) void {
    const build_test = b.addTest(.{ .root_source_file = .{ .path = "repro.zig" } });
    build_test.linkLibC();

    const run_test = b.addRunArtifact(build_test);
    b.step("test", "Run the test").dependOn(&run_test.step);
}

repro.zig:

const std = @import("std");
const builtin = @import("builtin");

extern "c" fn kill(pid: std.c.pid_t, sig: c_int) c_int;
extern "c" fn getppid() std.c.pid_t;

test "loop" {
    _ = kill(getppid(), std.c.SIG.TERM);
    while (true) asm volatile ("");
}
$ ps -o pid,ppid,cmd
    PID    PPID CMD
  51734   51721 /bin/bash --posix
  58662   51734 ps -o pid,ppid,cmd
$ zig build test
error: the following build command crashed:
./zig-cache/o/04a82e29edda016788b7663566160476/build zig . ./zig-cache ~/.cache/zig --seed 0x97dad68b test
$ ps -o pid,ppid,cmd
    PID    PPID CMD
  51734   51721 /bin/bash --posix
  58768       1 ./zig-cache/o/78b4ec0e583741aa4edd00a8b661577f/test --listen=-
  58771   51734 ps -o pid,ppid,cmd

Terminating the build process should not leave orphaned child processes running.

This bug has seemingly resulted, over a period from Oct 15 to Dec 21, in 56 currently active orphan processes on the aarch64-macos CI using up valuable cpu time and causing CI runs to slow down over time.

jacobly0 avatar Dec 21 '23 23:12 jacobly0

Quickfix: Check why SIGHUP is not send to the child once the parent dies, which has as default behavior to abnormal termination. Unless sigpgrp is called, the process should have the same process group as the parent and thus be signaled with SIGHUP (on posix).

A proper solution is ~~not~~ fun and this is an unfixed posix design SHENNANIGAN/flaw. Take a look at the wild zoo of hacks here https://stackoverflow.com/questions/284325/how-to-make-child-process-die-after-parent-exits/17589555#17589555.

Some more constrains:

  • should handle sigprg in the child process
  • should likely also handle _ = kill(getppid(), std.c.SIG.KILL);
  • no Linux/Mac/Windows/per OS solution or at least plan for byos
  • should not rely on SIGPIPE as systemd/other init systems might disable those [would be most simple posix solution], see https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html
  • should not rely on single-OS (complex) security extensions or complex implementation in general (unless it would be portable)
  • should not rely on a specific pid 1 implementation to handle tracking processes to remain portable
  • should not handle malicious behavior (sandboxing) to limit scope

This leaves as viable ones:

  • open pipe trick (EOF signals parent has died) but this becomes unviable on grandparent to grandchild situations if the parent of the grandchild dies
  • polling if parent process becomes pid 1 (posix only)
  • Windows has a dedicated "Job API", see https://stackoverflow.com/questions/53208/how-do-i-automatically-destroy-child-processes-in-windows and looks clean enough to use as Windows-only solution
  • async file descriptor to wait for the parent process to finish (non-blocking waitpid())
  • Linux has a prctl call, but only for SIGHUP and more complex apis
  • Mac relies kqueue https://stackoverflow.com/a/6484903

I have no personal preference except that the Mac kqueue solution looks abit too complex for my taste and that the Posix/Unix process API (+ signaling) is very badly designed, see also https://news.ycombinator.com/item?id=35264487.

matu3ba avatar Dec 23 '23 00:12 matu3ba

To be clear, in this case you are not worried about arbitrary sub-processes forked by CI right? Just Zig-spawned builders/tester processes that are all running Zig code (and are all written to be compatible with this goal)? (Handling the case where you want to kill child processes running arbitrary, non-Zig binaries seems much harder). In that case the open-pipe trick seems reasonable to me. You would need to repeat it from parent to child, and then from child to grand-child (I don't think the grand-child should get its grand-parent's pipe). Its not clear to me why you marked this one as "unviable".

Assuming the child processes are generally of the form .../test --listen, an alternative might be to give child processes a timeout. E.g., after some time with no new commands sent, it gives up (perhaps with some verbose logging). Especially if the parent is robust enough to handle child processes disappearing anyway.

Yet another approach might be to just reboot the container/vm/environment hosting the CI runs once or twice a day to clean out the cruft.

rootbeer avatar Dec 23 '23 07:12 rootbeer

So far I've only seen https://jmmv.dev/2019/11/wait-for-process-group-darwin.html on macos, which feels like an ugly solution and still does not handle orphaned descendants from SIGKILLing the parents.

The to me only known alternative is to provide each worker process with a hard timeout, which SIGKILLs also its descendants, which sounds not perfect for debugging. As for having something to debug including handling those cases, one would need to use soft timeouts and handle those timeout signals via installing a signal handler (and crashing with tearing down descendant processes) safely.

Please correct me that there is a more sane way to reliably handle process groups on macos etc.

matu3ba avatar Jan 22 '24 23:01 matu3ba