command-group "blocking waitpid returned pid=0" from `AsyncGroupChild::wait`

Seems to be a cancel safety issue, this happens when i wait() with a timeout and then wait() again:

This reproducer:

use command_group::AsyncCommandGroup;
use tokio::process::Command;

#[tokio::main]
async fn main() {
    let mut group = Command::new("sh")
        .arg("-c")
        .arg("sleep 30; echo done!")
        .group_spawn()
        .unwrap();
    println!("spawned");

    match tokio::time::timeout(std::time::Duration::from_secs(1), group.wait()).await {
        Ok(res) => {
            println!("command exited or waiting failed: {res:?}");
        }
        Err(_) => {
            println!("command took too long");
        }
    }

    group.kill().unwrap();
    println!("killed");
    group.wait().await.unwrap();
    println!("finished waiting");
}

Prints this on my machine:

spawned
command took too long
killed
thread 'main' panicked at src/main.rs:26:24:
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: "blocking waitpid returned pid=0" }

Oct 27 '23 00:10 9999years

Sure, to conform to the Tokio API this future should be fused. Under unix you're not really supposed to do that (call wait twice) though.

(Also love your "people i'm not" website page btw)

Oct 27 '23 01:10 passcod

Under unix you're not really supposed to do that (call wait twice) though.

Ah, that's the context I was missing. I read the waitpid(2) man page but I didn't get that.

I hoped I could work around this by calling wait(), storing the future, then calling kill() and waiting on the wait() future, but wait() is a mutable borrow so that doesn't work.

Oct 27 '23 18:10 9999years

This seems to work, although I had to write the killpg call myself because the borrow checker complained about calling AsyncCommandGroup::kill when I had the AsyncCommandGroup::wait future still pending.

use command_group::AsyncCommandGroup;
use nix::sys::signal;
use nix::sys::signal::Signal;
use nix::unistd::Pid;
use tokio::process::Command;

#[tokio::main]
async fn main() {
    let mut group = Command::new("sh")
        .arg("-c")
        .arg("sleep 30; echo done!")
        .group_spawn()
        .unwrap();
    println!("spawned");
    let pgid = Pid::from_raw(group.id().unwrap() as i32);

    let mut wait = std::pin::pin!(group.wait());
    match tokio::time::timeout(std::time::Duration::from_secs(1), &mut wait).await {
        Ok(res) => {
            println!("command exited or waiting failed: {res:?}");
        }
        Err(_) => {
            println!("command took too long");
        }
    }

    signal::killpg(pgid, Signal::SIGKILL).unwrap();
    println!("killed");
    wait.await.unwrap();
    println!("finished waiting");
}

Oct 27 '23 19:10 9999years

I'd be happy with a PR that made this cancel-safe, for the record. Had a look earlier but it wasn't obvious.

Oct 30 '23 09:10 passcod

Right, I had a look at this again and I don't think the issue is really cancel safety. What happens is that when you call wait (or waitpid), that tells the kernel that it can clean up the resources of the process once it's exited. We can't cancel that call, so when you cancel the .wait() in application code all that does is drop the thread that has called wait. If the process stops anyway in between, the kernel will clean it up, and calling wait again will error (at best; if the PID is recycled you might call the second wait on the wrong process).

Indeed the wait() in command-group is fused already: calling it again after it went all the way to process completion will return the same ExitStatus.

I believe Tokio's 'trick' is that it also listens for SIGCHLD so that, in the background, it can handle if a process exits while nothing was "actively" wait()ing it.

Nov 04 '23 10:11 passcod