rules_itest icon indicating copy to clipboard operation
rules_itest copied to clipboard

`svcctl`'s `/v0/wait` endpoint can prematurely report "exec: WaitDelay expired before I/O complete" due to an overly aggressive `WaitDelay`

Open HongboDu-at opened this issue 7 months ago • 0 comments

Observed Behavior:

When a service managed by rules_itest's runner is stopped (e.g., via svcctl's /v0/kill endpoint) and subsequently waited upon (via svcctl's /v0/wait endpoint), the /v0/wait call can fail with an HTTP 500 error. The body of this error is exec: WaitDelay expired before I/O complete.

This issue arises even if the underlying service process has successfully exited.

Root Cause Analysis:

The core of the problem lies in the initializeServiceCmd function within runner/runner.go. Specifically, for instances where s.Type == "service", the exec.Cmd.WaitDelay is explicitly set to 1 (representing 1 nanosecond):

// runner/runner.go
func initializeServiceCmd(ctx context.Context, instance *ServiceInstance) error {
    // ...
    // Even if a child process exits, Wait will block until the I/O pipes are closed.
    // They may have been forwarded to an orphaned child, so we disable that behavior to unblock exit.
    if s.Type == "service" {
        cmd.WaitDelay = 1 // Set to 1 nanosecond
    }
    // ...
    instance.waitErrFn = sync.OnceValue(func() error {
        res := cmd.Wait() // This Wait() call is affected
        instance.SetDone()
        return res
    })
    // ...
}

The Go os/exec.Cmd.Wait() documentation explains that WaitDelay is the maximum time to wait for the command to exit and for its I/O pipes (stdout/stderr) to be closed after the process is killed or its context is canceled.

Setting WaitDelay to 1 nanosecond is extremely aggressive. After a service process exits, its I/O pipes need a small amount of time to be fully processed and closed from the parent process's perspective (the runner in this case, which might be wrapping stdout/stderr with its own loggers). If this pipe closure doesn't happen within the 1-nanosecond window provided by WaitDelay, cmd.Wait() returns the "exec: WaitDelay expired before I/O complete" error.

This error is then propagated by svcctl's handleWait function as an HTTP 500 status, leading to test failures or unexpected behavior in service management scripts that rely on a successful wait.

Impact:

This can cause tests that stop and wait for services to terminate to fail sporadically or consistently, depending on system load and the nature of the service being stopped. It makes the service shutdown and wait process unreliable.

Proposed Fix:

The cmd.WaitDelay = 1 setting in runner/runner.go should be adjusted to a more realistic value.

  1. Option A (Recommended): Use a Small, Reasonable Timeout: Change cmd.WaitDelay = 1 to a value like 50 * time.Millisecond or 100 * time.Millisecond. This provides a more practical window for I/O pipes to close while still addressing the original concern about orphaned pipes causing indefinite hangs.

    --- a/runner/runner.go
    +++ b/runner/runner.go
    @@ -250,7 +250,7 @@
     	// Even if a child process exits, Wait will block until the I/O pipes are closed.
     	// They may have been forwarded to an orphaned child, so we disable that behavior to unblock exit.
     	if s.Type == "service" {
    -		cmd.WaitDelay = 1
    +		cmd.WaitDelay = 100 * time.Millisecond // Or another reasonable duration
     	}
    
     	instance.cmd = cmd
    
  2. Option B (Alternative): Use Go's Default: Remove the cmd.WaitDelay = 1 line entirely. This would make cmd.Wait() use Go's default WaitDelay, which is 1 second if the field is 0. This is safer but might be too long if the original concern about indefinite hangs due to orphaned pipes is still highly prevalent.

    --- a/runner/runner.go
    +++ b/runner/runner.go
    @@ -250,9 +250,6 @@
     	// Even if a child process exits, Wait will block until the I/O pipes are closed.
     	// They may have been forwarded to an orphaned child, so we disable that behavior to unblock exit.
     	if s.Type == "service" {
    -		cmd.WaitDelay = 1
     	}
    
     	instance.cmd = cmd
    

Adjusting WaitDelay to a more practical duration should resolve the premature "WaitDelay expired" errors and make the service stopping and waiting mechanism more robust.

HongboDu-at avatar May 20 '25 03:05 HongboDu-at