
Race condition in subprocess handling

Open mtharp opened this issue 7 months ago • 1 comments

Describe the bug

v0.16.0 introduced a subprocess reaper via #2043. It appears to have a race condition where the reaper can wait() on a process spawned via os/exec.Cmd.Run before Run can wait on it and get its return code.

The symptom is waitid: no child processes errors and spurious 404 Not Found responses, as seen in #2048. The fix attempted there only suppresses the error: the return code of the child process is still lost, so the proxy doesn't know whether the command succeeded.

athens' Dockerfile specifies tini as the entrypoint to the container. This should adequately care for wayward grandchildren: they will be re-parented to and waited on by PID 1 (i.e. tini), which is there for just that purpose. I was previously running v0.11.0 for several years, and in that time I don't think I saw any zombie processes with tini in place. I suspect that users who are having zombie trouble are inadvertently running athens without tini.

Theoretically it's possible for athens to internalize some means of reaping grandchild processes, but this would be redundant with tini and it seems challenging to implement it in-process without interfering with os/exec. In my opinion, it makes the most sense to revert #2048 and #2043 and let tini handle reaping duties, as it was before v0.16.0.

Error Message

Typical case in which the reaper gets to a process before exec.Cmd can:

Jun 03 02:38:36 goproxy.example.com athens-proxy[2289053]: INFO[6:38AM]: reaped child process 1449344, exit status: 1
Jun 03 02:38:36 goproxy.example.com athens-proxy[2289053]: INFO[6:38AM]: wait: no child processes: go: module google.golang.org/protobuf/runtime/protoiface: reading https://proxy.golang.org/google.golang.org/protobuf/runtime/protoiface/@v/list: 404 Not Found
Jun 03 02:38:36 goproxy.example.com athens-proxy[2289053]:         server response: not found: module google.golang.org/protobuf/runtime/protoiface: no matching versions for query "latest"

To Reproduce

Normal traffic (esp. with @v/list and/or requests for nonexistent modules) under moderate load.

Expected behavior

No spuriously reaped processes, no "wait" errors, and legitimate error cases (e.g. a nonexistent repository) are handled normally.

Environment (please complete the following information):

  • OS: linux/amd64 in podman
  • Go version : 1.24.x
  • Proxy version : v0.16.0
  • Storage: disk

mtharp, Jun 03 '25 22:06

I see this as well when firing many concurrent requests against athens. Requests sporadically fail with an incorrect 404 response, and these "reaped child process" messages followed by "waitid: no child processes" appear in the log. If you need a client to reproduce this, you can try my updates tool and run it like this inside any Go project:

npx updates -V -f go.mod

In the meantime, I have downgraded to v0.15.4 because of this issue.

silverwind, Jul 21 '25 12:07