ray
[Jobs] Track real job process
Signed-off-by: Cindy Zhang [email protected]
Why are these changes needed?
Resolves the issue described in https://github.com/ray-project/ray/issues/31274. On Linux systems, when a stop signal is sent, we no longer kill and wait on only the shell process (which starts the actual job as a child process). Instead, we kill the shell process together with all of its children, then poll every process until it exits, and send SIGKILL to force termination if the timeout expires first.
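For illustration, here is a minimal standard-library sketch of the stop flow described above: signal the shell and its children, poll until they exit, and fall back to SIGKILL on timeout. It is not the code in this PR. It swaps in a POSIX process group (created with `start_new_session=True`) as a stand-in for enumerating the shell's children, and the names `stop_job_tree`, `group_alive`, `timeout_s`, and `poll_interval_s` are hypothetical.

```python
import os
import signal
import subprocess
import time


def group_alive(pgid: int) -> bool:
    """Return True while at least one process remains in the process group."""
    try:
        os.killpg(pgid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # something exists, but we may not signal it
    return True


def stop_job_tree(proc: subprocess.Popen, timeout_s: float = 3.0,
                  poll_interval_s: float = 0.1) -> bool:
    """Stop the shell and its children. Return True if SIGKILL was needed.

    Assumes `proc` was started with start_new_session=True, so the shell's
    pid is also the id of a process group containing its children.
    """
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)  # polite stop for the whole group
    except ProcessLookupError:
        proc.wait()  # everything already exited; just reap the shell
        return False

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        proc.poll()  # reap the shell so it does not linger as a zombie
        if not group_alive(pgid):
            return False
        time.sleep(poll_interval_s)

    # Timeout: force-kill whatever is still left in the group.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    proc.wait()
    return True


if __name__ == "__main__":
    # The shell starts the "real job" (sleep) as a child, as in the PR text.
    job = subprocess.Popen("sleep 30 & wait", shell=True, start_new_session=True)
    time.sleep(0.2)  # give the shell a moment to start its child
    forced = stop_job_tree(job)
    print("forced kill:", forced, "shell exit code:", job.returncode)
```

A process group keeps the sketch dependency-free, but it misses children that move themselves into a different group or session. Walking the child tree explicitly avoids that limitation.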
Related issue number
Closes #31274
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
Possibly relevant failure: https://buildkite.com/ray-project/oss-ci-build-pr/builds/8335#01855f60-47df-4ec7-9416-0404454d15c1/3428-3844
Looks good! Does the new test fail without the change from this PR? (If not, we should have a test like that)
Yup, the new `test_stop_job_timeout` fails before this PR.