[Jobs] Track real job process

Open zcin opened this issue 2 years ago • 2 comments

Signed-off-by: Cindy Zhang [email protected]

Why are these changes needed?

Resolves the issue described in https://github.com/ray-project/ray/issues/31274. On Linux systems, when a stop signal is sent, we previously killed and waited on only the shell process, which starts the actual job as a child process. With this change, we kill the shell process together with all of its children, then poll every process until it exits, falling back to a forced SIGKILL on timeout.
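For illustration only, here is a minimal stdlib sketch of the "signal, poll, then SIGKILL on timeout" pattern described above. It is not the code in this PR: rather than enumerating the shell's children, it assumes the shell was launched with `start_new_session=True` so that the shell and its children share one process group that can be signalled together. The names `stop_job`, `_group_alive`, `timeout_s`, and `poll_interval_s` are hypothetical, and the sketch is POSIX-only.

```python
import os
import signal
import subprocess
import time


def _group_alive(pgid: int) -> bool:
    """Return True if any process in the group still exists (zombies count)."""
    try:
        os.killpg(pgid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # something exists, we just may not signal it
    return True


def stop_job(
    proc: subprocess.Popen,
    timeout_s: float = 3.0,
    poll_interval_s: float = 0.1,
) -> bool:
    """Stop the shell and everything in its process group.

    Assumes `proc` was started with start_new_session=True, so proc.pid is
    also the process group id. Returns True if SIGKILL had to be sent.
    """
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)  # polite stop for shell + children
    except ProcessLookupError:
        proc.poll()
        return False  # nothing left to stop

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        proc.poll()  # reap the shell so it does not linger as a zombie
        if not _group_alive(pgid):
            return False
        time.sleep(poll_interval_s)

    # Timed out: force-kill whatever is left in the group.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    proc.wait()
    return True
```

A caller would start the job with something like `subprocess.Popen(["sh", "-c", cmd], start_new_session=True)` and later call `stop_job(proc)`. The process-group shortcut misses children that move themselves into a different session or group, which is one reason to track the child processes explicitly instead.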

Related issue number

Closes #31274

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

zcin, Dec 23 '22 00:12

Possibly relevant failure https://buildkite.com/ray-project/oss-ci-build-pr/builds/8335#01855f60-47df-4ec7-9416-0404454d15c1/3428-3844

architkulkarni, Dec 29 '22 21:12

Looks good! Does the new test fail without the change from this PR? (If not, we should have a test like that)

Yup, the new test_stop_job_timeout fails before this PR.

zcin, Dec 29 '22 21:12