
[Jobs] Revisit Ray Job execution and monitoring

Open alexeykudinkin opened this issue 9 months ago • 2 comments

Why are these changes needed?

Context

This refactoring has two motivations:

  • Consolidating all job management in one place (JobSupervisor; previously spread between JobManager and JobSupervisor)

  • Decoupling management and monitoring of the job execution from the execution of the job driver itself (previously both were coupled inside JobSupervisor)

These steps are necessary to be able to:

  • Perform job management and monitoring exclusively from the head-node
  • Run actual job drivers on any node (worker or head)

Changes

With the stated goals in mind, the following are the primary changes implemented (the rest only facilitate this migration):

  1. All job management and monitoring is consolidated inside JobSupervisor (always running on the head node)
  2. Actual job driver execution is performed by a (descendant) JobExecutor actor (which can run on any node)
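
To illustrate the split described above, here is a minimal sketch in plain Python. It is not the code from this PR and it does not use Ray actors: the class names JobSupervisor and JobExecutor come from the description, but every method name (`submit`, `refresh`, `wait`, `start`, `poll`), the `JobStatus` values, and the use of `subprocess` to stand in for the driver are assumptions made for illustration only.

```python
import subprocess
import sys
from enum import Enum
from typing import Dict, List, Optional


class JobStatus(str, Enum):
    # Hypothetical status values, chosen for this sketch.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


class JobExecutor:
    """Only runs the job driver. In the PR this is an actor that may be
    placed on any node; here it is a local object wrapping a subprocess."""

    def __init__(self, entrypoint: List[str]) -> None:
        self._entrypoint = entrypoint
        self._proc: Optional[subprocess.Popen] = None

    def start(self) -> None:
        self._proc = subprocess.Popen(self._entrypoint)

    def poll(self) -> Optional[int]:
        # None while the driver is still running, otherwise its exit code.
        return None if self._proc is None else self._proc.poll()

    def wait(self, timeout: Optional[float] = None) -> int:
        assert self._proc is not None, "driver was never started"
        return self._proc.wait(timeout=timeout)


class JobSupervisor:
    """Owns all job bookkeeping and monitoring (head node in the PR) and
    delegates driver execution to a JobExecutor."""

    def __init__(self) -> None:
        self._executors: Dict[str, JobExecutor] = {}
        self._status: Dict[str, JobStatus] = {}

    def submit(self, job_id: str, entrypoint: List[str]) -> None:
        executor = JobExecutor(entrypoint)
        self._executors[job_id] = executor
        self._status[job_id] = JobStatus.PENDING
        executor.start()
        self._status[job_id] = JobStatus.RUNNING

    def refresh(self, job_id: str) -> JobStatus:
        # Monitoring step: translate the driver's exit code into a status.
        code = self._executors[job_id].poll()
        if code is not None:
            self._status[job_id] = (
                JobStatus.SUCCEEDED if code == 0 else JobStatus.FAILED
            )
        return self._status[job_id]

    def wait(self, job_id: str, timeout: Optional[float] = None) -> JobStatus:
        self._executors[job_id].wait(timeout=timeout)
        return self.refresh(job_id)


if __name__ == "__main__":
    supervisor = JobSupervisor()
    supervisor.submit("job-1", [sys.executable, "-c", "print('hello from driver')"])
    print(supervisor.wait("job-1", timeout=60).value)
```

The point of the sketch is the ownership boundary: the supervisor never runs the driver itself, it only holds a handle to the executor and derives job status from it, which is what allows the executor to be placed on a different node.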

Related issue number

Checks

  • [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

alexeykudinkin, May 03 '24 02:05