[Jobs] Revisit Ray Job execution and monitoring
Why are these changes needed?
Context
This refactoring has two motivations:
- Consolidating all job management in one place (JobSupervisor; previously spread between JobManager and JobSupervisor)
- Decoupling management and monitoring of job execution from the execution of the job driver itself (previously both were coupled inside JobSupervisor)
These steps are necessary to be able to:
- Perform job management and monitoring exclusively from the head-node
- Run actual job drivers on any node (worker or head)
Changes
With the stated goals in mind, these are the primary changes (the rest exist only to facilitate this migration):
- All job management and monitoring is consolidated inside JobSupervisor (always running on the head node)
- Actual job driver execution is performed by a (descendant) JobExecutor actor (which can run on any node)
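
To illustrate the split described above, here is a minimal plain-Python sketch. It is not the code in this PR and does not use Ray: the classes only borrow the JobSupervisor and JobExecutor names, and every method name (`submit`, `refresh`, `wait`, `start`, `poll`) as well as the status strings are assumptions made for illustration. The executor does nothing but run the driver process; the supervisor owns status tracking and monitoring and never runs the driver itself.

```python
import subprocess
import sys
from typing import Dict, List, Optional


class JobExecutor:
    """Hypothetical stand-in: only launches and holds the driver process."""

    def __init__(self, entrypoint: List[str]) -> None:
        self._entrypoint = entrypoint
        self._proc: Optional[subprocess.Popen] = None

    def start(self) -> None:
        # Launch the job driver as a child process.
        self._proc = subprocess.Popen(self._entrypoint)

    def poll(self) -> Optional[int]:
        # None while the driver is running (or not started), else its exit code.
        return self._proc.poll() if self._proc is not None else None

    def wait(self, timeout: Optional[float] = None) -> int:
        if self._proc is None:
            raise RuntimeError("driver was never started")
        return self._proc.wait(timeout=timeout)


class JobSupervisor:
    """Hypothetical stand-in: owns job state and monitoring for all jobs."""

    def __init__(self) -> None:
        self._executors: Dict[str, JobExecutor] = {}
        self._status: Dict[str, str] = {}

    def submit(self, job_id: str, entrypoint: List[str]) -> None:
        # The supervisor delegates execution; it never runs the driver itself.
        executor = JobExecutor(entrypoint)
        executor.start()
        self._executors[job_id] = executor
        self._status[job_id] = "RUNNING"

    def refresh(self, job_id: str) -> str:
        # Monitoring step: translate the driver's exit code into a job status.
        code = self._executors[job_id].poll()
        if code is not None:
            self._status[job_id] = "SUCCEEDED" if code == 0 else "FAILED"
        return self._status[job_id]

    def wait(self, job_id: str, timeout: float = 30.0) -> str:
        self._executors[job_id].wait(timeout=timeout)
        return self.refresh(job_id)


if __name__ == "__main__":
    supervisor = JobSupervisor()
    supervisor.submit("demo", [sys.executable, "-c", "print('driver ran')"])
    print(supervisor.wait("demo"))
```

In the actual design the executor is a Ray actor that may be scheduled on any node, while the supervisor stays on the head node; the sketch keeps both in one process purely to show the separation of responsibilities.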
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [ ] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(