[Dashboard] Add cleanup of `job_table` in `delete_job`
Why are these changes needed?
The current delete_job API only accepts a submission_id and deletes only the submission info, so job info can remain in the job_table. In large-scale, long-running Ray clusters the job_table is never cleaned up, which significantly grows GCS memory and the Redis storage used for fault tolerance.
This PR lets the delete_job API accept either a submission_id or a job_id, and cleans up both the submission info and the job_table entry, making the deletion operation complete. A usage sketch is shown below.
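Intended usage after this change (a sketch of the behavior described above, assuming a local cluster with the dashboard at the default address; the ID values are placeholders):

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

# Before this PR, only a submission_id was accepted and only the
# submission info (internal_kv) was removed. With this change, either
# ID form should work, and the matching GCS job_table entry is
# removed as well.
client.delete_job("raysubmit_XXXXXXXXXXXXXXXX")  # submission_id
client.delete_job("01000000")                    # driver job_id (hex), placeholder
```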
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
@edoakes @rkooo567 You probably have context on this since you reviewed the original PR https://github.com/ray-project/ray/pull/30056. Is it intentional not to delete the job from `job_table`, or is it just an oversight?
I have no idea -- it might be used for observability, e.g., the dashboard? @alanwguo might know
When a user calls delete_job, the intention is to clean up the job info and avoid leaving behind unnecessary metadata. Currently, only the submission info is cleared, which makes the job invisible in the dashboard; however, detailed info still remains in the job_table. Even if the user deletes each job individually, running tens of thousands of jobs will cause a significant surge in GCS memory and Redis usage. This is why we think the cleanup is necessary (see the reproduction sketch below). @jjyao @alanwguo what's your take on this issue?
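For concreteness, here is a minimal way to observe the leftover metadata on a current cluster (assuming a local cluster with the dashboard at the default address; the leftover job_table entry is the behavior this PR addresses):

```python
import time
from ray.job_submission import JobSubmissionClient
from ray.util.state import list_jobs

client = JobSubmissionClient("http://127.0.0.1:8265")
submission_id = client.submit_job(entrypoint="echo hello")

# delete_job requires the job to be in a terminal state first.
while not client.get_job_status(submission_id).is_terminal():
    time.sleep(1)

client.delete_job(submission_id)

# The submission no longer shows up in the dashboard, but the driver's
# record can still be listed from the GCS job_table -- the leftover
# metadata this PR wants to clean up.
print(list_jobs())
```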
Excuse me, do we need a code owner for further review? @alanwguo
Adding @hongchaodeng @ruisearch42 since this has to do with GCS.
If the delete API's purpose is to delete the entry for system maintenance, I think it makes sense to delete the entry from GCS. @edoakes what's the exact purpose of this delete_job API (also, is it internal or external)?
But before proceeding further, we should think again about why we have two job tables -- internal_kv and the GCS JobTable. Shall we just unify them and keep everything under the GCS JobTable? That way we would have only one source of truth, and the code would be easier to maintain and debug.
I think the job in job submission and Ray's core job are not exactly the same thing now (a core job is simply a Ray driver, and job submission doesn't go through Ray's GCS; it's a dashboard API). Agreed there's a way to integrate them, though. I remember we discussed this before and it somehow never happened. cc @edoakes do you remember any discussion related to this?
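As a small illustration of the two notions of "job" being discussed, the dashboard's job listing exposes both IDs side by side (assuming a running local cluster with the dashboard at the default address):

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
for job in client.list_jobs():
    # type is SUBMISSION for jobs that went through the job-submission
    # API and DRIVER for plain Ray drivers; submission_id is None for
    # the latter, while job_id refers to the core (GCS) job.
    print(job.type, job.job_id, job.submission_id)
```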
lmk when everything is addressed!
Please review. Thanks! @rkooo567
This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.
You can always ask for help on our discussion forum or Ray's public slack channel.
If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically closed because there has been no more activity in the 14 days since being marked stale.
Please feel free to reopen or open a new pull request if you'd still like this to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for your contribution!