
[Dashboard] Add cleanup of `job_table` in `delete_job`

Open liuxsh9 opened this issue 1 year ago • 3 comments

Why are these changes needed?

The current delete_job API accepts only a submission_id and deletes only the submission info, so job info can remain in the job_table. In large-scale, persistent Ray clusters, the job_table cannot be cleaned up and takes up a significant amount of GCS memory and of the Redis instance used for fault tolerance.

This PR lets the delete_job API accept either a submission_id or a job_id, and it fully cleans up both the submission info and the job_table entry, making the deletion operation complete.
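
The intended behavior can be sketched as follows. This is only an illustration, not the code in this PR: the two dictionaries are hypothetical in-memory stand-ins for the submission info and the job_table, and the field names `job_id` and `submission_id` used to link the two records are assumptions.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-ins (assumptions, not Ray's real data structures).
submission_info: Dict[str, dict] = {}  # keyed by submission_id
job_table: Dict[str, dict] = {}        # keyed by job_id


def delete_job(job_or_submission_id: str) -> bool:
    """Delete the submission info and the matching job_table entry.

    Accepts either a submission_id or a job_id.
    Returns True if at least one record was removed.
    """
    submission_id: Optional[str] = None
    job_id: Optional[str] = None

    if job_or_submission_id in submission_info:
        # Caller passed a submission_id; look up the linked job_id, if any.
        submission_id = job_or_submission_id
        job_id = submission_info[submission_id].get("job_id")
    elif job_or_submission_id in job_table:
        # Caller passed a job_id; look up the linked submission_id, if any.
        job_id = job_or_submission_id
        submission_id = job_table[job_id].get("submission_id")

    deleted = False
    if submission_id is not None:
        if submission_info.pop(submission_id, None) is not None:
            deleted = True
    if job_id is not None:
        if job_table.pop(job_id, None) is not None:
            deleted = True
    return deleted
```

In this sketch, a job that was started without a submission (so it only has a job_table entry) can still be removed by its job_id, and an unknown ID returns False without raising.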

Related issue number

Checks

  • [ ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

liuxsh9 avatar Jun 21 '24 01:06 liuxsh9

@edoakes @rkooo567 You probably have context on this since you reviewed the original PR https://github.com/ray-project/ray/pull/30056. Is it intentional not to delete the job from job_table, or is it just an oversight?

jjyao avatar Jun 21 '24 04:06 jjyao

@edoakes @rkooo567 You probably have context on this since you reviewed the original PR #30056. Is it intentional not to delete the job from job_table, or is it just an oversight?

I have no idea -- it might be used for observability e.g., dashboard? @alanwguo might know

edoakes avatar Jun 24 '24 14:06 edoakes

When a user calls delete_job, the intention is to clean up the job info and avoid leaving behind unnecessary metadata. Currently, only the submission info is cleared, which makes the job invisible in the dashboard. However, the detailed info still remains in the job_table. Even if the user deletes each job individually, running tens of thousands of jobs will cause a significant surge in GCS memory and Redis usage. This is why we think it's necessary to clean up. @jjyao @alanwguo what's your take on this issue?

liuxsh9 avatar Jun 28 '24 07:06 liuxsh9

Excuse me, do we need a code owner for further review? @alanwguo

liuxsh9 avatar Jul 05 '24 13:07 liuxsh9

Adding @hongchaodeng @ruisearch42 since this is to do with GCS.

anyscalesam avatar Jul 08 '24 17:07 anyscalesam

If the delete API's purpose is to delete the entry for system maintenance, I think it makes sense to delete the entry from GCS. @edoakes what's the exact purpose of this delete_job API (also, is it internal or external)?

rkooo567 avatar Aug 06 '24 16:08 rkooo567

But before proceeding further, we should think again about why we have two job tables: internal_kv and the GCS JobTable. Shall we just unify them and keep everything under the GCS JobTable? That way we will have only one source of truth, and the code will be easier to maintain and debug.

I think the job in job submission and Ray's core job are not exactly the same thing now (a core job is simply a Ray driver, and job submission doesn't go through Ray's GCS; it is a dashboard API). Agreed that there's a way to integrate them, though. I remember we discussed this before, and somehow it never happened. cc @edoakes do you remember any discussion related to this?

rkooo567 avatar Aug 07 '24 20:08 rkooo567

lmk when everything is addressed!

rkooo567 avatar Aug 28 '24 20:08 rkooo567

Please review. Thanks! @rkooo567

liuxsh9 avatar Aug 30 '24 11:08 liuxsh9

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions[bot] avatar Jun 01 '25 00:06 github-actions[bot]

This pull request has been automatically closed because there has been no more activity in the 14 days since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

github-actions[bot] avatar Jun 16 '25 00:06 github-actions[bot]