remove deletedJobs queue in cache model

Open Wang-Kai opened this issue 1 year ago • 8 comments

To troubleshoot this issue, my colleague and I worked until 3 AM. I sincerely hope that this fix will be merged into the community repository.

Background

In the controller component, the cache module has a separate deletedJobs queue specifically for handling job deletions. The job-controller also has its own queue to process pod and job events. Because the two queues are processed independently, an edge case can leave a job deleted from etcd but still present in the cache. Pods are then created by the job-controller and immediately deleted by the gc-controller, in a loop that repeats until the controller is restarted. The sequence is:

  1. Time 1: Received a pod update event and started processing the pod update (processing thread thread_1).
  2. Time 2: Received a job delete event and started processing the job deletion (processing thread thread_2).
  3. Time 3: thread_2 sets jobInfo.Job to nil and adds jobInfo to the deletedJobs queue.
  4. Time 4: thread_1 completes the job status update and assigns a value to jobInfo.Job.
  5. Time 5: A worker picks up jobInfo from the deletedJobs queue and finds that jobTerminated(job) == false, leading to repeated retries without success.
  6. Time 6: Eventually, the job no longer exists in etcd but still exists in the cache.
  7. Time 7: The gc-controller notices that the job no longer exists, so it cascades and deletes the pod.
  8. Time 8: The job-controller detects that the pod was deleted, but since the job still exists in the cache, it proceeds to recreate the pod.
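
The interleaving at Times 3 to 5 can be replayed deterministically with a toy model. This is only a sketch, not Volcano's code: `Job`, `JobInfo`, `Cache`, `markDeleted`, `updateStatus`, `processDeleted`, and `replayRace` are hypothetical names, and `jobTerminated` is assumed here to mean nothing more than "the Job pointer has been cleared".

```go
package main

import "fmt"

// Job and JobInfo are simplified stand-ins, not Volcano's real types.
type Job struct{ Name string }

type JobInfo struct {
	Name string
	Job  *Job
}

// Cache is a toy model: a job map plus a separate deletedJobs queue.
type Cache struct {
	jobs        map[string]*JobInfo
	deletedJobs []*JobInfo
}

// jobTerminated is an assumption for this sketch: a job counts as
// terminated once its Job pointer has been set to nil.
func jobTerminated(info *JobInfo) bool { return info.Job == nil }

// markDeleted models thread_2 at Time 3: clear Job, then enqueue.
func (c *Cache) markDeleted(name string) {
	info := c.jobs[name]
	info.Job = nil
	c.deletedJobs = append(c.deletedJobs, info)
}

// updateStatus models thread_1 at Time 4: it writes the Job back.
func (c *Cache) updateStatus(job *Job) {
	c.jobs[job.Name].Job = job
}

// processDeleted models the worker at Time 5. Entries that do not look
// terminated are put back for a later retry; it returns how many.
func (c *Cache) processDeleted() (requeued int) {
	pending := c.deletedJobs
	c.deletedJobs = nil
	for _, info := range pending {
		if !jobTerminated(info) {
			c.deletedJobs = append(c.deletedJobs, info)
			requeued++
			continue
		}
		delete(c.jobs, info.Name)
	}
	return requeued
}

// replayRace runs Times 3, 4, 5 in order and reports whether the job
// is still in the cache afterwards.
func replayRace() bool {
	job := &Job{Name: "job-a"}
	c := &Cache{jobs: map[string]*JobInfo{"job-a": {Name: "job-a", Job: job}}}
	c.markDeleted("job-a") // Time 3 (thread_2)
	c.updateStatus(job)    // Time 4 (thread_1)
	c.processDeleted()     // Time 5 (deletedJobs worker)
	_, stillCached := c.jobs["job-a"]
	return stillCached
}

func main() {
	fmt.Println("job still cached after delete:", replayRace())
}
```

In this model every later call to `processDeleted` requeues the same entry, which corresponds to the retries without success described at Time 5.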

Proposed Solution

Remove the deletedJobs queue and route job deletions through the job-controller's existing queue, so that all events for a job go through one queue. Add an IsDeleteJobAction field to the Request struct to flag job delete events. When the job-controller's processing function sees IsDeleteJobAction=true, it deletes the job from the cache directly.
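
A minimal sketch of that flow, for illustration only. `IsDeleteJobAction` and `Request` come from the description above; the `JobName` field, `JobCache`, `handle`, and the returned strings are hypothetical, and the sketch assumes requests for the same job are handled one at a time.

```go
package main

import "fmt"

// Request is a simplified stand-in for the job-controller's request
// type. Only IsDeleteJobAction is taken from the PR description.
type Request struct {
	JobName           string
	IsDeleteJobAction bool
}

type Job struct{ Name string }

// JobCache is a toy cache with no separate deletion queue.
type JobCache struct{ jobs map[string]*Job }

// handle models the processing function: a delete request removes the
// job from the cache immediately instead of deferring to another queue.
func (c *JobCache) handle(req Request) string {
	if req.IsDeleteJobAction {
		delete(c.jobs, req.JobName)
		return "deleted"
	}
	if _, ok := c.jobs[req.JobName]; !ok {
		// The job is gone, so there is nothing to sync or recreate.
		return "skipped"
	}
	return "synced"
}

func main() {
	c := &JobCache{jobs: map[string]*Job{"job-a": {Name: "job-a"}}}
	fmt.Println(c.handle(Request{JobName: "job-a"}))
	fmt.Println(c.handle(Request{JobName: "job-a", IsDeleteJobAction: true}))
	fmt.Println(c.handle(Request{JobName: "job-a"}))
}
```

Because the delete is applied in the same handler that processes pod and job updates, a later pod event for the same job finds no cache entry and has nothing to recreate.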

Fixes #3601 Fixes #3357

Wang-Kai avatar Aug 21 '24 03:08 Wang-Kai

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] avatar Feb 01 '25 01:02 stale[bot]

PTAL @Monokaix @JesseStutler

hwdef avatar Feb 06 '25 02:02 hwdef

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] avatar Apr 25 '25 23:04 stale[bot]

This is still needed. Please fix the CI and rebase the code.

hwdef avatar May 04 '25 18:05 hwdef

/cc

JesseStutler avatar Nov 03 '25 03:11 JesseStutler

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign wangyang0616 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

volcano-sh-bot avatar Nov 05 '25 12:11 volcano-sh-bot

@Wang-Kai: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

volcano-sh-bot avatar Nov 19 '25 01:11 volcano-sh-bot