Remove deletedJobs queue in cache module
To troubleshoot this issue, my colleague and I worked until 3 AM. I sincerely hope that this fix will be merged into the community repository.
Background
In the controller component, the cache module has a separate deletedJobs queue specifically for handling job deletions. The job-controller also has a queue to process pod and job events. In edge cases, a situation may occur where a job is deleted from etcd but not from the cache. This leads to pods being created by the job-controller and immediately deleted by the gc-controller, causing a repetitive loop until the controller is restarted.
- Time 1: A pod update event is received and processing of the pod update starts (processing thread thread_1).
- Time 2: A job delete event is received and processing of the job deletion starts (processing thread thread_2).
- Time 3: thread_2 sets jobInfo.Job to nil and adds jobInfo to the deletedJobs queue.
- Time 4: thread_1 completes the job status update and assigns a value to jobInfo.Job.
- Time 5: A worker picks up jobInfo from the deletedJobs queue and finds that jobTerminated(job) == false, leading to repeated retries without success.
- Time 6: Eventually, the job no longer exists in etcd but still exists in the cache.
- Time 7: The gc-controller notices that the job no longer exists, so it cascades and deletes the pod.
- Time 8: The job-controller detects that the pod was deleted, but since the job still exists in the cache, it proceeds to recreate the pod.
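The interleaving at Times 3 to 5 can be replayed deterministically. The sketch below is a simplified illustration, not the actual cache code: `Job`, `JobInfo`, `replayRace`, and the one-line `jobTerminated` check are hypothetical stand-ins, and the real check may look at more than the `Job` pointer.

```go
package main

import "fmt"

// Job and JobInfo are simplified, hypothetical stand-ins for the cache types.
type Job struct {
	Name string
}

type JobInfo struct {
	Job *Job
}

// jobTerminated is an assumed, reduced form of the check described above:
// an entry is treated as removable only once its Job pointer is cleared.
func jobTerminated(info *JobInfo) bool {
	return info.Job == nil
}

// replayRace replays Times 3 to 5 in order on one goroutine and reports
// whether the cleanup worker would be allowed to drop the cache entry.
func replayRace() bool {
	info := &JobInfo{Job: &Job{Name: "job-a"}}
	var deletedJobs []*JobInfo

	// Time 3: the delete handler clears the pointer and enqueues the entry.
	info.Job = nil
	deletedJobs = append(deletedJobs, info)

	// Time 4: the in-flight status update writes the job back.
	info.Job = &Job{Name: "job-a"}

	// Time 5: the cleanup worker inspects the queued entry.
	return jobTerminated(deletedJobs[0])
}

func main() {
	fmt.Println("cleanup possible:", replayRace())
}
```

Because the queue holds a pointer to the same `JobInfo` that the late update mutates, the worker never sees a cleared entry, which matches the endless retries described at Time 5.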
Proposed Solution
Remove the deletedJobs queue and route job deletions through the job-controller's existing queue, so that all events for a job go through one path. Add an IsDeleteJobAction field to the Request struct to flag job delete events. When the job-controller's processing function sees IsDeleteJobAction=true, it deletes the job from the cache directly.
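A minimal sketch of the idea follows. Only the `IsDeleteJobAction` field name comes from this PR; `JobKey`, `jobCache`, `newJobCache`, and `processRequest` are hypothetical names used for illustration, and the sketch assumes that requests for a given job are handled one at a time in queue order.

```go
package main

import (
	"fmt"
	"sync"
)

// Request is a pared-down, hypothetical version of the controller request.
// Only IsDeleteJobAction is taken from the proposal.
type Request struct {
	JobKey            string
	IsDeleteJobAction bool
}

// jobCache is a toy stand-in for the controller cache.
type jobCache struct {
	mu   sync.Mutex
	jobs map[string]struct{}
}

func newJobCache(keys ...string) *jobCache {
	c := &jobCache{jobs: make(map[string]struct{})}
	for _, k := range keys {
		c.jobs[k] = struct{}{}
	}
	return c
}

func (c *jobCache) Delete(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.jobs, key)
}

func (c *jobCache) Has(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.jobs[key]
	return ok
}

// processRequest handles one request from the single job-controller queue.
// It returns true when the request was a delete and the cache entry was dropped.
func processRequest(c *jobCache, req Request) bool {
	if req.IsDeleteJobAction {
		c.Delete(req.JobKey)
		return true
	}
	// Regular pod and job event handling would go here.
	return false
}

func main() {
	cache := newJobCache("default/job-a")
	// The pod update and the job delete share one queue and are drained in order.
	queue := []Request{
		{JobKey: "default/job-a"},
		{JobKey: "default/job-a", IsDeleteJobAction: true},
	}
	for _, req := range queue {
		processRequest(cache, req)
	}
	fmt.Println("job still cached:", cache.Has("default/job-a"))
}
```

With a single ordered path, the delete is applied after any earlier update for the same job has finished, so no later write can re-populate the entry.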
Fixes #3601
Fixes #3357
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
PTAL @Monokaix @JesseStutler
This is still needed. Please fix the CI and rebase the code.
/cc
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wangyang0616 for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@Wang-Kai: PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.