PodGroup may be lost if you delete a pending vcjob and then recreate it immediately.

Open Christina935 opened this issue 2 years ago • 6 comments

What happened: If you delete a pending vcjob and then recreate it immediately, the podgroup of the vcjob may be lost.

What you expected to happen: Recreating the vcjob creates a new vcjob and a new podgroup.

How to reproduce it (as minimally and precisely as possible): Delete a pending vcjob and recreate it immediately. Repeat this twice, and the podgroup will always be lost on one of the attempts.
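
The reproduction steps above could be scripted roughly as follows. This is a minimal sketch, not part of the original report: the manifest path, job name, and namespace are hypothetical placeholders, and it assumes `kubectl` is on the PATH and the Volcano CRDs (`jobs.batch.volcano.sh`, `podgroups.scheduling.volcano.sh`) are installed.

```python
import subprocess
from typing import List


def repro_commands(manifest: str, job: str, namespace: str) -> List[List[str]]:
    """Return the kubectl invocations for one delete-then-recreate cycle.

    All three arguments are placeholders supplied by the caller.
    """
    return [
        # Delete the (pending) Volcano job.
        ["kubectl", "delete", "jobs.batch.volcano.sh", job, "-n", namespace],
        # Recreate it right away from the same manifest.
        ["kubectl", "create", "-f", manifest, "-n", namespace],
        # List podgroups so a missing one can be spotted by eye.
        ["kubectl", "get", "podgroups.scheduling.volcano.sh", "-n", namespace],
    ]


def run_cycles(manifest: str, job: str, namespace: str, cycles: int = 2) -> None:
    """Run the cycle several times; the report says twice is enough."""
    for _ in range(cycles):
        for argv in repro_commands(manifest, job, namespace):
            # check=False: keep going even if a step fails, so the
            # final podgroup listing is still printed.
            subprocess.run(argv, check=False)
```

Calling `run_cycles("job.yaml", "test-job", "default")` against a test cluster would run the cycle twice; the job must still be pending when it is deleted for the report's conditions to hold.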

Anything else we need to know?:

Environment:

  • Volcano Version: 1.2.0
  • Kubernetes version (use kubectl version): 1.20.15 / 1.18.0
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Christina935 avatar Mar 01 '22 02:03 Christina935

Thanks for your report. I am able to reproduce this issue, let me take a look :)

shinytang6 avatar Mar 01 '22 07:03 shinytang6

I wanted to +1 this as I'm also seeing it. Re-creating the job with a different name is the only way to get the podgroup to be created.

tFable avatar Mar 03 '22 15:03 tFable

Maybe this is helpful. When I ran into this (jobs were being created but no podgroups), I noticed that the admission-controller kept throwing the following error over and over until the pod was deleted (and was automatically re-created):

I0303 18:19:12.594965       1 job_controller.go:324] Execute <SyncJob> on Job <kweiler/kweiler-test-job1d> in <Pending> by <*state.pendingState>.
I0303 18:19:12.594971       1 job_controller_actions.go:223] Starting to sync up Job <kweiler/kweiler-test-job1d>, current version 0
I0303 18:19:12.595001       1 job_controller_actions.go:164] Starting to initiate Job <kweiler/kweiler-test-job1d>
W0303 18:19:12.595011       1 job_controller_actions.go:744] Ignore task test-head priority class : priorityclass.scheduling.k8s.io "" not found
W0303 18:19:12.595018       1 job_controller_actions.go:744] Ignore task test-workers priority class : priorityclass.scheduling.k8s.io "" not found
I0303 18:19:12.599644       1 queue_controller_action.go:81] End sync queue q1.
I0303 18:19:12.599660       1 queue_controller.go:220] Finished syncing queue q1 (4.735513ms).
I0303 18:19:12.600388       1 event.go:282] Event(v1.ObjectReference{Kind:"Job", Namespace:"kweiler", Name:"kweiler-test-job1d", UID:"d658e059-27f9-4c92-8a75-b3aa6db2af6c", APIVersion:"batch.volcano.sh/v1alpha1", ResourceVersion:"1504379", FieldPath:""}): type: 'Warning' reason: 'PodGroupPending' PodGroup kweiler:kweiler-test-job1d unschedule,reason: 5/0 tasks in gang unschedulable: pod group is not ready, 5 minAvailable
I0303 18:19:37.823461       1 cache.go:370] Try to delete Job <kweiler/kweiler-test-job1d>
I0303 18:19:37.828671       1 cache.go:370] Try to delete Job <kweiler/kweiler-test-job1d>
I0303 18:19:37.836015       1 cache.go:370] Try to delete Job <kweiler/kweiler-test-job1d>
I0303 18:19:37.839223       1 cache.go:361] Job <kweiler/kweiler-test-job1d> was deleted.
I0303 18:19:37.839579       1 queue_controller.go:238] Begin execute SyncQueue action for queue q1, current status Open
I0303 18:19:37.839590       1 queue_controller_action.go:35] Begin to sync queue q1.
I0303 18:19:37.839822       1 cache.go:370] Try to delete Job </>
I0303 18:19:37.843856       1 queue_controller_action.go:81] End sync queue q1.
I0303 18:19:37.843867       1 queue_controller.go:220] Finished syncing queue q1 (4.294222ms).
I0303 18:19:37.845901       1 cache.go:361] Job </> was deleted.
E0303 18:19:37.901083       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-0" not found
E0303 18:19:38.001280       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-0" not found
E0303 18:19:38.100447       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-0" not found
E0303 18:19:38.200680       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-0" not found
E0303 18:19:38.301126       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-0" not found
E0303 18:19:38.401337       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-0" not found
E0303 18:19:38.500506       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-0" not found
E0303 18:19:38.600772       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-0" not found
E0303 18:19:38.701192       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-0" not found
E0303 18:19:38.800844       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-0" not found
E0303 18:19:38.901057       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-0" not found
E0303 18:19:39.001282       1 job_controller_actions.go:509] Failed to get pod kweiler/kweiler-test-job1d-test-head-0 pod "kweiler-test-job1d-test-head-
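
To see at a glance which message repeats in a log like the one above, the lines can be grouped by severity, source location, and message. This is a minimal sketch that assumes the standard klog header layout visible in the excerpt (severity letter, `mmdd`, time, thread id, `file:line]`); the function name `count_repeats` is illustrative only.

```python
import re
from collections import Counter
from typing import Iterable

# klog header: <severity><mmdd> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>] <message>
KLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\s:]+:\d+)\] (?P<msg>.*)$"
)


def count_repeats(lines: Iterable[str]) -> Counter:
    """Count log lines by (severity, file:line, message), ignoring timestamps."""
    counts: Counter = Counter()
    for line in lines:
        match = KLOG_LINE.match(line.strip())
        if match is None:
            # Skip anything that does not look like a klog line.
            continue
        key = (match.group("sev"), match.group("src"), match.group("msg"))
        counts[key] += 1
    return counts
```

Feeding the controller log through `count_repeats` and printing `counts.most_common(5)` would surface the repeated `Failed to get pod` error from `job_controller_actions.go:509` as the top entry.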

tFable avatar Mar 03 '22 18:03 tFable

I think this is the same problem as #2130

Jason-Liu-Dream avatar Jun 08 '22 07:06 Jason-Liu-Dream

I think this is the same problem as #2130

It may not be the same problem. The problem in #2130 is probably caused by dependsOn, whereas the Volcano version used in this issue is 1.2, which does not have the dependsOn feature yet.

hwdef avatar Jun 08 '22 08:06 hwdef

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Sep 08 '22 22:09 stale[bot]