
Admission controller can't restart after crashing due to volcano webhook failure policy: fail

Open djwhatle opened this issue 2 years ago • 3 comments

What happened:

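The volcano-admission Pod crashed and tried to start back up, but was denied admission. The dependency is self-referential: admitting the Pod requires the Volcano webhook, which is served by the very Pod that is trying to start, and the webhook's failure policy is Fail.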

What you expected to happen:

The volcano-admission Pod can crash and then restart successfully.

How to reproduce it (as minimally and precisely as possible):

# First scale the volcano-admission Deployment down to zero to simulate a Pod crash or OOM event
kubectl scale deployment volcano-admission -n volcano-system --replicas=0

# Wait for the volcano-admission Pod to stop, then scale the Deployment back up
kubectl scale deployment volcano-admission -n volcano-system --replicas=1

# Notice that the volcano-admission Pod never returns to a Running state
kubectl get pod -n volcano-system

NAME                                   READY   STATUS      RESTARTS   AGE
volcano-admission-init-rgdn2           0/1     Completed   0          54m
volcano-controllers-7c6b5c4f4b-cxvg5   1/1     Running     0          54m
volcano-scheduler-7f9d9984fd-zz52w     1/1     Running     0          54m

Anything else we need to know?:

Environment:

  • Volcano Version: v1.5.1
  • Kubernetes version (use kubectl version): 1.21
  • Cloud provider or hardware configuration: GKE
  • Install tools: In a kubeflow env

I have a patch for this: simply set the failurePolicy for the webhooks to Ignore instead of Fail, so that requests are still admitted when the webhook is not available. Either the admission controller should run with multiple replicas, or the failurePolicy on the webhook should not be set to Fail.
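As a rough illustration of that idea (not the patch itself), the failurePolicy of an existing webhook configuration can be changed in place with `kubectl patch` and a JSON Patch. The helper names `failure_policy_patch` and `set_failure_policy` are hypothetical, and the configuration name passed to them is an assumption: list the real names first with `kubectl get validatingwebhookconfigurations` and `kubectl get mutatingwebhookconfigurations`.

```shell
# Hypothetical helper: print a JSON Patch that sets failurePolicy on the
# webhook entry at the given index (default 0) of a webhook configuration.
failure_policy_patch() {
  policy="$1"
  index="${2:-0}"
  printf '[{"op":"replace","path":"/webhooks/%s/failurePolicy","value":"%s"}]' "$index" "$policy"
}

# Hypothetical helper: apply the patch to one ValidatingWebhookConfiguration.
# The name in $1 is whatever your cluster reports; it is not taken from this issue.
set_failure_policy() {
  name="$1"
  policy="$2"
  kubectl patch validatingwebhookconfiguration "$name" \
    --type=json -p "$(failure_policy_patch "$policy")"
}
```

For example, `set_failure_policy <configuration-name> Ignore` would patch the first webhook entry of that configuration. A configuration with several entries under `webhooks` needs one patch per index, and a reinstall or upgrade of Volcano may overwrite the change.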

djwhatle • May 17 '22 20:05

Fix implemented in #2245

djwhatle • May 17 '22 20:05

Implementing an alternate fix in an upcoming PR. It sounds like there are some potentially harmful side effects (a panic in the volcano controller) if the webhook failurePolicy is set to Ignore.

Arguably the controllers should be resilient to webhook failure without panicking, but for practical purposes I plan to open a PR that makes the webhooks skip all Pod admissions in the volcano-system and kube-system namespaces.
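One possible way to express that exclusion, sketched here under assumptions and not necessarily what the PR does, is a `namespaceSelector` on the webhook that matches every namespace except the two listed. This relies on the `kubernetes.io/metadata.name` label, which Kubernetes sets on namespaces automatically from 1.21 onward. The helper names `namespace_exclusion_patch` and `exclude_system_namespaces` are hypothetical, as is the configuration name you pass in.

```shell
# Hypothetical helper: print a JSON Patch that sets a namespaceSelector on the
# webhook entry at the given index (default 0), so the webhook is not called
# for objects in the volcano-system and kube-system namespaces.
namespace_exclusion_patch() {
  index="${1:-0}"
  printf '[{"op":"add","path":"/webhooks/%s/namespaceSelector","value":{"matchExpressions":[{"key":"kubernetes.io/metadata.name","operator":"NotIn","values":["volcano-system","kube-system"]}]}}]' "$index"
}

# Hypothetical helper: apply the patch to one ValidatingWebhookConfiguration.
# Find the real name with: kubectl get validatingwebhookconfigurations
exclude_system_namespaces() {
  name="$1"
  kubectl patch validatingwebhookconfiguration "$name" \
    --type=json -p "$(namespace_exclusion_patch)"
}
```

The `add` operation overwrites any selector already present on that entry, so check the existing value before applying it. With this selector in place the failurePolicy can stay at Fail for other namespaces, while a restarting volcano-admission Pod in volcano-system is admitted without calling the webhook.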

djwhatle • May 19 '22 15:05

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] • Sep 08 '22 22:09

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] • Nov 12 '22 09:11