Admission controller can't restart after crashing due to volcano webhook failure policy: fail
What happened:
The volcano-admission Pod crashed and tried to restart, but it was denied admission. The dependency is self-referential: with failurePolicy: Fail, admitting the new Pod requires the volcano-admission webhook, which that same Pod serves.
What you expected to happen:
volcano-admission Pod can crash and restart
How to reproduce it (as minimally and precisely as possible):
# First scale down the volcano-admission Pod to simulate a Pod crash or OOM event
kubectl scale deployment volcano-admission -n volcano-system --replicas=0
# Wait for the volcano-admission Pod to stop, then scale the deployment back up
kubectl scale deployment volcano-admission -n volcano-system --replicas=1
# Notice the volcano-admission pod won't come back into a running state
kubectl get pod -n volcano-system
NAME READY STATUS RESTARTS AGE
volcano-admission-init-rgdn2 0/1 Completed 0 54m
volcano-controllers-7c6b5c4f4b-cxvg5 1/1 Running 0 54m
volcano-scheduler-7f9d9984fd-zz52w 1/1 Running 0 54m
Anything else we need to know?:
Environment:
- Volcano Version: v1.5.1
- Kubernetes version (use kubectl version): 1.21
- Cloud provider or hardware configuration: GKE
- Install tools: In a kubeflow env
I have a patch for this: set the webhooks' failurePolicy to Ignore instead of Fail, so that requests are still admitted when the webhook is unavailable. Either the admission controller should run multiple replicas, or the webhooks' failurePolicy should not be set to Fail.
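For illustration, the change described above amounts to a one-field edit in the webhook configuration. The fragment below is a sketch, not the actual patch: the object name, webhook name, service name, and path are placeholders assumed for the example, and only the failurePolicy line is the point.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-volcano-pods-mutate        # placeholder, not necessarily the real object name
webhooks:
  - name: mutatepod.example.volcano.sh     # placeholder webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore                  # was Fail; Ignore admits the request when the webhook is unreachable
    clientConfig:
      service:
        name: example-admission-service    # placeholder service name
        namespace: volcano-system
        path: /pods/mutate                 # placeholder path
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```

The trade-off is that, with Ignore, any object created while the webhook is down skips its validation or mutation entirely.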
Fix implemented in #2245
I am implementing an alternate fix in an upcoming PR. It sounds like there are some potentially harmful side effects (a panic in the volcano controller) if the webhook failurePolicy is set to Ignore.
Arguably the controllers should be resilient to webhook failure without panicking, but for practical purposes I plan to open a PR that makes the webhooks skip all Pod admissions in the volcano-system and kube-system namespaces.
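One way to express that exclusion is a namespaceSelector on each webhook entry. This is a sketch under assumptions, not the contents of the planned PR: it relies on the kubernetes.io/metadata.name label, which the API server sets on every namespace by default from Kubernetes 1.21. On older clusters the namespaces would need an explicit label instead.

```yaml
# Fragment of one entry under `webhooks:`; the other fields stay unchanged.
failurePolicy: Fail
namespaceSelector:
  matchExpressions:
    - key: kubernetes.io/metadata.name   # set automatically on namespaces (assumes Kubernetes 1.21+)
      operator: NotIn
      values:
        - volcano-system
        - kube-system
```

Because requests from these namespaces never reach the webhook, the volcano-admission Pod can be admitted while the webhook is down, and failurePolicy: Fail stays in effect everywhere else. If a webhook's rules cover more than Pods, the selector skips those resources in the excluded namespaces too.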
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗