Add doc on how to clean up `Volcano` completely

Open Yikun opened this issue 3 years ago • 8 comments

What would you like to be added:

Add a doc describing how to clean up Volcano completely.

I0312 06:48:38.683192       1 event.go:291] "Event occurred" object="volcano-system/volcano-admission-54b4798bff" kind="ReplicaSet" apiVersion="apps/v1" type="Warning" reason="FailedCreate" message="Error creating: Internal error occurred: failed calling webhook \"mutatepod.volcano.sh\": failed to call webhook: Post \"https://volcano-admission-service.volcano-system.svc:443/pods/mutate?timeout=10s\": dial tcp xxx.xxx.xxx.xxx:443: connect: connection refused"

Why is this needed:

kubectl apply -f ./installer/volcano-development-arm64.yaml
kubectl delete -f ./installer/volcano-development-arm64.yaml

When reinstalling Volcano with apply and delete, we still need to clean up the validatingwebhookconfigurations and mutatingwebhookconfigurations, then re-apply:

kubectl delete validatingwebhookconfigurations volcano-admission-service-jobs-validate volcano-admission-service-pods-validate volcano-admission-service-queues-validate
kubectl delete mutatingwebhookconfigurations volcano-admission-service-jobs-mutate volcano-admission-service-podgroups-mutate volcano-admission-service-pods-mutate volcano-admission-service-queues-mutate

Supplement (see https://github.com/volcano-sh/volcano/issues/2102#issuecomment-1072295632): volcano-admission-init creates a secret that is also not cleaned up.
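
The manual cleanup above can be wrapped in a small function. This is only a sketch: the webhook names are the ones listed in this issue and may differ in other Volcano versions, and `KUBECTL` is a hypothetical override variable (not part of Volcano or kubectl) added so the function can be dry-run. It does not touch the secret created by volcano-admission-init, since this thread does not give its name.

```shell
#!/bin/sh
# Sketch: delete the cluster-scoped webhook configurations that
# `kubectl delete -f <manifest>` leaves behind (names taken from this issue).
# KUBECTL is a hypothetical override; it defaults to the real kubectl.
cleanup_volcano_webhooks() {
  kubectl_cmd="${KUBECTL:-kubectl}"
  # --ignore-not-found keeps the cleanup safe to run more than once.
  "$kubectl_cmd" delete validatingwebhookconfigurations --ignore-not-found \
    volcano-admission-service-jobs-validate \
    volcano-admission-service-pods-validate \
    volcano-admission-service-queues-validate
  "$kubectl_cmd" delete mutatingwebhookconfigurations --ignore-not-found \
    volcano-admission-service-jobs-mutate \
    volcano-admission-service-podgroups-mutate \
    volcano-admission-service-pods-mutate \
    volcano-admission-service-queues-mutate
}
```

Running `KUBECTL=echo cleanup_volcano_webhooks` prints the two delete commands without contacting a cluster; calling `cleanup_volcano_webhooks` on its own performs the deletion.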

Yikun avatar Mar 12 '22 23:03 Yikun

For some reason I still see some events like:

Error creating: Internal error occurred: failed calling webhook "mutatepod.volcano.sh": Post "https://volcano-admission-service.volcano-system.svc:443/pods/mutate?timeout=10s": x509: certificate has expired or is not yet valid: current time 2022-03-19T07:53:35Z is after 2021-07-03T10:00:41Z

This happens even after deleting Volcano and the validatingwebhookconfigurations and mutatingwebhookconfigurations.
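
One way to check whether a webhook configuration is still registered (a possible, but unconfirmed, cause of events like the one above) is to list every admission webhook configuration whose name contains "volcano". A minimal sketch, again using a hypothetical `KUBECTL` override so it can be tried without a cluster:

```shell
#!/bin/sh
# Sketch: list leftover Volcano webhook configurations, if any.
# Prints nothing when none remain; `|| true` hides grep's non-zero exit
# status in that case.
list_volcano_webhooks() {
  "${KUBECTL:-kubectl}" get \
    validatingwebhookconfigurations,mutatingwebhookconfigurations -o name \
    | grep -i volcano || true
}
```

If `list_volcano_webhooks` still prints entries after the deletion, those names are the ones to pass to `kubectl delete`.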

LeonardAukea avatar Mar 19 '22 09:03 LeonardAukea

Here are some webhook improvements I would like to see:

  1. Instead of generating the webhook in code, define it in YAML and deploy it with YAML. This makes it easy to change the webhook's scope and to exclude some pods.
  2. If possible, narrow the webhook's scope. It currently intercepts all pods, so once Volcano has a problem, the entire cluster cannot create pods.

hwdef avatar Apr 01 '22 08:04 hwdef

I agree with you.

For the first point: the certificate may be difficult to generate in YAML. I understand why a Job is used to generate it now: it makes installation easy for users, but it did not consider how to clean things up gracefully. My suggestion is to add a separate Job for cleanup, just as the installation does.

For the second point: I think it may affect the whole design because of how pod resource usage is calculated, so it is not a very simple problem (I'm not very familiar with it yet). All in all, running Volcano in HA can avoid this.

whybeyoung avatar Apr 05 '22 05:04 whybeyoung

I think the simplest solution for now is to add a delete hook (Job) to delete these webhooks individually :)

shinytang6 avatar Apr 28 '22 12:04 shinytang6

https://kubernetes.io/zh/docs/reference/access-authn-authz/extensible-admission-controllers/#matching-requests-namespaceselector

We should exclude pods of kube-system and volcano-system
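
As a rough illustration of that exclusion, the sketch below patches a `namespaceSelector` onto one webhook entry so that requests from kube-system and volcano-system are skipped. It assumes a cluster where namespaces carry the automatic `kubernetes.io/metadata.name` label, and that the webhook to patch is at the given index of the configuration; `KUBECTL` is a hypothetical override for dry runs. Volcano may regenerate its webhook configuration and overwrite a manual patch, so this is not necessarily how the project itself fixed the issue.

```shell
#!/bin/sh
# Sketch: exclude kube-system and volcano-system from one mutating webhook.
# $1: name of the MutatingWebhookConfiguration
# $2: index of the webhook entry inside it (assumed to be 0 if omitted)
exclude_system_namespaces() {
  name="$1"
  index="${2:-0}"
  # JSON Patch that sets namespaceSelector on the chosen webhook entry.
  patch='[{"op":"add","path":"/webhooks/'"$index"'/namespaceSelector","value":{"matchExpressions":[{"key":"kubernetes.io/metadata.name","operator":"NotIn","values":["kube-system","volcano-system"]}]}}]'
  "${KUBECTL:-kubectl}" patch mutatingwebhookconfiguration "$name" \
    --type=json -p "$patch"
}
```

For example, `KUBECTL=echo exclude_system_namespaces volcano-admission-service-pods-mutate` prints the patch command that would be sent.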

hwdef avatar Apr 29 '22 06:04 hwdef

For future users with this issue, I resolved it with:

kubectl delete validatingwebhookconfigurations volcano-admission-service-pods-validate volcano-admission-service-jobs-validate volcano-admission-service-queues-validate
kubectl delete mutatingwebhookconfigurations volcano-admission-service-podgroups-mutate volcano-admission-service-jobs-mutate volcano-admission-service-pods-mutate volcano-admission-service-queues-mutate

pietermarsman avatar May 03 '22 12:05 pietermarsman

I have deleted the validatingwebhookconfigurations and mutatingwebhookconfigurations as you mentioned, but when I run "kubectl apply -f mpi-example.yaml", it only outputs "job.batch.volcano.sh/lm-mpi-job configured", and "kubectl get pod" outputs "No resources found in default namespace". How can I solve this problem?

kongjibai avatar May 05 '22 10:05 kongjibai

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Aug 10 '22 03:08 stale[bot]

fix in https://github.com/volcano-sh/volcano/pull/2346 /close

hwdef avatar Aug 12 '22 05:08 hwdef

@hwdef: Closing this issue.

volcano-sh-bot avatar Aug 12 '22 05:08 volcano-sh-bot