
-webhook-fail-on-error can't be configured with the helm chart

Dlougach opened this issue 1 year ago · 5 comments

There is a pull request addressing this issue (#1685), but nobody has reviewed it.

Dlougach · Jan 09 '24 15:01

A workaround for the time being is to use a Helm post-renderer script that looks like:

yq '. |= (select(.kind == "Deployment") | .spec.template.spec.containers[0].args += "-webhook-fail-on-error=true")' -
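For anyone unfamiliar with the mechanism: a Helm post-renderer is any executable that receives the fully rendered manifests on stdin and writes the (possibly patched) manifests to stdout. A minimal wrapper around the yq command above might look like this (the script name is an assumption):

```shell
#!/usr/bin/env sh
# Hypothetical post-renderer, e.g. saved as ./add-webhook-flag.sh.
# Helm pipes all rendered manifests to stdin; the script must echo
# everything back on stdout, appending the extra arg to the Deployment.
exec yq '. |= (select(.kind == "Deployment") | .spec.template.spec.containers[0].args += "-webhook-fail-on-error=true")' -
```

It is then wired in at install/upgrade time with Helm's `--post-renderer ./add-webhook-flag.sh` flag.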

Dlougach · Jan 10 '24 16:01

Hi @Dlougach, thanks for this fix! Can you describe what the -webhook-fail-on-error=true arg does?

For context, on my deployment the webhook certs appear to expire after some time; after that, the spark operator creates invalid driver pods (missing the PVC mount) and my driver pods start to fail. The spark-operator pod does not crash, though; it just logs TLS errors.

Rolling the spark-operator deployment seems to resolve the issue, so my solution was to create a simple Kubernetes CronJob that runs kubectl rollout restart. But if this argument essentially restarts the pod, that would be way better! LMK!
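The CronJob workaround described above might be sketched roughly as follows (names, schedule, and image are assumptions; the referenced ServiceAccount needs RBAC permission to patch the spark-operator Deployment):

```yaml
# Hypothetical sketch of the "rollout restart" CronJob workaround.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: spark-operator-restarter
spec:
  schedule: "0 3 * * *"   # once a day, before the certs can go stale
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: spark-operator-restarter
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - kubectl
                - rollout
                - restart
                - deployment/spark-operator
```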

domenicbove · Feb 02 '24 17:02

Can you describe what the -webhook-fail-on-error=true arg does?

It ensures that if the Kubernetes API server can't reach the webhook, pod creation fails (the default behaviour is that the pod is created as-is, i.e. in the configuration it was submitted with, without the webhook's mutations).
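If I understand the flag correctly, in Kubernetes terms it governs the failurePolicy on the operator's MutatingWebhookConfiguration: Fail rejects pod creation when the webhook is unreachable, while Ignore (the default behaviour described above) admits the pod unmutated. Illustratively (the resource and webhook names here are assumptions):

```yaml
# Illustrative fragment only; names are assumptions.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: spark-webhook-config
webhooks:
  - name: webhook.sparkoperator.k8s.io
    # Fail  -> reject pods when the webhook can't be reached
    #          (what -webhook-fail-on-error=true asks for)
    # Ignore -> admit pods unmutated, e.g. missing the PVC mount
    failurePolicy: Fail
```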

Dlougach · Feb 02 '24 17:02

Hmmm, that sounds helpful. In my case, though, the webhook wasn't working and the pods were getting created incorrectly; what I really need is for the cert-regeneration script to run in that situation. Like, I'd hope this could be self-healing. Any thoughts?

domenicbove · Feb 07 '24 20:02

@domenicbove, if it's any help, I had a similar issue, and raising a PR here doesn't seem like it would do any good. Rather than fork the chart, I was able to patch the value into the deployment, since we're using helmfile, which has Kustomize support.

Our operator stopped working for a couple of reasons.

We had several deployments of the operator in different namespaces, and they were all overwriting the same webhook config. Previously running operators would start failing when new ones were deployed. This was fixed by properly setting fullnameOverride so each webhook config was unique.

We also found that the helm hooks were causing issues when we deployed: they would get kicked off, but the pods weren't rolled, so the running operator no longer had the correct certs and would start to fail. We fixed that by changing the hook values for the init and cleanup jobs to remove the pre-upgrade step, and we haven't had any issues since.
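For reference, the Kustomize-based patch mentioned above might look something like this (the Deployment name and the assumption that args already exists on the first container are guesses about the rendered chart):

```yaml
# Hypothetical kustomization.yaml fragment applied on top of the
# rendered chart; helmfile can wire this in via its Kustomize support.
patches:
  - target:
      kind: Deployment
      name: spark-operator
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: "-webhook-fail-on-error=true"
```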

schaffino · Feb 09 '24 02:02

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] · Jul 25 '24 05:07

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] · Aug 14 '24 06:08