MutatingWebhook stop working after a couple of days
Hi,
I've got a strange problem. After a couple of days (maybe 10 to 20 days), the MutatingWebhook stopped working. The 'spark-operator' pod was still there without any restarts (I can see the restart count is 0). My Spark app was not able to restart. However, when I manually restarted the 'spark-operator' deployment, my Spark app restarted successfully. This issue has happened twice. I just wonder what the potential causes are. We did not make any changes to the cluster that would affect the spark-operator, as far as I know. Thanks.
Cheng
Do you have access to the api server logs?
Thanks @liyinan926 for the quick reply. How do I get the API server logs? I have access to the server.
Typically it's in a path (sorry, I don't remember exactly where) on the master node. Depending on the k8s environment, it may or may not be possible to access it.
We are using AWS EKS. I will try to search. Thanks for the pointer.
No problem. You might also want to ask for help/suggestion in the Slack channel.
Thanks a lot. I will do that.
@liyinan926 It seems I can't join the Slack channel. Do I need an invitation to join? Thanks.
OK, I checked the EKS cluster's "Control Plane Logging" section; it shows that API server logging is disabled, so there won't be any logs available. I will try to enable it for next time. Is the 'API server' log all I need?
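For reference, I think control-plane logging can be enabled with an eksctl ClusterConfig roughly like the sketch below (this assumes the cluster is managed with eksctl; the name and region are placeholders), applied with something like eksctl utils update-cluster-logging:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster     # placeholder cluster name
  region: us-east-1    # placeholder region
cloudWatch:
  clusterLogging:
    # enable only the API server log stream discussed above
    enableTypes: ["api"]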
@cwei-bgl You said that the mutating webhook stopped working... to what effect? We are observing that driver pods are still scheduled, but are missing the aspects that are added by the webhook (volume mounts, tolerations). See https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1004. We are also on EKS and had been using the spark operator with EKS 1.15 for months without problems. Now we are migrating to EKS 1.18, but hoped to keep the operator on 1.1.0-2.4.5 (because we cannot do the migration to Spark 3 just now) and are seeing this issue. We could still try the most recent 1.1.x, but I doubt that this will change anything.
@jgoeres We are on EKS 1.16. We used to run Spark 2.4.5 but upgraded to Spark 3 about two months ago. At the same time, I also upgraded the spark-operator to the latest: helm chart spark-operator-1.0.6, v1beta2-1.2.0-3.0.0. It seems we started to experience this issue after we upgraded. The following is what I have just got after 2 days of running, with no problems so far.
Status:
  Application State:
    State:  RUNNING
  Driver Info:
    Pod Name:             data-spark-driver
    Web UI Address:       172.20.188.135:4040
    Web UI Port:          4040
    Web UI Service Name:  data-spark-ui-svc
  Execution Attempts:     8
  Executor State:
    data-b6341777d3773d64-exec-1:  FAILED
    data-b6341777d3773d64-exec-2:  FAILED
    data-b6341777d3773d64-exec-3:  RUNNING
    data-b6341777d3773d64-exec-4:  RUNNING
  Last Submission Attempt Time:    2021-02-24T09:55:28Z
  Spark Application Id:            spark-b2a92e01be06473a806e695314493266
  Submission Attempts:             1
  Submission ID:                   ab6447f8-bb1f-4420-b150-2a454f78f377
  Termination Time:                <nil>
Events:  <none>
@weicheng113 which version of the operator were you on that didn't have the issue?
@liyinan926 Thanks. I don't remember clearly. But I think it should be v1beta2-1.1.2-2.4.5, as we used to run on spark 2.4.5.
PS: @liyinan926 @jgoeres to clarify @weicheng113 and @cwei-bgl are the same person :)
Not directly related to solving the issues here, but the right direction IMO, for at least Spark 3.0, is to move away from the webhook and instead rely on the pod template feature in Spark 3.x. The operator can generate such templates internally and use them when submitting Spark applications. This requires a good amount of work though.
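For reference, the Spark 3.x pod template feature means pointing spark.kubernetes.driver.podTemplateFile / spark.kubernetes.executor.podTemplateFile at a plain pod manifest, roughly like the sketch below (the volume, toleration, and container names are just illustrative), which Spark then merges into the pods it creates:

apiVersion: v1
kind: Pod
spec:
  tolerations:
    - key: dedicated          # illustrative toleration
      operator: Equal
      value: spark
      effect: NoSchedule
  volumes:
    - name: data-volume       # illustrative volume
      persistentVolumeClaim:
        claimName: example-pvc
  containers:
    - name: driver            # illustrative; Spark merges its own driver container settings into this container
      volumeMounts:
        - name: data-volume
          mountPath: /data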
@liyinan926 That is great news. We can wait for that. Our app does not have an urgent real-time processing requirement. We have an alert system that tells us when there are too many messages in the queue. We can always manually restart for now, and we have implemented Spark checkpointing, so it is OK.
@weicheng113 sounds good, will file an issue to track that.
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1176
We did not change the operator version at all, so I suspect that some change between EKS 1.15 and 1.18 is causing this.
@weicheng113 You say that you are on 1.16 - have you been using that version all the time, or did you upgrade to it at some point? Was it initially working OK on 1.16 and is now suddenly breaking?
I am asking because AWS will roll out K8s patch version updates without prior warning; you can see that when you check the "platform version" of your cluster in the console. This is the "platform version" history for EKS 1.16:
https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html#platform-versions-1.16
As you can see, they sometimes only add features to EKS itself, while at other times they also update the K8s version. EKS 1.16 started on K8s 1.16.8, then went to 1.16.13 with platform version eks.3, and to 1.16.15 with eks.4.
The reason why I am asking so specifically about this is that the whole topic reminds me of a similar issue we had before we went into production, where a K8s patch update broke Spark:
https://issues.apache.org/jira/browse/SPARK-28921
People on EKS, AKS and GKE suddenly experienced Spark apps failing that worked fine before.
Our old version is currently still running fine on EKS 1.15, platform version eks.5 (i.e., Kubernetes 1.15.12); maybe whatever change was made to either K8s or EKS does not affect that EKS version? We are seeing this on EKS 1.18, platform version eks.3 (i.e., K8s 1.18.9).
To see if it is the Kubernetes version that is the cause, I deployed our app into a Minikube with Kubernetes 1.18.9 - so far, the issue hasn't surfaced there, but naturally, Minikube is very different from EKS.
Anyway, since you said that it worked with 1.1.2-2.4.5, I will try that one and see if this changes anything.
I second the idea to move away from the webhook approach in the long run, but for us this whole topic is really urgent - we have to update our EKS version, as EKS 1.15 will go out of support soon. Since the webhook seems to handle both volumes and tolerations (and we need both), we cannot do without it. So I would appreciate it if someone with more insight into the whole topic could look into this.
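To make concrete what we depend on: in our SparkApplication specs, fields roughly like the following only make it onto the driver/executor pods when the mutating webhook is working (the names below are placeholders, not our real manifest):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-app            # placeholder
spec:
  volumes:
    - name: data-volume        # placeholder volume
      persistentVolumeClaim:
        claimName: example-pvc
  driver:
    volumeMounts:              # injected into the driver pod by the webhook
      - name: data-volume
        mountPath: /data
    tolerations:               # also applied via the webhook
      - key: dedicated
        operator: Equal
        value: spark
        effect: NoSchedule
  executor:
    volumeMounts:
      - name: data-volume
        mountPath: /data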
@jgoeres I can't be 100% sure whether the EKS cluster was upgraded during that whole period, as another member of my team can also apply changes. If you want, I can ask him whether he did any upgrade during that period.
@weicheng113
I guess it makes sense to check if an update was done. But I am not primarily referring to an EKS/Kubernetes "minor version" update (e.g., from EKS 1.15 to EKS 1.16) but to an update of the EKS platform version (e.g., from EKS 1.16 eks.3 to eks.4). These updates AFAIK only affect the control plane, and the end user cannot control them - AWS decides when to update the clusters. So if you remember the platform version you started with, and check which one you have now, this might also be helpful.
@jgoeres OK, I will talk to my team member next week when we get back to work, and get back to you.
@jgoeres I confirmed with my team member. No, we did not upgrade EKS during the period between upgrading the spark-operator and now.
@weicheng113 I wonder how you are installing the operator - using the YAML deployment files or the helm chart? Are you using a GitOps tool like ArgoCD?
@jgoeres I am using the helm chart.
@weicheng113 Check my last entry in https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1179, where I describe what I am quite sure is causing this.
It looks like overriding the helm hook so that the webhook-cleanup-job does not run on helm upgrade (and thus avoiding cert regeneration without a spark-operator pod restart) helps to mitigate the issue:
webhook:
  enable: true
  cleanupAnnotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": hook-succeeded
Fwiw, here's how to join slack: https://slack.k8s.io/
Hi,
This issue happens when the certificate registered in the spark-operator's MutatingWebhookConfiguration and the certificate the spark-operator itself serves no longer match. Because the certificates differ, the webhook fails to update Spark resources and inject configuration.
As a temporary fix I've implemented a livenessProbe for the Deployment that checks whether the mutating webhook has this mismatch and restarts the container to refresh the certificates so they match again. Seems to be working so far:
livenessProbe:
  initialDelaySeconds: 1
  periodSeconds: 1
  failureThreshold: 1
  exec:
    command:
      - sh
      - -c
      - |
        set -e
        # Fetch the caBundle currently registered in the MutatingWebhookConfiguration
        curl -iks -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
          https://kubernetes.default.svc/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/{{ include "spark-operator.fullname" . }}-webhook-config \
          | grep -o '"caBundle": "[^"]*"' \
          | awk -F'"' '{print $4}' \
          | base64 -d > /tmp/expected_ca_bundle.crt
        # Compare it with the CA cert the operator pod is actually serving with
        expected_ca_bundle=$(cat /etc/webhook-certs/ca-cert.pem)
        actual_ca_bundle=$(cat /tmp/expected_ca_bundle.crt)
        if [ "$expected_ca_bundle" != "$actual_ca_bundle" ]; then
          exit 1
        fi
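For context, the caBundle the probe extracts lives in the MutatingWebhookConfiguration itself; trimmed down, that resource looks roughly like this (the name follows the chart's <fullname>-webhook-config pattern used above, the rest of the entry is illustrative rather than copied from a cluster):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: spark-operator-webhook-config    # assumes a release named "spark-operator"
webhooks:
  - name: webhook.sparkoperator.k8s.io   # illustrative webhook entry
    clientConfig:
      caBundle: <base64-encoded CA>      # compared against /etc/webhook-certs/ca-cert.pem by the probe
      service:
        name: spark-operator-webhook-svc # illustrative
        namespace: spark-operator        # illustrative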
@artur-bolt I have tested the provided temporary solution, but I am not sure I understand when this mismatch happens:
- A restart of the spark-operator pod?
- A new spark-operator pod (after the previous one was killed)?
Did you manage to work out the sequence that triggers this issue?
Thanks