MutatingWebhook stop working after a couple of days
Hi,
I've got a strange problem. After a couple of days (maybe 10 to 20 days), the MutatingWebhook stopped working. The 'spark-operator' pod was still there without any restarts (I can see the restart count is 0). My Spark app was not able to restart. However, when I manually restarted the 'spark-operator' deployment, my Spark app restarted successfully. This issue has happened twice. I just wonder what the potential causes are. We did not make any changes to the cluster that would affect the spark-operator, as far as I know. Thanks.
Cheng
Do you have access to the api server logs?
Thanks @liyinan926 for the quick reply. How do I get the API server logs? I have access to the server.
Typically it's in a path (sorry, I don't remember exactly where) on the master node. Depending on the k8s environment, it may or may not be possible to access it.
We are using AWS EKS. I will try to search. Thanks for the pointer.
No problem. You might also want to ask for help/suggestion in the Slack channel.
Thanks a lot. I will do that.
@liyinan926 It seems I can't join the Slack channel. Do I need an invitation to join? Thanks.
OK, I checked the EKS cluster's "Control Plane Logging" section; it shows that API server logging is disabled, so there won't be any logs available. I will try to enable it for next time. Is the 'API server' log all I need?
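For reference, I think control-plane logging can be enabled with an eksctl ClusterConfig roughly like the sketch below (this assumes the cluster is managed with eksctl; the name and region are placeholders), applied with something like eksctl utils update-cluster-logging:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster     # placeholder cluster name
  region: us-east-1    # placeholder region
cloudWatch:
  clusterLogging:
    # enable only the API server log stream discussed above
    enableTypes: ["api"]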
@cwei-bgl You said that the mutating webhook stopped working... to what effect? We are observing that driver pods are still scheduled, but are missing the aspects that are added by the webhook (volume mounts, tolerations). See https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1004. We are also on EKS and had been using the spark operator with EKS 1.15 for months without problems. Now we are migrating to EKS 1.18, but hoped to keep the operator on 1.1.0-2.4.5 (because we cannot do the migration to Spark 3 just now) and are seeing this issue. We could still try the most recent 1.1.x, but I doubt that this will change anything.
@jgoeres We are on EKS 1.16. We used to run Spark 2.4.5 but upgraded to Spark 3 about two months ago. At the same time, I also upgraded the spark-operator to the latest: helm chart spark-operator-1.0.6, v1beta2-1.2.0-3.0.0. It seems we started to experience this issue after we upgraded. The following is what I have just got after 2 days of running, with no problems so far.
Status:
  Application State:
    State:  RUNNING
  Driver Info:
    Pod Name:             data-spark-driver
    Web UI Address:       172.20.188.135:4040
    Web UI Port:          4040
    Web UI Service Name:  data-spark-ui-svc
  Execution Attempts:     8
  Executor State:
    data-b6341777d3773d64-exec-1:  FAILED
    data-b6341777d3773d64-exec-2:  FAILED
    data-b6341777d3773d64-exec-3:  RUNNING
    data-b6341777d3773d64-exec-4:  RUNNING
  Last Submission Attempt Time:    2021-02-24T09:55:28Z
  Spark Application Id:            spark-b2a92e01be06473a806e695314493266
  Submission Attempts:             1
  Submission ID:                   ab6447f8-bb1f-4420-b150-2a454f78f377
  Termination Time:                <nil>
Events:  <none>
@weicheng113 which version of the operator were you on that didn't have the issue?
@liyinan926 Thanks. I don't remember clearly. But I think it should be v1beta2-1.1.2-2.4.5, as we used to run on spark 2.4.5.
PS: @liyinan926 @jgoeres to clarify @weicheng113 and @cwei-bgl are the same person :)
Not directly related to solving the issues here, but the right direction IMO, for at least Spark 3.0, is to move away from the webhook and instead rely on the pod template feature in Spark 3.x. The operator can generate such templates internally and use them when submitting Spark applications. This requires a good amount of work though.
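For reference, the Spark 3.x pod template feature means pointing spark.kubernetes.driver.podTemplateFile / spark.kubernetes.executor.podTemplateFile at a plain pod manifest, roughly like the sketch below (the volume, toleration, and container names are just illustrative), which Spark then merges into the pods it creates:

apiVersion: v1
kind: Pod
spec:
  tolerations:
    - key: dedicated          # illustrative toleration
      operator: Equal
      value: spark
      effect: NoSchedule
  volumes:
    - name: data-volume       # illustrative volume
      persistentVolumeClaim:
        claimName: example-pvc
  containers:
    - name: driver            # illustrative; Spark merges its own driver container settings into this container
      volumeMounts:
        - name: data-volume
          mountPath: /data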
@liyinan926 That is great news. We can wait for that. Our app does not have an urgent real-time processing requirement. We have an alert system that tells us when there are too many messages in the queue. We can always manually restart for now, and we have implemented Spark checkpointing, so it is OK.
@weicheng113 sounds good, will file an issue to track that.
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1176
We did not change the operator version at all, so I suspect that some change between EKS 1.15 and 1.18 is causing this.
@weicheng113 You say that you are on 1.16 - have you been using that version all the time, or did you upgrade to it at some point? Was it initially working OK on 1.16 and is now suddenly breaking?
I am asking because AWS will roll out K8s patch version updates without prior warning; you can see that when you check the "platform version" of your cluster in the console. This is the "platform version" history for EKS 1.16:
https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html#platform-versions-1.16
As you can see, they sometimes only add features to EKS itself, while at other times they also update the K8s version. EKS 1.16 started on K8s 1.16.8, then went to 1.16.13 with platform version eks.3, and to 1.16.15 with eks.4.
The reason why I am asking so specifically about this is that the whole topic reminds me of a similar issue we had before we went into production, where a K8s patch update broke Spark:
https://issues.apache.org/jira/browse/SPARK-28921
People on EKS, AKS and GKE suddenly experienced Spark apps failing that worked fine before.
Our old version is currently still running fine on EKS 1.15, platform version eks.5 (i.e., Kubernetes 1.15.12); maybe whatever change was made to either K8s or EKS does not affect that EKS version? We are seeing this on EKS 1.18, platform version eks.3 (i.e., K8s 1.18.9).
To see if it is the Kubernetes version that is the cause, I deployed our app into a Minikube with Kubernetes 1.18.9 - so far, the issue hasn't surfaced there, but naturally, Minikube is very different from EKS.
Anyway, since you said that it worked with 1.1.2-2.4.5, I will try that one and see if this changes anything.
I second the idea to move away from the webhook approach in the long run, but for us this whole topic is really urgent - we have to update our EKS version, as EKS 1.15 will go out of support soon. Since the webhook seems to handle both volumes and tolerations (and we need both), we cannot do without it. So I would appreciate it if someone with more insight into the whole topic could look into this.
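To make concrete what we depend on: in our SparkApplication specs, fields roughly like the following only make it onto the driver/executor pods when the mutating webhook is working (the names below are placeholders, not our real manifest):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-app            # placeholder
spec:
  volumes:
    - name: data-volume        # placeholder volume
      persistentVolumeClaim:
        claimName: example-pvc
  driver:
    volumeMounts:              # injected into the driver pod by the webhook
      - name: data-volume
        mountPath: /data
    tolerations:               # also applied via the webhook
      - key: dedicated
        operator: Equal
        value: spark
        effect: NoSchedule
  executor:
    volumeMounts:
      - name: data-volume
        mountPath: /data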
@jgoeres I can't be 100% sure whether the EKS cluster was upgraded during that whole period, as another member of my team can also apply changes. If you want, I can ask him whether he did any upgrade during that period.
@weicheng113
I guess it makes sense to check if an update was done. But I am not primarily referring to an EKS/Kubernetes "minor version" update (e.g., from EKS 1.15 to EKS 1.16) but to an update of the EKS platform version (e.g., from EKS 1.16 eks.3 to eks.4). These updates AFAIK only affect the control plane, and the end user cannot control them - AWS decides when to update the clusters. So if you remember the platform version you started with, and check which one you have now, this might also be helpful.
@jgoeres OK, I will talk to my team member next week when we get back to work, and get back to you.
@jgoeres I confirmed with my team member. No, we did not upgrade EKS during the period between upgrading the spark-operator and now.
@weicheng113 I wonder how you are installing the operator - using the YAML deployment files or the helm chart? Are you using a GitOps tool like ArgoCD?
@jgoeres I am using the helm chart.
@weicheng113 Check my last entry in https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1179, where I describe what I am quite sure is causing this.
It looks like overriding the helm hook so that the webhook-cleanup-job does not run on helm upgrade (and thus avoiding cert regeneration without a spark-operator pod restart) helps to mitigate the issue:
webhook:
  enable: true
  cleanupAnnotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": hook-succeeded
Fwiw, here's how to join slack: https://slack.k8s.io/
Hi,
This issue happens when the certificate registered in the spark-operator's MutatingWebhookConfiguration and the certificate the spark-operator itself serves no longer match. Because the certificates differ, the webhook fails to update Spark resources and inject configuration.
As a temporary fix I've implemented a livenessProbe for the Deployment that checks whether the mutating webhook has this mismatch and restarts the container to refresh the certificates so they match again. Seems to be working so far:
livenessProbe:
  initialDelaySeconds: 1
  periodSeconds: 1
  failureThreshold: 1
  exec:
    command:
      - sh
      - -c
      - |
        set -e
        # Fetch the caBundle currently registered in the MutatingWebhookConfiguration
        curl -iks -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
          https://kubernetes.default.svc/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/{{ include "spark-operator.fullname" . }}-webhook-config \
          | grep -o '"caBundle": "[^"]*"' \
          | awk -F'"' '{print $4}' \
          | base64 -d > /tmp/expected_ca_bundle.crt
        # Compare it with the CA cert the operator pod is actually serving with
        expected_ca_bundle=$(cat /etc/webhook-certs/ca-cert.pem)
        actual_ca_bundle=$(cat /tmp/expected_ca_bundle.crt)
        if [ "$expected_ca_bundle" != "$actual_ca_bundle" ]; then
          exit 1
        fi
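For context, the caBundle the probe extracts lives in the MutatingWebhookConfiguration itself; trimmed down, that resource looks roughly like this (the name follows the chart's <fullname>-webhook-config pattern used above, the rest of the entry is illustrative rather than copied from a cluster):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: spark-operator-webhook-config    # assumes a release named "spark-operator"
webhooks:
  - name: webhook.sparkoperator.k8s.io   # illustrative webhook entry
    clientConfig:
      caBundle: <base64-encoded CA>      # compared against /etc/webhook-certs/ca-cert.pem by the probe
      service:
        name: spark-operator-webhook-svc # illustrative
        namespace: spark-operator        # illustrative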
@artur-bolt I have tested the provided temporary solution, but I am not sure I understand when this mismatch happens:
- A restart of the spark-operator pod?
- A new spark-operator pod (after the previous one was killed)?
Did you manage to work out the sequence that triggers this issue?
Thanks