Intermittent Sidecar Injection Failure for SparkApplication Resources
I am experiencing intermittent failures with sidecar container injection for `SparkApplication` resources managed by the Spark Operator. The sidecars are injected and created successfully in roughly 3 out of 5 attempts; in the remaining cases the injection fails without any clear errors in the logs or events related to the sidecar creation process.
Environment
- Spark Operator Version: latest
- Kubernetes Version: 1.25
- Cloud Provider: AWS - EKS
- Installation Method: Helm chart
Steps to Reproduce
- Deploy the Spark Operator with mutating admission webhooks enabled.
- Create a `SparkApplication` manifest that includes a sidecar container specification (a minimal example is sketched after this list).
- Apply the manifest with `kubectl apply -f sparkapp.yaml`.
- Observe the creation of the Spark application pods and the intermittent absence of the specified sidecar containers.
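For reference, a minimal manifest along these lines exercises the sidecar injection path. All names and images are placeholders, and the exact fields available under `driver`/`executor` depend on the `sparkoperator.k8s.io/v1beta2` CRD version installed; the point is only that the sidecar is declared in the spec and relies on the operator's mutating webhook to be injected into the pod.

```sh
# Apply a minimal SparkApplication that declares a sidecar on the driver.
# All names/images below are placeholders; adjust to your cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi-with-sidecar
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: 3.5.0
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark   # placeholder; use the SA created by your Helm release
    sidecars:                              # applied to the driver pod via the mutating webhook
      - name: log-forwarder
        image: busybox:1.36
        command: ["sh", "-c", "tail -F /var/log/spark/app.log 2>/dev/null || sleep infinity"]
  executor:
    instances: 1
    cores: 1
    memory: 512m
EOF
```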
Expected Behavior
Every `SparkApplication` resource should consistently result in Spark application pods that include the specified sidecar containers.
Actual Behavior
Sidecar containers are only being injected into the Spark application pods in approximately 3 out of 5 attempts. The failures do not coincide with clear errors in the Spark Operator logs, Kubernetes events, or mutating webhook configurations.
Troubleshooting Steps Undertaken
- Reviewed Spark Operator logs for errors or warnings related to sidecar injection.
- Checked mutating webhook configurations for correctness (see the command sketch after this list).
- Inspected Kubernetes events for any signs of failed operations or errors during pod creation.
- Validated that `SparkApplication` manifests are correctly formatted and consistent across successful and unsuccessful attempts.
- Observed no clear patterns in the failures regarding specific nodes, times, or cluster conditions.
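For the webhook configuration check, roughly the following commands can be used to compare what the API server is configured to trust with the certificate the operator actually serves. The MutatingWebhookConfiguration name, Secret name, and Secret key below are placeholders that depend on the Helm release name, namespace, and chart version.

```sh
# Resource names below are placeholders; list them first and adjust.
kubectl get mutatingwebhookconfigurations

# 1. Inspect the spark-operator webhook entry: clientConfig.service, failurePolicy, namespaceSelector.
kubectl get mutatingwebhookconfiguration spark-operator-webhook-config -o yaml

# 2. Decode the caBundle the API server uses to verify the webhook's serving certificate.
kubectl get mutatingwebhookconfiguration spark-operator-webhook-config \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -subject -enddate

# 3. Inspect the serving certificate the operator mounts (Secret and key names vary by chart version).
kubectl get secret spark-operator-webhook-certs -n spark-operator \
  -o jsonpath='{.data.server-cert\.pem}' | base64 -d | openssl x509 -noout -subject -ext subjectAltName
```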
I am seeking guidance on further troubleshooting steps or configurations I might have overlooked. Additionally, any insights into known issues, workarounds, or fixes would be greatly appreciated.
Do the other pods not even show up or are they just not functioning?
I don't know if this will help, but if you can use k8s 1.28 or 1.29, there is a new sidecar lifecycle feature (on by default in 1.29) [1] that might help orchestrate startup and shutdown. I have not been able to try it yet, but I am interested in it for proper graceful termination of pods. We see issues with autoscalers (not in Spark) where the SIGKILL of the istio container brings down the whole pod before the specified graceful termination period has elapsed.
Check out the feature gates `PodReadyToStartContainersCondition` and `SidecarContainers` [2].
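For context, on 1.28+ the native sidecar is just an init container with `restartPolicy: Always`. The sketch below is a generic (non-Spark) example that only shows the shape of the API, not anything the Spark Operator does itself.

```sh
# Generic native-sidecar example for Kubernetes 1.28+ (SidecarContainers feature gate,
# enabled by default in 1.29). The sidecar starts before the main container, restarts
# on failure, and is terminated only after the main container exits.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: native-sidecar-demo
spec:
  volumes:
    - name: logs
      emptyDir: {}
  initContainers:
    - name: log-shipper
      image: busybox:1.36
      restartPolicy: Always            # this is what marks it as a native sidecar
      command: ["sh", "-c", "tail -F /var/log/app/app.log"]
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  containers:
    - name: main
      image: busybox:1.36
      command: ["sh", "-c", "while true; do date >> /var/log/app/app.log; sleep 5; done"]
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
EOF
```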
[1] https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/#sidecar-containers-and-pod-lifecycle
[2] https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
Thank you for your response, @jkleckner.
After reviewing the API server logs, I identified an intermittent error that occurs in the instances when sidecar creation fails:
Failed calling webhook, failing open webhook.sparkoperator.k8s.io: failed calling webhook "webhook.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook.spark-operator.svc:443/webhook?timeout=30s": tls: failed to verify certificate: x509: certificate is valid for metrics-server.kube-system.svc, not spark-operator-webhook.spark-operator.svc
It looks like the issue lies in TLS certificate verification. The error indicates that the certificate presented by the webhook server is not valid for the DNS name the Kubernetes API server tried to connect to, `spark-operator-webhook.spark-operator.svc`; instead, the certificate is valid for `metrics-server.kube-system.svc`, which is a different service within the cluster.
Given that the error occurs only intermittently and the sidecar is injected successfully most of the time, the webhook configuration and certificates appear to be basically correct, which leaves me confused about the underlying cause. It doesn't seem to be a straightforward TLS misconfiguration, since that would presumably fail on every request. I'm still troubleshooting to find the root cause.
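In the meantime, this is roughly how I am checking which pods actually back the webhook Service and which certificate the endpoint presents, in case some requests are being routed to the wrong backend. Resource names are placeholders for whatever the Helm release created.

```sh
# Which pods back the webhook Service? An overlapping selector or stale endpoint could
# explain an intermittent mismatch where another service's certificate is presented.
kubectl get endpoints spark-operator-webhook -n spark-operator -o wide
kubectl get svc spark-operator-webhook -n spark-operator -o yaml

# From inside the cluster, inspect the certificate actually served on port 443
# (any image that ships openssl will do).
kubectl run tls-probe -n spark-operator --rm -i --restart=Never --image=alpine/openssl --command -- \
  sh -c 'echo | openssl s_client -connect spark-operator-webhook.spark-operator.svc:443 2>/dev/null | openssl x509 -noout -subject -ext subjectAltName'
```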
Is it possible this might get resolved by https://github.com/kubeflow/spark-operator/pull/2083?