Intermittent Sidecar Injection Failure for SparkApplication Resources
I am experiencing intermittent failures with sidecar container injection for `SparkApplication` resources managed by the Spark Operator. The sidecars are injected and created successfully in roughly 3 out of 5 attempts; in the remaining cases the injection fails without any clear errors in the logs or events related to the sidecar creation process.
Environment
- Spark Operator Version: latest
- Kubernetes Version: 1.25
- Cloud Provider: AWS - EKS
- Installation Method: Helm chart
Steps to Reproduce
- Deploy the Spark Operator with mutating admission webhooks enabled.
- Create a `SparkApplication` manifest that includes a sidecar container specification (a minimal example is sketched after this list).
- Apply the manifest with `kubectl apply -f sparkapp.yaml`.
- Observe the creation of the Spark application pods and the intermittent absence of the specified sidecar containers.
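For reference, a minimal manifest along these lines exercises the sidecar injection path. All names and images are placeholders, and the exact fields available under `driver`/`executor` depend on the `sparkoperator.k8s.io/v1beta2` CRD version installed; the point is only that the sidecar is declared in the spec and relies on the operator's mutating webhook to be injected into the pod.

```sh
# Apply a minimal SparkApplication that declares a sidecar on the driver.
# All names/images below are placeholders; adjust to your cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi-with-sidecar
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: 3.5.0
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark   # placeholder; use the SA created by your Helm release
    sidecars:                              # applied to the driver pod via the mutating webhook
      - name: log-forwarder
        image: busybox:1.36
        command: ["sh", "-c", "tail -F /var/log/spark/app.log 2>/dev/null || sleep infinity"]
  executor:
    instances: 1
    cores: 1
    memory: 512m
EOF
```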
Expected Behavior
Every `SparkApplication` resource should consistently result in Spark application pods that include the specified sidecar containers.
Actual Behavior
Sidecar containers are only being injected into the Spark application pods in approximately 3 out of 5 attempts. The failures do not coincide with clear errors in the Spark Operator logs, Kubernetes events, or mutating webhook configurations.
Troubleshooting Steps Undertaken
- Reviewed Spark Operator logs for errors or warnings related to sidecar injection.
- Checked mutating webhook configurations for correctness (see the command sketch after this list).
- Inspected Kubernetes events for any signs of failed operations or errors during pod creation.
- Validated that `SparkApplication` manifests are correctly formatted and consistent across successful and unsuccessful attempts.
- Observed no clear patterns in the failures regarding specific nodes, times, or cluster conditions.
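For the webhook configuration check, roughly the following commands can be used to compare what the API server is configured to trust with the certificate the operator actually serves. The MutatingWebhookConfiguration name, Secret name, and Secret key below are placeholders that depend on the Helm release name, namespace, and chart version.

```sh
# Resource names below are placeholders; list them first and adjust.
kubectl get mutatingwebhookconfigurations

# 1. Inspect the spark-operator webhook entry: clientConfig.service, failurePolicy, namespaceSelector.
kubectl get mutatingwebhookconfiguration spark-operator-webhook-config -o yaml

# 2. Decode the caBundle the API server uses to verify the webhook's serving certificate.
kubectl get mutatingwebhookconfiguration spark-operator-webhook-config \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -subject -enddate

# 3. Inspect the serving certificate the operator mounts (Secret and key names vary by chart version).
kubectl get secret spark-operator-webhook-certs -n spark-operator \
  -o jsonpath='{.data.server-cert\.pem}' | base64 -d | openssl x509 -noout -subject -ext subjectAltName
```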
I am seeking guidance on further troubleshooting steps or configurations I might have overlooked. Additionally, any insights into known issues, workarounds, or fixes would be greatly appreciated.
Do the other pods not even show up or are they just not functioning?
I don't know if this will help, but if you can use k8s 1.28 or 1.29, there is a new sidecar lifecycle feature (on by default in 1.29) [1] that might help orchestrate startup and shutdown. I have not been able to try it yet, but I am interested in it for proper graceful termination of pods. We see issues with autoscalers (not in Spark) where the SIGKILL of the istio container brings down the whole pod before the specified graceful termination period has elapsed.
Check out the feature gates `PodReadyToStartContainersCondition` and `SidecarContainers` [2].
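For context, on 1.28+ the native sidecar is just an init container with `restartPolicy: Always`. The sketch below is a generic (non-Spark) example that only shows the shape of the API, not anything the Spark Operator does itself.

```sh
# Generic native-sidecar example for Kubernetes 1.28+ (SidecarContainers feature gate,
# enabled by default in 1.29). The sidecar starts before the main container, restarts
# on failure, and is terminated only after the main container exits.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: native-sidecar-demo
spec:
  volumes:
    - name: logs
      emptyDir: {}
  initContainers:
    - name: log-shipper
      image: busybox:1.36
      restartPolicy: Always            # this is what marks it as a native sidecar
      command: ["sh", "-c", "tail -F /var/log/app/app.log"]
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  containers:
    - name: main
      image: busybox:1.36
      command: ["sh", "-c", "while true; do date >> /var/log/app/app.log; sleep 5; done"]
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
EOF
```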
[1] https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/#sidecar-containers-and-pod-lifecycle
[2] https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
Thank you for your response, @jkleckner.
After reviewing the API server logs, I identified an intermittent error that occurs in the instances when sidecar creation fails:
Failed calling webhook, failing open webhook.sparkoperator.k8s.io: failed calling webhook "webhook.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook.spark-operator.svc:443/webhook?timeout=30s": tls: failed to verify certificate: x509: certificate is valid for metrics-server.kube-system.svc, not spark-operator-webhook.spark-operator.svc
It looks like the issue lies in TLS certificate verification. The error indicates that the certificate presented by the webhook server is not valid for the DNS name the Kubernetes API server tried to connect to, `spark-operator-webhook.spark-operator.svc`; instead, the certificate is valid for `metrics-server.kube-system.svc`, which is a different service within the cluster.
Given that the error occurs only intermittently and the sidecar is injected successfully most of the time, the webhook configuration and certificates appear to be basically correct, which leaves me confused about the underlying cause. It doesn't seem to be a straightforward TLS misconfiguration, since that would presumably fail on every request. I'm still troubleshooting to find the root cause.
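In the meantime, this is roughly how I am checking which pods actually back the webhook Service and which certificate the endpoint presents, in case some requests are being routed to the wrong backend. Resource names are placeholders for whatever the Helm release created.

```sh
# Which pods back the webhook Service? An overlapping selector or stale endpoint could
# explain an intermittent mismatch where another service's certificate is presented.
kubectl get endpoints spark-operator-webhook -n spark-operator -o wide
kubectl get svc spark-operator-webhook -n spark-operator -o yaml

# From inside the cluster, inspect the certificate actually served on port 443
# (any image that ships openssl will do).
kubectl run tls-probe -n spark-operator --rm -i --restart=Never --image=alpine/openssl --command -- \
  sh -c 'echo | openssl s_client -connect spark-operator-webhook.spark-operator.svc:443 2>/dev/null | openssl x509 -noout -subject -ext subjectAltName'
```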
Is it possible this might get resolved by https://github.com/kubeflow/spark-operator/pull/2083?