spark-operator
spark-submit in spark-operator fails in IPv6 + Istio environment
I'm running Spark 3.1.1 on spark-operator with Istio 1.5.7 in an IPv6 environment. After submitting a job, I get the following exception:
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Invalid proxy server configuration
at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:201)
at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:67)
at org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:100)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$2(KubernetesClientApplication.scala:207)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2610)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.MalformedURLException: For input string: "105::1:443"
at java.net.URL.
I believe the IPv6 address is not formatted correctly; it should be `https://[105::1]:443`. But how do I pass this in the `kind: SparkApplication` YAML? Is there a specific property that overrides the proxy IP dynamically?
I tried setting the `HTTPS_PROXY`, `HTTP_PROXY`, and `HTTP2_DISABLE` environment variables for the driver and executor. I have also disabled Istio sidecar injection, since as I understand it, jobs don't work well with Istio.
We need to add square brackets (`[]`) when building the master URL in `submission.go`. Update `getMasterURL()` in `pkg/controller/sparkapplication/submission.go` as follows:
`return fmt.Sprintf("k8s://https://[%s]:%s", kubernetesServiceHost, kubernetesServicePort), nil`
We also need to add square brackets (`[]`) in `entrypoint.sh` where `$SPARK_EXECUTOR_POD_IP` is passed.
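The one-line change above always adds brackets. Per RFC 3986, brackets in a URL host are reserved for IPv6 (and IPvFuture) literals, so the unconditional form would presumably produce an invalid URL on IPv4 clusters. Below is a minimal sketch of a variant that brackets only when needed, using `net.JoinHostPort`, which wraps the host in `[]` only if it contains a colon. The helper names `buildMasterURL` and `getMasterURLFromEnv` are hypothetical and not the operator's actual code; only the `KUBERNETES_SERVICE_HOST` / `KUBERNETES_SERVICE_PORT` variables are the standard ones Kubernetes injects into pods.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

const (
	hostEnvVar = "KUBERNETES_SERVICE_HOST"
	portEnvVar = "KUBERNETES_SERVICE_PORT"
)

// buildMasterURL (hypothetical helper) formats a Spark master URL for the
// Kubernetes API server. net.JoinHostPort brackets the host only when it
// contains a colon (an IPv6 literal), so IPv4 addresses and DNS names are
// left as they are.
func buildMasterURL(host, port string) string {
	return "k8s://https://" + net.JoinHostPort(host, port)
}

// getMasterURLFromEnv (hypothetical) mirrors the idea of getMasterURL():
// read the API server address from the pod environment and format it.
func getMasterURLFromEnv() (string, error) {
	host := os.Getenv(hostEnvVar)
	if host == "" {
		return "", fmt.Errorf("environment variable %s is not set", hostEnvVar)
	}
	port := os.Getenv(portEnvVar)
	if port == "" {
		return "", fmt.Errorf("environment variable %s is not set", portEnvVar)
	}
	return buildMasterURL(host, port), nil
}

func main() {
	fmt.Println(buildMasterURL("105::1", "443"))
	fmt.Println(buildMasterURL("10.96.0.1", "443"))
	if masterURL, err := getMasterURLFromEnv(); err == nil {
		fmt.Println(masterURL)
	} else {
		fmt.Println("not running in a pod:", err)
	}
}
```

With this approach the same binary would work on IPv4, IPv6, and hostname-based API server addresses, without a separate code path per address family.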
We're seeing the same issue on EKS with IPv6. Has there been any progress on this?
I've tried the fix suggested by @pm-nuance, and it does get rid of the original error (available in this image: `ghcr.io/valorl/spark-on-k8s-operator:upstream-ipv6`).
However, it now throws what looks like a TLS-related error. Any suggestions?
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [spark] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:349)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:84)
at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:139)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2611)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: javax.net.ssl.SSLPeerUnverifiedException: Hostname fd00:10:96::1 not verified:
certificate: sha256/LPsv1hc3g6MBZxaVQ8orX1AMm6FAYpBEpvqCtftRzAY=
DN: CN=kube-apiserver
subjectAltNames: [fd00:10:96:0:0:0:0:1, fc00:f853:ccd:e793:0:0:0:2, 0:0:0:0:0:0:0:1, kind-control-plane, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, localhost]
at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:334)
at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:284)
at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:169)
at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:258)
at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:135)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.OIDCTokenRefreshInterceptor.intercept(OIDCTokenRefreshInterceptor.java:41)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:151)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
at okhttp3.RealCall.execute(RealCall.java:93)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:490)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:451)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:252)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:879)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:341)
... 14 more
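In the `SSLPeerUnverifiedException`, the certificate's `subjectAltNames` list does contain the API server address, but in expanded form (`fd00:10:96:0:0:0:0:1`), while the client connected to the compressed form `fd00:10:96::1`. One possible explanation, which is an assumption not confirmed in this thread, is that the hostname verifier compares the two as plain strings. This Go sketch (standard `net` package only; `sameIP` is a hypothetical helper) shows that the strings differ textually but denote the same address once parsed:

```go
package main

import (
	"fmt"
	"net"
)

// sameIP (hypothetical helper) reports whether two textual IP addresses
// denote the same address. Parsing first means compressed ("::") and
// expanded IPv6 forms compare equal, unlike a plain string comparison.
func sameIP(a, b string) bool {
	ipA, ipB := net.ParseIP(a), net.ParseIP(b)
	return ipA != nil && ipB != nil && ipA.Equal(ipB)
}

func main() {
	host := "fd00:10:96::1"       // address the client connected to (from the error)
	san := "fd00:10:96:0:0:0:0:1" // matching entry in the certificate's subjectAltNames
	fmt.Println("string equal:", host == san)
	fmt.Println("same address:", sameIP(host, san))
}
```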
Hello! We are in the same situation as @valorl.
Regarding @pm-nuance's comment:

> We need to also add square brackets([]) in entrypoint.sh where we pass the $SPARK_EXECUTOR_POD_IP.
We didn't set `SPARK_APPLICATION_ID` or `SPARK_EXECUTOR_POD_IP`. Do we need to set them? If so, how?
Thank you!
I created a PR to fix this (note that I disabled Istio sidecar injection):
- https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/pull/1825
I can confirm that this works with Spark 3.4.