spark-operator
Spark submit in operator fails
Hi all, I seem to be having some issues getting a Spark application up and running. I'm hitting errors like this:
21/06/04 07:42:53 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
21/06/04 07:42:53 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [my-ns] failed.
I have Istio on the cluster, hence I also tried the following settings, to no avail:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-pi
namespace: my-ns
spec:
type: Scala
mode: cluster
image: "gcr.io/spark-operator/spark:v3.1.1"
imagePullPolicy: Always
mainClass: org.apache.spark.examples.SparkPi
mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-v3.1.1.jar"
sparkVersion: "3.1.1"
batchScheduler: "volcano" #Note: the batch scheduler name must be specified with `volcano`
restartPolicy:
type: Never
volumes:
- name: "test-volume"
hostPath:
path: "/tmp"
type: Directory
driver:
cores: 1
coreLimit: "1200m"
memory: "512m"
labels:
version: 3.1.1
annotations:
sidecar.istio.io/inject: "false"
serviceAccount: default-editor
volumeMounts:
- name: "test-volume"
mountPath: "/tmp"
executor:
cores: 1
instances: 1
memory: "512m"
labels:
version: 3.1.1
annotations:
sidecar.istio.io/inject: "false"
volumeMounts:
- name: "test-volume"
mountPath: "/tmp"
So somehow it seems like the application is not able to communicate with the Kubernetes API. The default-editor service account has the following rules:
- apiGroups:
  - sparkoperator.k8s.io
  resources:
  - sparkapplications
  - scheduledsparkapplications
  - sparkapplications/status
  - scheduledsparkapplications/status
  verbs:
  - '*'
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
I also added the following AuthorizationPolicy to allow traffic for the webhook & operator:
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: spark-operator
  namespace: spark
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: spark-operator
  rules:
    - {}
If anyone has seen this before or has any valuable pointers, that would be much appreciated.
k8s: 1.19, version: "v1beta2-1.2.3-3.1.1", chart: 1.1.3, istio: 1.19
This PROTOCOL_ERROR might also be a pointer towards the underlying issue:
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:349)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:84)
at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:139)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2611)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: okhttp3.internal.http2.StreamResetException: stream was reset: PROTOCOL_ERROR
I've been trying to get DEBUG logs out of the driver, in the hope of gaining more insight into the issue, by setting:
spec:
  sparkConfigMap: log4j-props
and generating the ConfigMap using:
configMapGenerator:
  - files:
      - config/log4j.properties
    name: log4j-props
generatorOptions:
  disableNameSuffixHash: true
But I can't get that to work either:
**Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties**
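For reference, a minimal log4j.properties for that ConfigMap could look like the sketch below. The contents are an assumption based on Spark's default log4j template (the exact file used here isn't shown), with the root level raised to DEBUG:

# Hypothetical log4j.properties for the driver ConfigMap (assumed contents)
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Raise the Kubernetes client and HTTP loggers explicitly
log4j.logger.io.fabric8.kubernetes=DEBUG
log4j.logger.okhttp3=DEBUG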
I found that for k8s 1.19.1, the kubernetes-client has to be version >= 4.13.1 (see the compatibility matrix). Looking at the deps in the 3.1 branch of Spark I see the following:
https://github.com/apache/spark/blob/252dfd961189923e52304413036e0051346ee8e1/dev/deps/spark-deps-hadoop-3.2-hive-2.3#L170
So kubernetes-client 4.12.0 is used. To confirm: it seems that Spark does not yet support k8s 1.19. It would be great if someone could verify this.
Seems like an updated dependency version was just added to master:
https://github.com/apache/spark/blob/6f2ffccb5e17b5ee92003c86b7ec03c5344105c3/dev/deps/spark-deps-hadoop-3.2-hive-2.3#L128
The issue remains even after testing with Spark built from master. I got debug logs set up as well for further details:
21/06/09 08:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/06/09 08:31:21 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
21/06/09 08:31:21 DEBUG Config: Trying to configure client from Kubernetes config...
21/06/09 08:31:21 DEBUG Config: Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
21/06/09 08:31:21 DEBUG Config: Trying to configure client from service account...
21/06/09 08:31:21 DEBUG Config: Found service account host and port: 100.64.0.1:443
21/06/09 08:31:21 DEBUG Config: Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt].
21/06/09 08:31:21 DEBUG Config: Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
21/06/09 08:31:21 DEBUG Config: Trying to configure client namespace from Kubernetes service account namespace path...
21/06/09 08:31:21 DEBUG Config: Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
21/06/09 08:31:21 DEBUG Config: Trying to configure client from Kubernetes config...
21/06/09 08:31:21 DEBUG Config: Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
21/06/09 08:31:21 DEBUG Config: Trying to configure client from service account...
21/06/09 08:31:21 DEBUG Config: Found service account host and port: 100.64.0.1:443
21/06/09 08:31:21 DEBUG Config: Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt].
21/06/09 08:31:21 DEBUG Config: Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
21/06/09 08:31:21 DEBUG Config: Trying to configure client namespace from Kubernetes service account namespace path...
21/06/09 08:31:21 DEBUG Config: Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
21/06/09 08:31:21 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
21/06/09 08:31:21 DEBUG UserGroupInformation: hadoop login
21/06/09 08:31:21 DEBUG UserGroupInformation: hadoop login commit
21/06/09 08:31:21 DEBUG UserGroupInformation: using local user:UnixPrincipal: root
21/06/09 08:31:21 DEBUG UserGroupInformation: Using user: "UnixPrincipal: root" with name root
21/06/09 08:31:21 DEBUG UserGroupInformation: User entry: "root"
21/06/09 08:31:21 DEBUG UserGroupInformation: UGI loginUser:root (auth:SIMPLE)
21/06/09 08:31:21 DEBUG HadoopDelegationTokenManager: Using the following builtin delegation token providers: hadoopfs, hbase.
21/06/09 08:31:21 INFO KubernetesClientUtils: Spark configuration files loaded from Some(/opt/spark/conf) : log4j.properties
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [my-ns] failed.
So the issue is related to https://github.com/fabric8io/kubernetes-client/issues/2212#issuecomment-628551315. In order to make it work, we had to add the following to the spark-operator, driver & executor:
env:
  - name: HTTP2_DISABLE # https://github.com/fabric8io/kubernetes-client/issues/2212#issuecomment-628551315
    value: "true"
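In case it helps others, here is a sketch of where that env block can live for the driver and executor via the SparkApplication spec (field names per the v1beta2 CRD; some operator versions also expose an envVars map instead):

# Hedged sketch of HTTP2_DISABLE on the driver and executor pods
spec:
  driver:
    env:
      - name: HTTP2_DISABLE
        value: "true"
  executor:
    env:
      - name: HTTP2_DISABLE
        value: "true"

For the operator itself, the same variable has to end up on the operator Deployment's container; depending on the chart version this may mean patching the Deployment directly rather than setting a Helm value.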
https://github.com/fabric8io/kubernetes-client/issues/3176#issuecomment-853915701 is a good write-up of the root cause.
In short, fabric8's kubernetes-client cannot communicate with a Kubernetes API server where the weak TLS cipher TLS_RSA_WITH_AES_256_GCM_SHA384 has been disabled. Disabling HTTP/2 is a workaround.
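For background only (an illustration, not taken from this cluster): on a self-managed control plane the set of accepted ciphers is governed by the kube-apiserver --tls-cipher-suites flag, for example in the static pod manifest below. On managed control planes such as EKS this flag is not user-configurable.

# Illustrative kube-apiserver static pod excerpt; the cipher list is an example, not a recommendation
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384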
@LeonardAukea could you try to run with the KUBERNETES_TLS_VERSIONS env variable set to TLSv1.2,TLSv1.3?
I expect the kubernetes-client is currently using only TLSv1.2, while the server/Istio only accepts secure cipher suites and newer clients try to use TLSv1.3 ciphers.
Thanks @slachiewicz, setting KUBERNETES_TLS_VERSIONS=TLSv1.2,TLSv1.3 also worked.
@slachiewicz @nnringit I am facing the same error when submitting a Spark app to Kubernetes. Could you please tell me where I should change or add KUBERNETES_TLS_VERSIONS=TLSv1.2,TLSv1.3?
Hi, I tried both options:
1. Setting the TLS version in the spark-operator, driver, and executor with the env variable:
   env:
     - name: KUBERNETES_TLS_VERSIONS
       value: "TLSv1.2"
2. Setting the env variable HTTP2_DISABLE="true" in the spark-operator, driver, and executor:
   env:
     - name: HTTP2_DISABLE
       value: "true"
But neither option resolves the issue. Can someone suggest what I am missing?
@LeonardAukea @DoniyorTuremuratov @slachiewicz I am also facing the same issue with the latest spark-operator... I tried setting both the KUBERNETES_TLS_VERSIONS and HTTP2_DISABLE env variables in the operator, driver, and executor, but none of them seem to work. Is there any other recommended approach that I can try?
For what it's worth, it might be related to the fact that the spark-operator image still ships kubernetes-client version 4.12.0, which only provides full support up to Kubernetes 1.18, with minimal support up to 1.22 and no support for 1.23+ (see the compatibility matrix).
root@spark-operator-674c5dc89f-htl6p:/opt/spark/work-dir# ls ../jars | grep kubernetes
kubernetes-client-4.12.0.jar
kubernetes-model-admissionregistration-4.12.0.jar
kubernetes-model-apiextensions-4.12.0.jar
kubernetes-model-apps-4.12.0.jar
kubernetes-model-autoscaling-4.12.0.jar
kubernetes-model-batch-4.12.0.jar
kubernetes-model-certificates-4.12.0.jar
kubernetes-model-common-4.12.0.jar
kubernetes-model-coordination-4.12.0.jar
kubernetes-model-core-4.12.0.jar
kubernetes-model-discovery-4.12.0.jar
kubernetes-model-events-4.12.0.jar
kubernetes-model-extensions-4.12.0.jar
kubernetes-model-metrics-4.12.0.jar
kubernetes-model-networking-4.12.0.jar
kubernetes-model-policy-4.12.0.jar
kubernetes-model-rbac-4.12.0.jar
kubernetes-model-scheduling-4.12.0.jar
kubernetes-model-settings-4.12.0.jar
kubernetes-model-storageclass-4.12.0.jar
spark-kubernetes_2.12-3.1.1.jar
The latest Kubernetes version is 1.26, with Spark 3.3.0 even supporting kubernetes-client 5.12.2. Is there a way to at least make sure that the spark-operator uses kubernetes-client 5.12.2, and try with that to see if that fixes the issue?
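One way to experiment with that (a sketch only; nothing in this thread confirms it works) would be to build a custom operator image on a Spark 3.3.x base, which bundles kubernetes-client 5.12.2, and point the Helm chart at it. The repository and tag below are placeholders, not published images:

# Hypothetical Helm values override; this image does not exist and would need to be built
image:
  repository: my-registry/spark-operator-spark330
  tag: custom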
Below is my error for visibility:
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [spark-operator] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:349)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:84)
at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:139)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2611)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketTimeoutException: timeout
at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:672)
at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:680)
at okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.java:153)
at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:131)
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:135)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.OIDCTokenRefreshInterceptor.intercept(OIDCTokenRefreshInterceptor.java:41)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:151)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
at okhttp3.RealCall.execute(RealCall.java:93)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:490)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:451)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:252)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:879)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:341)
... 14 more
23/03/27 16:15:03 INFO ShutdownHookManager: Shutdown hook called
23/03/27 16:15:03 INFO ShutdownHookManager: Deleting directory /tmp/spark-f18eea0e-6437-4444-b2fd-e429aedbf6b6
@JunaidChaudry I'm stuck at exactly the same point as you. Did you find a solution to this problem?
@LeonardAukea Can you specify how one would add an env var to the Spark Operator? I've added the HTTP2_DISABLE var to the driver and executor config, but it has had no effect. How did you add it to the operator itself?
@harshal-zetaris did you enable webhooks? I had to enable webhooks and configure the webhook.port to be 443 instead of the default 8000.
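For reference, a sketch of the corresponding Helm chart values (key names as exposed by the spark-operator chart; the default port differs between chart versions):

webhook:
  enable: true
  port: 443  # instead of the chart's default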
I had the webhooks enabled but didn't have the port configured. This solved it: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1708#issuecomment-1523442278
Wow! That worked, @JunaidChaudry. However, I'm confused as to why.
I literally spun up a whole new EKS cluster just in March this year and have been using that as our official QA cluster. Deployments there are still going as smooth as butter.
I suddenly started running into precisely this problem after I spun up another cluster a couple of days back. The interesting thing is that deployments on the old cluster are still working fine.
I read through the conversation in your linked issue, and indeed a new version of the node AMI was released on May 1, after which this issue started manifesting.
Thank you so much for your help.
I am in the exact same boat as you. It has something to do with the AWS AMI update that was received in late March. I had multiple EKS clusters, with the webhook working out of the box on all of them... UNTIL I restarted my EKS nodes and they started running with the newer AWS AMI version. I did confirm that it was unrelated to the actual kubernetes version (all versions were behaving the same)
@JunaidChaudry @harshal-zetaris @satyamsah, any luck with a fix for the SocketTimeoutException/KubernetesClientException?
@JunaidChaudry hi, I hit the same issue with "Operation: [create] for kind: [Pod] with name: [null] in namespace: [spark-operator] failed". I didn't use Helm to install the operator; instead I pulled the operator image and loaded it onto our container platform. I'm not sure whether I have the webhook enabled. Do you have any idea? Thanks.