
ConfigMap and volume mounts are not loading

sohnaeo opened this issue 2 years ago · 15 comments

Hi,

We recently upgraded our cluster from Kubernetes 1.15.1 to 1.21.5. Now the Spark operator is not loading ConfigMaps and volume mounts.

Deployment mode: Helm chart

Image: v1beta2-1.1.2-2.4.5 (we don't want to use the latest image, as our developers would like to stick with Spark 2.4.7)

Kubernetes Version: 1.21.5

Helm command to install:

```sh
helm install spark-operator --namespace pipeline-qa --set image.tag=v1beta2-1.1.2-2.4.5 --set webhook.enable=true -f values.yaml .
```

The Spark operator pod starts successfully after the webhook-init pod completes.

Interestingly, it picks up the Secret but not the ConfigMap, even though the ConfigMap exists.

The error is below:

```
22/03/18 10:33:55 WARN NativeCodeLoader:main: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/03/18 10:34:22 INFO EventProcessor:main: appName: spark-event-processor
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'ORACLE_URL'
```
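Since `ConfigException$Missing` means the environment variable never reached the JVM, a quick sanity check is whether the webhook actually injected the env vars and volume mounts into the driver pod. A sketch, using the namespace and driver pod name from the manifest below:

```sh
# If the webhook mutation ran, the env vars and volume mounts from the
# SparkApplication spec should show up in the generated driver pod.
kubectl -n pipeline-qa get pod spark-event-processor-driver \
  -o jsonpath='{.spec.containers[0].env}'
kubectl -n pipeline-qa get pod spark-event-processor-driver \
  -o jsonpath='{.spec.containers[0].volumeMounts}'
```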

apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: spark-event-processor spec: type: Scala mode: cluster image: xxxxxxxxxxxxxxxx/xxxx/spark-k8-2.4.7:latest imagePullPolicy: Always mainClass: xxxxx mainApplicationFile: "local:///usr/pre-enhancer.jar" sparkVersion: "2.4.7" restartPolicy: type: Never volumes: - name: "test-volume" nfs: path: /nfs/k8-files server: test sparkConf: "spark.ui.port": "4045" "spark.executor.extraJavaOptions": "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.xml" "spark.zookeeper.refdata.znode": "/datetime"

driver: cores: 1 coreLimit: "1000m" memory: "512m" podName: "spark-event-processor-driver" labels: version: 2.4.7 serviceAccount: spark envSecretKeyRefs: ORACLE_PASSWORD: name: spark-database-envars-secrets key: ORACLE_PASSWORD

env:
 - name: ORACLE_URL
    valueFrom:
      configMapKeyRef:
        name: database-envars-configmap
        key: ORACLE_URL
        
  - name: ORACLE_USER
    valueFrom:
      configMapKeyRef:
        name: database-envars-configmap
        key: ORACLE_USER`
        
 
volumeMounts:
  - name: "test-volume"
    mountPath: "/td-files"

executor: cores: 1 instances: 1 memory: "512m" labels: version: 2.4.7 volumeMounts: - name: "test-volume" mountPath: "/td-files"
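For completeness, the mutation above also presupposes that the referenced ConfigMap exists in the job namespace. A minimal sketch of what `database-envars-configmap` might look like (the key names come from the manifest above; the values are placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: database-envars-configmap
  namespace: pipeline-qa
data:
  ORACLE_URL: "jdbc:oracle:thin:@//db.example.com:1521/ORCL"  # placeholder
  ORACLE_USER: "example_user"                                 # placeholder
```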

sohnaeo avatar Mar 18 '22 00:03 sohnaeo

I tracked it down: loading volumes/ConfigMaps requires the webhook to be enabled. It is enabled in my install, but it still wasn't working.

I looked at the apiserver logs and caught the issue; it seems Kubernetes versions > 1.19 no longer accept certificates that only set the Common Name:

```
x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0
```
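One way to confirm this is to inspect the serving certificate the operator generated and look for a Subject Alternative Name. A sketch, assuming the secret name `spark-webhook-certs` and key `server-cert.pem` (both assumptions; check your release):

```sh
# Dump the webhook serving cert; if it carries no SAN extension and only a
# Subject CN, apiservers on Kubernetes > 1.19 will reject it.
kubectl -n pipeline-qa get secret spark-webhook-certs \
  -o jsonpath='{.data.server-cert\.pem}' | base64 -d \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
```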

Putting the snippet below in my apiserver manifest fixes the issue:

```yaml
env:
  - name: GODEBUG
    value: "x509ignoreCN=0"
```

My question is: how can I avoid this issue? Could someone please shed light on this? I don't want to set GODEBUG on the apiservers.

sohnaeo avatar Mar 18 '22 07:03 sohnaeo

For anyone who wants to use an old Spark version with a recent Kubernetes cluster, I followed the steps below to fix this (a build-and-push sketch follows the list):

  1. Change hack/gencerts.sh to include the subjectAltName

  2. Build the Docker image

  3. Push the image to your registry

  4. Install the spark-operator with the new image

All worked as expected.
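A rough sketch of steps 2-4, assuming a hypothetical registry path `registry.example.com/spark-operator` and a custom tag (both placeholders):

```sh
# Build the operator image with the patched gencerts.sh and push it.
docker build -t registry.example.com/spark-operator:v1beta2-1.1.2-2.4.5-san .
docker push registry.example.com/spark-operator:v1beta2-1.1.2-2.4.5-san

# Point the Helm chart at the custom image when installing.
helm install spark-operator --namespace pipeline-qa \
  --set image.repository=registry.example.com/spark-operator \
  --set image.tag=v1beta2-1.1.2-2.4.5-san \
  --set webhook.enable=true -f values.yaml .
```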

sohnaeo avatar Mar 21 '22 00:03 sohnaeo

@sohnaeo, I have a very similar issue to the one you mentioned above; we are also upgrading our cluster to 1.21.5,

and while running Spark operator jobs, the operator is not loading ConfigMaps and volume mounts.

I read that you tweaked the apiserver. I am new to Kubernetes and did not fully understand where exactly you modified the apiserver. Also, regarding the gencerts.sh changes, could you elaborate on how you did this?

Thank you, Rahul

rahulkishore22 avatar Mar 22 '22 13:03 rahulkishore22

@rahulkishore22

Recap of the issue

The webhook is required to load ConfigMaps and volume mounts. When you enable the webhook in older versions of the Spark operator, it generates certificates that identify the webhook service only by their Common Name and stores them in a Secret. Newer Kubernetes versions no longer accept Common Name matching, so the apiserver rejects the webhook requests, and that is why ConfigMaps/volumes don't get mounted.

Newer operator versions support SANs (Subject Alternative Names).

There are two options to fix the issue

First Approach

Configure the API server to ignore the CN check. I am not sure how you installed Kubernetes, but below are the steps to configure the apiserver:

Log in to the master nodes, `cd /etc/kubernetes/manifests`, and edit `kube-apiserver.yml`. Add the snippet below after the image or imagePullPolicy line:

```yaml
env:
  - name: GODEBUG
    value: "x509ignoreCN=0"
```

Save the file; the apiserver restarts within 40-50 seconds, and everything should work as expected.
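For context, a minimal sketch of where that env block lands in the static pod manifest (container name, image tag, and flags abbreviated; exact contents vary by distribution):

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yml (excerpt)
spec:
  containers:
    - name: kube-apiserver
      image: k8s.gcr.io/kube-apiserver:v1.21.5
      env:
        - name: GODEBUG
          value: "x509ignoreCN=0"  # re-enables legacy CN matching; weakens TLS validation
      command:
        - kube-apiserver
        # ...existing flags unchanged...
```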

Second Approach

Download the source code (in my case it is v1beta2-1.2.0-3.0.0) and change hack/gencerts.sh to enable SANs. Find the line:

```
extendedKeyUsage = clientAuth, serverAuth
```

Add the following line after it:

```
subjectAltName = DNS:${SERVICE}.${NAMESPACE}.svc
```

Also change the line below:

```sh
openssl req -x509 -new -nodes -key ${TMP_DIR}/ca-key.pem -days 100000 \
  -out ${TMP_DIR}/ca-cert.pem -subj "/CN=${SERVICE}.${NAMESPACE}.svc"
```
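Combining the two edits, the CSR extension config that gencerts.sh writes would look roughly like this (a hypothetical sketch, not a verbatim copy of the file; `TMP_DIR`, `SERVICE`, and `NAMESPACE` are the script's own variables):

```sh
# Hypothetical excerpt of hack/gencerts.sh after the edit: the shell
# variables expand so the SAN matches the webhook Service's DNS name.
cat > ${TMP_DIR}/server.conf << EOF
[v3_req]
extendedKeyUsage = clientAuth, serverAuth
subjectAltName = DNS:${SERVICE}.${NAMESPACE}.svc
EOF
```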

Build the Docker image, push it into your local repository, and everything should work as expected. With this approach, you don't need to change the API server.

I went this way because I didn't want to disable the CN check, due to the security implications.

sohnaeo avatar Mar 23 '22 01:03 sohnaeo

We have the same problem with missing ConfigMaps and volume mounts in our launched Spark pods. We are using version v1beta2-1.3.3-3.1.1 with Kubernetes v1.22.7 (microk8s). The changes to hack/gencerts.sh (sohnaeo's "second approach") are already present in that version of the spark-operator. Any ideas what else could be the problem?

waras2017 avatar Mar 23 '22 19:03 waras2017

@waras2017

Did you enable the webhook? Without the webhook enabled, ConfigMaps/volume mounts won't be loaded. I am installing through the Helm chart; below is the command:

```sh
helm-v3.6.3 install sparkoperator --namespace pipeline-qa --set sparkJobNamespace=pipeline-qa --set webhook.enable=true -f values.yaml .
```

If it still doesn't work despite the webhook being enabled, please check the apiserver logs; they will show whether there is any error related to the webhook.
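Besides the apiserver logs, it can help to confirm that the mutating webhook was actually registered with the cluster. A sketch (the object name varies by chart release, hence the placeholder):

```sh
# The operator registers a MutatingWebhookConfiguration when
# webhook.enable=true; it should appear in this list.
kubectl get mutatingwebhookconfigurations

# Inspect it: verify clientConfig.service points at the operator's webhook
# Service and that caBundle is populated.
kubectl get mutatingwebhookconfiguration <spark-operator-webhook-name> -o yaml
```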

sohnaeo avatar Mar 24 '22 02:03 sohnaeo

@sohnaeo Yes, we switched on the mutating webhook, but it did not work. We searched for a very long time and finally found the problem: it is the Kubernetes version! 1.22.7 does not work (no volume mounts and no ConfigMaps) with spark-operator v1beta2-1.3.3-3.1.1. We changed the Kubernetes version to 1.21.7 and Spark starts including ConfigMaps! So in our case it was not a certificate problem. Kubernetes 1.22+ removed and changed some APIs, and it seems the current version of the spark-operator does not fully support these changes.

waras2017 avatar Mar 24 '22 19:03 waras2017

@waras2017

Did you check the apiserver logs? There must be an error or message in the logs that can point you somewhere. It seems you will have to wait for a new version of the Spark operator that supports Kubernetes 1.22.7.

sohnaeo avatar Mar 25 '22 00:03 sohnaeo

> @sohnaeo Yes, we switched on the mutating webhook, but it did not work. [...] 1.22.7 does not work (no volume mounts and no ConfigMaps) with spark-operator v1beta2-1.3.3-3.1.1. [...] It seems that the current version of the spark-operator does not support these API changes completely.

In my project, we have tested the Spark operator with app version v1beta2-1.3.3-3.1.1 (Helm chart version 1.1.19-5) on Kubernetes 1.22.4, and we did not face similar issues.

indranilr avatar Apr 18 '22 07:04 indranilr

Hello!

Following up on @waras2017's, @sohnaeo's, and @indranilr's answers above: I have the same issue. I tested the Spark Operator Docker image at version v1beta2-1.3.3-3.1.1 with Helm chart versions 1.1.15 and 1.1.19. I'm using Kubernetes 1.21.9 in an Azure Kubernetes Service (AKS) environment.

Looking at the apiserver logs, I didn't see much, but I found this error (although I couldn't trace whether it relates to the Spark Operator):

```
E0424 17:22:09.386423 1 dispatcher.go:130] failed calling webhook "check-ignore-label.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s": x509: certificate signed by unknown authority
```

Still trying to figure out what's going on... if you have any advice, I'd really appreciate it :)

lukasmeirelles avatar Apr 24 '22 18:04 lukasmeirelles

@sohnaeo, @indranilr

We have been able to solve our problem in the meantime! It was not a problem with the Kubernetes version after all, but with the framework used to install Kubernetes.

We had initially used microk8s for the installation, and that is where we encountered the problem with ConfigMaps and volumes: they could not be successfully mounted into the Spark pods. In the logs we found the following:

"microk8s.daemon-kubelite[xxxxxxx]: W0324 08:05:22.229782 4193638 dispatcher.go:176] Failed calling webhook, failing open webhook.sparkoperator.k8s.io: failed calling webhook "webhook.sparkoperator.k8s.io": failed to call webhook: Post "https://xyz-spark-operator-webhook.spark-operator.svc:443/webhook?timeout=30s": tls: server chose an unconfigured cipher suite"

We then used kubespray instead of microk8s to install our Kubernetes cluster. With that, the ConfigMaps and volume mounts worked without any problems!

waras2017 avatar Jun 24 '22 12:06 waras2017

@sohnaeo I am currently running v1beta2-1.3.8-3.1.1 and I do see the changes for the second approach you mentioned already in gencerts.sh, but I am still facing the same issue. I am running Kubernetes v1.23. Any ideas?

JunaidChaudry avatar Apr 12 '23 18:04 JunaidChaudry

> @sohnaeo I am currently running v1beta2-1.3.8-3.1.1 and I do see the changes for the second approach you mentioned already in gencerts.sh, but I am still facing the same issue. I am running Kubernetes v1.23. Any ideas?

Did you check the API server logs? There must be errors there that can give you an idea of what needs to be done.

sohnaeo avatar Apr 13 '23 06:04 sohnaeo

It seems that changing webhook.port to 443 fixes it: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1708#issuecomment-1523442278
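For reference, a sketch of applying that setting through the Helm chart (assuming the chart exposes a `webhook.port` value, as in the linked issue; the other flags mirror the install commands earlier in this thread):

```sh
# Re-deploy the operator with the webhook served on port 443.
helm upgrade --install spark-operator --namespace pipeline-qa \
  --set webhook.enable=true \
  --set webhook.port=443 \
  -f values.yaml .
```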

JunaidChaudry avatar Apr 27 '23 15:04 JunaidChaudry

The solution @JunaidChaudry wrote worked for me too. 🙌 Thanks for the tip.

lfreinag avatar Jan 18 '24 16:01 lfreinag