
Config Maps and Volumes are not getting mounted

Open · infa-madhanb opened this issue 3 years ago • 32 comments

Our GKE cluster is running Kubernetes v1.21.14. Pods were running fine until yesterday; now ConfigMaps and volumes are not getting mounted.

Deployment Mode: Helm Chart

Helm Chart Version: 1.1.0

Image: v1beta2-1.2.3-3.1.1

Kubernetes Version: 1.21.14

Helm command to install: helm install spark-operator --namespace *** --set image.tag=v1beta2-1.2.3-3.1.1 --set webhook.enable=true -f values.yaml

The Spark operator pod starts successfully after the webhook-init pod completes.

But my application pod, launched via the Spark operator, is unable to come up due to the error below:

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    49m                default-scheduler  Successfully assigned pod/re-driver to gke...w--taints-6656b326-49of
  Warning  FailedMount  49m                kubelet            MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-8c0f12839ca69805-conf-map" not found
  Warning  FailedMount  27m (x3 over 40m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[re-checkpoint], unattached volumes=[spark-conf-volume-driver kube-api-access-lsflz re-checkpoint app-conf-vol cert-secret-volume spark-local-dir-1]: timed out waiting for the condition
  Warning  FailedMount  20m (x2 over 45m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[re-checkpoint], unattached volumes=[kube-api-access-lsflz re-checkpoint app-conf-vol cert-secret-volume spark-local-dir-1 spark-conf-volume-driver]: timed out waiting for the condition
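
For reference, you can confirm whether the driver ConfigMap was ever created by querying it directly (the ConfigMap and pod names below are taken from the events above; replace the namespace with your own):

kubectl get configmap spark-drv-8c0f12839ca69805-conf-map --namespace <your ns>
kubectl get events --namespace <your ns> --field-selector involvedObject.name=re-driver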

infa-madhanb avatar Oct 03 '22 07:10 infa-madhanb

Is there a solution?

jiamin13579 avatar Oct 18 '22 04:10 jiamin13579

Any update?

Fiorellaps avatar Nov 10 '22 11:11 Fiorellaps

MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-a4a28f849e410e3b-conf-map" not found FailedMount

happens with default settings.

Elsayed91 avatar Nov 22 '22 07:11 Elsayed91

Seeing the exact error mentioned by @Liftingthedata as well, even after binding the cluster-admin ClusterRole to the ServiceAccount created by the chart installation (via a new ClusterRoleBinding) and specifying that ServiceAccount in examples/spark-pi.yaml.

Executing kubectl describe sparkapplication spark-pi --namespace <your ns> reveals that it is the spark-pi-driver that is failing. Inspecting the spark-pi-driver pod with kubectl describe pod spark-pi-driver --namespace <your ns> shows the kubelet MountVolume.SetUp failure message directly after pulling image "gcr.io/spark-operator/spark:v3.1.1". Is this perhaps an error within the image being pulled?

Please help!

jnkroeker avatar Dec 09 '22 00:12 jnkroeker

I encountered a similar issue. The problem frequently happens when the Spark operator is under-provisioned or under high load.

pradithya avatar Dec 12 '22 06:12 pradithya

@pradithya could you please share your node configuration?

jnkroeker avatar Dec 13 '22 00:12 jnkroeker

Is there any solution for this? I am also facing a similar issue. I am not sure, but I think the driver pod is trying to mount the ConfigMap before it has been created, which is why the event reports that the ConfigMap is not found.

Warning FailedMount 66s kubelet MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-f42632859b918eee-conf-map" not found

sunnysmane avatar Jan 10 '23 12:01 sunnysmane

I began to encounter it constantly once I introduced environment variables into my Scala script (and modified the K8s manifest accordingly). I haven't found out how to solve it yet.

EDIT: quite a late update, but this was mainly memory leaks (insufficient resources) on my side :)

RSKriegs avatar Feb 14 '23 00:02 RSKriegs

Hi guys, any solution for this issue?

ericklcl avatar Mar 08 '23 22:03 ericklcl

Hi All,

We experienced this when we migrated to the Helm chart installation of the Spark operator: our volumes were configured correctly via ConfigMaps, but Kubernetes was erroring out when mounting them.

Make sure you have the following settings enabled in the Helm chart values below:

webhook:
  # -- Enable webhook server
  enable: true
  namespaceSelector: "spark-webhook-enabled=true"

Then label the spark namespace (or target namespace for your spark jobs) with:

spark-webhook-enabled=true
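
For example, assuming the target namespace is called spark (as in the manifest below), the label can be applied with:

kubectl label namespace spark spark-webhook-enabled=true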

We found that was enough to get it working.

Application side

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
 name: example-test
 namespace: spark
spec:
 schedule: "31 12 * * *"
 concurrencyPolicy: Allow
 template:
   timeToLiveSeconds: 1200
   type: Python
   arguments:
     - --config-file-name=/opt/spark/work-dir/config/config.ini
   sparkConf:
     spark.kubernetes.decommission.script: "/opt/decom.sh"
     .  . .
   hadoopConf:
     fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
     . . .
   mode: cluster
   imagePullPolicy: Always
   mainApplicationFile: local:///opt/spark/work-dir/run.py
   sparkVersion: "3.2.1"
   restartPolicy:
       type: Never
   driver:
     cores: 1
     coreLimit: "500m"
     memory: "500m"
     labels:
       version: 3.2.1
     serviceAccount: job-tole
     volumeMounts:
       - name: "config"
         mountPath: "/opt/spark/work-dir/config"
   executor:
     cores: 1
     instances: 1
     memory: "500m"
     labels:
       version: 3.2.1
     volumeMounts:
       - name: "config"
         mountPath: "/opt/spark/work-dir/config"
   volumes:
     - name: "config"
       configMap:
         name: "config"
         items:
           - key: "config.ini"
             path: "config.ini" 

Please note: if you don't use the Helm chart, you still need to enable the webhook; otherwise the spark-operator won't be able to create the right ConfigMaps and volume mounts on the driver and executor pods when they are spawned.
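
For a non-Helm install this typically means passing the webhook flag to the operator Deployment itself. A minimal sketch of the relevant container args, assuming the pre-2.x operator binary and its -enable-webhook flag (check the flags supported by your operator version):

# excerpt from the operator Deployment spec
containers:
  - name: spark-operator
    image: gcr.io/spark-operator/spark-operator:v1beta2-1.2.3-3.1.1
    args:
      - -logtostderr
      - -enable-webhook=true   # required so the operator patches volumes/ConfigMaps into pods
      - -v=2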

GaryLouisStewart avatar Mar 15 '23 12:03 GaryLouisStewart

I have enabled the webhook and am using the namespaceSelector with the correct selectors, but I am still having the issue. Any ideas? I am using the latest version. I have also tried enabling the webhook on all namespaces, but I still face the same issue.

I am also unable to use tolerations.

JunaidChaudry avatar Apr 12 '23 18:04 JunaidChaudry

It does not work for me either. I followed the steps mentioned above but am still getting the same error.

balkrishan333 avatar Jun 09 '23 10:06 balkrishan333

I have enabled the webhook and am using the namespaceSelector with the correct selectors, but I am still having the issue. Any ideas? I am using the latest version. I have also tried enabling the webhook on all namespaces, but I still face the same issue.

I am also unable to use tolerations.

Did you manage to solve this issue?

balkrishan333 avatar Jun 09 '23 10:06 balkrishan333

The spark-pi job just hangs, either before the driver is initialized or after the driver starts running. I do see the config-map mount error in the driver's events, but the ConfigMap does get created afterwards. Is this a resource problem? I'm running this on minikube with 4 CPUs and 8 GB of memory!

percymehta avatar Jun 14 '23 17:06 percymehta

My issue was the webhook port. For some reason it no longer runs on the default port, so I had to update the port to 443 based on the docs here, even though I'm on EKS instead of GKE.
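
For anyone else hitting this, the change amounts to chart values along these lines (the same webhook keys appear in the values.yaml posted later in this thread; exact key names may differ between chart versions):

webhook:
  enable: true
  port: 443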

JunaidChaudry avatar Jun 20 '23 16:06 JunaidChaudry

My issue was the webhook port. For some reason it no longer runs on the default port, so I had to update the port to 443 based on the docs here, even though I'm on EKS instead of GKE.

Thank you. I am using AKS and it worked for me as well.

balkrishan333 avatar Jul 20 '23 13:07 balkrishan333

I've done everything mentioned here with no success

davidmirror-ops avatar Dec 15 '23 12:12 davidmirror-ops

The webhook only adds volumes if the driver/executor has a volumeMount for them: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/pkg/webhook/patch.go#L138-L143. The same goes for configMaps https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/pkg/webhook/patch.go#L335C1-L339

The code doesn't check whether a driver/executor initContainer or sidecar mounts the volumes. As a workaround, you just have to add the volumeMounts directly to the driver/executor spec as well, as sketched below.
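
A minimal sketch of that workaround, using hypothetical names (config-vol, my-config, my-sidecar): the volume is only consumed by a sidecar, so the same volumeMount is repeated on the driver and executor specs to make the webhook patch the volume into the pods:

spec:
  volumes:
    - name: config-vol
      configMap:
        name: my-config
  driver:
    sidecars:
      - name: my-sidecar              # the container that actually needs the volume
        image: busybox
        volumeMounts:
          - name: config-vol
            mountPath: /etc/my-config
    volumeMounts:                     # duplicated here so the webhook adds the volume to the pod
      - name: config-vol
        mountPath: /etc/my-config
  executor:
    volumeMounts:
      - name: config-vol
        mountPath: /etc/my-config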

jalkjaer avatar Jan 18 '24 16:01 jalkjaer

Bump, is there any other possible solution? I have tried all of the above with no success. I'm including my configuration in case it helps. Using helm chart 1.1.27 and v1beta2-1.3.8-3.1.1.

values.yaml

# https://github.com/kubeflow/spark-operator/tree/master/charts/spark-operator-chart
nameOverride: spark-operator
fullnameOverride: spark-operator

image:
  # -- Image repository
  repository: ghcr.io/googlecloudplatform/spark-operator
  # -- Image pull policy
  pullPolicy: IfNotPresent
  # -- if set, override the image tag whose default is the chart appVersion.
  tag: "v1beta2-1.3.8-3.1.1"

imagePullSecrets: 
  - name: regcred

sparkJobNamespace: spark-operator

resources:
  limits:
    cpu: 1
    memory: 512Mi
  requests:
    cpu: 1
    memory: 512Mi

webhook:
  enable: true
  port: 443
  namespaceSelector: "spark-webhook-enabled=true"

SparkApplication manifest

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi2
  namespace: spark-operator
spec:
  type: Scala
  mode: cluster
  image: "apache/spark:3.4.2"
  imagePullPolicy: IfNotPresent
  imagePullSecrets:
    - regcred
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.4.2.jar"
  sparkVersion: "3.4.2"
  timeToLiveSeconds: 600
  restartPolicy:
    type: Never
  volumes:
    - name: config-vol
      configMap:
        name: cm-spark-extra
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.4.2
    serviceAccount: airflow-next
    volumeMounts:
      - name: config-vol
        mountPath: /mnt/cm-spark-extra
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.4.2
    volumeMounts:
      - name: config-vol
        mountPath: /mnt/cm-spark-extra

Here is the container and volume spec of the pod being spun up

spec:
  volumes:
    - name: aws-iam-token
      projected:
        sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com
              expirationSeconds: 86400
              path: token
        defaultMode: 420
    - name: spark-local-dir-1
      emptyDir: {}
    - name: spark-conf-volume-driver
      configMap:
        name: spark-drv-b14d8f8f2b497a58-conf-map
        items:
          - key: spark.properties
            path: spark.properties
            mode: 420
        defaultMode: 420
    - name: kube-api-access-tf4pb
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  containers:
    - name: spark-kubernetes-driver
      image: apache/spark:3.4.2
      args:
        - driver
        - '--properties-file'
        - /opt/spark/conf/spark.properties
        - '--class'
        - org.apache.spark.examples.SparkPi
        - local:///opt/spark/examples/jars/spark-examples_2.12-3.4.2.jar
      ports:
        - name: driver-rpc-port
          containerPort: 7078
          protocol: TCP
        - name: blockmanager
          containerPort: 7079
          protocol: TCP
        - name: spark-ui
          containerPort: 4040
          protocol: TCP
      env:
        - name: SPARK_USER
          value: root
        - name: SPARK_APPLICATION_ID
          value: spark-2d80cebdab33400b83cbfe61fd09faee
        - name: SPARK_DRIVER_BIND_ADDRESS
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: SPARK_LOCAL_DIRS
          value: /var/data/spark-1505b0f4-c95a-4bba-aa76-9451b34af9ea
        - name: SPARK_CONF_DIR
          value: /opt/spark/conf
        - name: AWS_STS_REGIONAL_ENDPOINTS
          value: regional
        - name: AWS_DEFAULT_REGION
          value: us-east-1
        - name: AWS_REGION
          value: us-east-1
        - name: AWS_ROLE_ARN
          value: arn:aws:iam::123456:role/my-sa
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      resources:
        limits:
          cpu: 1200m
          memory: 896Mi
        requests:
          cpu: '1'
          memory: 896Mi
      volumeMounts:
        - name: spark-local-dir-1
          mountPath: /var/data/spark-1505b0f4-c95a-4bba-aa76-9451b34af9ea
        - name: spark-conf-volume-driver
          mountPath: /opt/spark/conf
        - name: kube-api-access-tf4pb
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        - name: aws-iam-token
          readOnly: true
          mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent

I am at a loss here. spark-conf-volume-driver is being set up from spark-drv-b14d8f8f2b497a58-conf-map, and I can see that ConfigMap among my Kubernetes ConfigMaps. I am on EKS, have set webhook.enable to true, and have set the port to 443. I also applied the workaround and configured a volumeMount under both the driver and executor, although I do not see it on my pod; I do not think that matters, though. As far as I know, everything is configured correctly. Can somebody help?

BCantos17 avatar Apr 29 '24 19:04 BCantos17

I'm having the same issue on OCI. I've followed the same steps.

luis-fnogueira avatar May 14 '24 22:05 luis-fnogueira

I also noticed this problem with a very limited CPU limit (resources.limits.cpu: "100m") and two concurrent Spark apps. It is very consistent, and in that case the ConfigMap for the driver was created for only one of the apps. After updating the resources (a request of 1 CPU and no limit) this odd behavior disappeared.

thof avatar Jun 07 '24 16:06 thof

This still happens; when using spark-submit with a high number of submissions it sometimes occurs.

dannyeuu avatar Jul 26 '24 16:07 dannyeuu

@dannyeuu Try increasing the Spark operator's CPU request/limit. I encountered this issue when the operator was experiencing high utilization/throttling.
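
For example, via the chart's resources values (the same key shown in the values.yaml earlier in this thread; the numbers here are only illustrative):

resources:
  requests:
    cpu: 2
    memory: 1Gi
  limits:
    memory: 1Gi   # intentionally no CPU limit, to avoid throttling the operator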

pradithya avatar Jul 26 '24 17:07 pradithya

Curious if this shares a root cause with another issue I saw. Does anyone see client-side throttling logs for the operator? They should look something like this:

Waited for ... due to client-side throttling, not priority and fairness ...
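
You can check for these with something like the following, assuming the operator Deployment is named spark-operator and runs in the spark-operator namespace (adjust to your install):

kubectl logs deployment/spark-operator -n spark-operator | grep -i "client-side throttling"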

jacobsalway avatar Jul 28 '24 21:07 jacobsalway

I faced this issue when using an initContainer. If I don't use an initContainer, there is no error.

cometta avatar Aug 15 '24 06:08 cometta