
Tolerations are not passed to Driver and Executor pods

Open · amiyajena1993 opened this issue · 7 comments

Spark operator image version: v1beta2-1.3.7-3.1.1, Kubernetes version: 1.26

Webhook section of my Helm values:

webhook:

  # -- Enable webhook server
  enable: true
  # -- Webhook service port
  port: 443
  # -- The webhook server will only operate on namespaces with this label, specified in the form key1=value1,key2=value2.
  # Empty string (default) will operate on all namespaces
  namespaceSelector: app=ns-rm-dp-spark-operator ,app=ns-rm-dp-spark-jobs
  # namespaceSelector: "spark-webhook-enabled=true"
  # -- The annotations applied to the cleanup job, required for helm lifecycle hooks
  cleanupAnnotations:
    "helm.sh/hook": pre-delete, pre-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded

Hello team, I have enabled the webhook in my Spark operator deployment, but despite that, the tolerations are not applied to my driver and executor pods.
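A side note on the namespaceSelector above: a namespace label key can carry only one value, so if the chart ANDs the comma-separated pairs (as the key1=value1,key2=value2 form in the comment suggests), app=ns-rm-dp-spark-operator,app=ns-rm-dp-spark-jobs can never match any namespace and the webhook would mutate nothing. A minimal sketch of an alternative, using the shared label already hinted at in the commented-out line (the label name and namespaces here are taken from the values above):

# Give both namespaces one common label first:
#   kubectl label namespace ns-rm-dp-spark-operator spark-webhook-enabled=true
#   kubectl label namespace ns-rm-dp-spark-jobs spark-webhook-enabled=true
webhook:
  enable: true
  # Select on a single key=value pair that both namespaces carry
  namespaceSelector: spark-webhook-enabled=true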

Spark job YAML:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: sparkpi-test9227
  namespace: ns-rm-dp-spark-jobs
spec:
  type: Scala
  mode: cluster
  image: " "
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  arguments:
  - "30000"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    coreLimit: "4000m"
    memory: "4g"
    labels:
      version: 3.1.1
    serviceAccount: svc-rm-dp-spark
    tolerations:
    - effect: NoSchedule
      key: pfamily
      operator: Equal
      value: "aiplatform"
  executor:
    tolerations:
    - key: "pfamily"
      operator: "Equal"
      value: "aiplatform"
      effect: "NoSchedule"
    cores: 1
    coreLimit: "4000m"
    instances: 3
    memory: "4g"
    labels:
      version: 3.1.1
  sparkConf:
   "spark.executor.extraJavaOptions": "-Djava.net.preferIPv6Addresses=true -Dlog4j.debug=true -Dcom.amazonaws.sdk.disableCertChecking=true"
   "spark.driver.extraJavaOptions": "-Djava.net.preferIPv6Addresses=true -Dlog4j.debug=true -Dcom.amazonaws.sdk.disableCertChecking=true"

Issue: the driver pod stays in Pending state because the tolerations are missing.

Error: 0/50 nodes are available: 1 node(s) were unschedulable, 10 node(s) had untolerated taint {pfamily: symops}, 13 node(s) had untolerated taint {pfamily: aiplatform}, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 6 node(s) had untolerated taint {pfamily: penableric}, 9 node(s) had untolerated taint {pfamily: symplanbizsec}, 9 node(s) had untolerated taint {pfamily: symplatform}. preemption: 0/50 nodes are available: 50 Preemption is not helpful for scheduling..
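For anyone debugging this, a quick check of whether the mutating webhook actually patched the driver pod (this assumes the operator's usual <app-name>-driver pod naming; adjust the pod and namespace names to your cluster):

# Print the tolerations the scheduler actually sees on the driver pod
kubectl -n ns-rm-dp-spark-jobs get pod sparkpi-test9227-driver \
  -o jsonpath='{.spec.tolerations}'

# If that prints nothing, verify the webhook is registered at all
kubectl get mutatingwebhookconfigurations | grep -i spark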

amiyajena1993, Nov 14 '23

I have the same problem: no tolerations on the driver pod.

piermotte, Dec 21 '23

I see the same issue: the tolerations are present when I describe the SparkApplication, but the spark-submit command ignores them and the driver/executor pods come up without any tolerations attached.

zy-wiser, Dec 27 '23

I have the same problem too.

ahululu, Feb 28 '24

Did you find any workaround to this problem?

biljicmarko, Apr 15 '24

I ran into the same issue and enabling the webhooks fixed the problem.

values: {
  webhook: {
    enable: true,
  },
}

Chart: spark-operator-1.2.14
App version: v1beta2-1.4.5-3.5.0
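For a Helm-based install, a flag-based sketch of the same fix (the release name, namespace, and chart repo alias here are assumptions; webhook.enable is the key shown in the values above):

helm upgrade --install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --set webhook.enable=true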

pasdoy, May 04 '24

As @pasdoy mentioned, tolerations are patched in by the webhook. Enable it in your values.yaml; this issue can be closed.

imtzer, May 30 '24