Tolerations are not being passed to Driver and Executor pods
Spark operator image version: v1beta2-1.3.7-3.1.1, K8s version: 1.26
-----------------------------------:
# -- Enable webhook server
enable: true
# -- Webhook service port
port: 443
# -- The webhook server will only operate on namespaces with this label, specified in the form key1=value1,key2=value2.
# Empty string (default) will operate on all namespaces
namespaceSelector: app=ns-rm-dp-spark-operator,app=ns-rm-dp-spark-jobs
#namespaceSelector: "spark-webhook-enabled=true"
# -- The annotations applied to the cleanup job, required for helm lifecycle hooks
cleanupAnnotations:
"helm.sh/hook": pre-delete, pre-upgrade
"helm.sh/hook-delete-policy": hook-succeeded
Hello Team, I have enabled the webhook in my Spark operator pod, but despite that I am unable to apply tolerations to my driver and executor pods.
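Since the namespaceSelector above restricts the webhook to namespaces that carry the listed labels, one check that may help (a hedged sketch, with the label key/value taken from the selector above) is to confirm the jobs namespace is actually labelled to match:

kubectl get namespace ns-rm-dp-spark-jobs --show-labels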
Spark Job YAML:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: sparkpi-test9227
  namespace: ns-rm-dp-spark-jobs
spec:
  type: Scala
  mode: cluster
  image: " "
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  arguments:
    - "30000"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    coreLimit: "4000m"
    memory: "4g"
    labels:
      version: 3.1.1
    serviceAccount: svc-rm-dp-spark
    tolerations:
      - effect: NoSchedule
        key: pfamily
        operator: Equal
        value: "aiplatform"
  executor:
    tolerations:
      - key: "pfamily"
        operator: "Equal"
        value: "aiplatform"
        effect: "NoSchedule"
    cores: 1
    coreLimit: "4000m"
    instances: 3
    memory: "4g"
    labels:
      version: 3.1.1
  sparkConf:
    "spark.executor.extraJavaOptions": "-Djava.net.preferIPv6Addresses=true -Dlog4j.debug=true -Dcom.amazonaws.sdk.disableCertChecking=true"
    "spark.driver.extraJavaOptions": "-Djava.net.preferIPv6Addresses=true -Dlog4j.debug=true -Dcom.amazonaws.sdk.disableCertChecking=true"
Issue: the driver pod goes to the Pending state because the tolerations are missing.
Error: 0/50 nodes are available: 1 node(s) were unschedulable, 10 node(s) had untolerated taint {pfamily: symops}, 13 node(s) had untolerated taint {pfamily: aiplatform}, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 6 node(s) had untolerated taint {pfamily: penableric}, 9 node(s) had untolerated taint {pfamily: symplanbizsec}, 9 node(s) had untolerated taint {pfamily: symplatform}. preemption: 0/50 nodes are available: 50 Preemption is not helpful for scheduling..
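Two more checks that may help narrow down why the webhook is not patching the pods (the operator deployment and namespace names below are assumptions based on this report):

# Is the mutating webhook registered at all?
kubectl get mutatingwebhookconfigurations | grep -i spark
# Was the operator started with the webhook enabled? (look for an -enable-webhook=true style flag in the args)
kubectl -n ns-rm-dp-spark-operator get deploy spark-operator -o jsonpath='{.spec.template.spec.containers[0].args}'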
I have the same problem: no tolerations on the driver pod.
I see the same issue: the tolerations are present on the SparkApplication (they show up in describe SparkApplication), but the spark-submit command ignores them and the driver/executor pods come up without the tolerations attached.
I have the same problem too.
Did you find any workaround to this problem?
I ran into the same issue and enabling the webhooks fixed the problem.
values: {
  webhook: {
    enable: true,
  },
}
Chart: spark-operator-1.2.14
App version: v1beta2-1.4.5-3.5.0
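For anyone setting this through a plain values.yaml instead of inline values, the equivalent (a minimal sketch of just the relevant key) would be:

webhook:
  enable: true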
As @pasdoy mentioned, tolerations are patched in by the webhook. Enable it in your values.yaml; this issue can be closed.