spark-operator
Spark Executor is failing Unknown (Container not Found)
I'm running a pretty vanilla version of the example, but I'm getting this strange failure and I'm not even sure where to look to debug it.
My deployment:
#
# Copyright 2017 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: env-z83u9r
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.1.1"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
My RBAC:
#
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
apiVersion: v1
kind: Namespace
metadata:
  name: env-z83u9r
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: env-z83u9r
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: env-z83u9r
  name: spark-role
rules:
  - apiGroups: [""]
    resources: ["pods", "configmaps"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: env-z83u9r
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: env-z83u9r
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
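The service account's effective permissions can be spot-checked with kubectl auth can-i (a sketch; names match the manifests above):

```shell
# Verify the spark service account can manage pods and services
# in its namespace before blaming RBAC for the failure.
kubectl auth can-i create pods \
  --as=system:serviceaccount:env-z83u9r:spark -n env-z83u9r
kubectl auth can-i '*' services \
  --as=system:serviceaccount:env-z83u9r:spark -n env-z83u9r
```

Both commands should print "yes" with the Role and RoleBinding above in place.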
Output from the operator:
I0121 18:45:47.466658 9 controller.go:263] Starting processing key: "env-z83u9r/spark-pi"
I0121 18:45:47.466829 9 controller.go:822] Update the status of SparkApplication env-z83u9r/spark-pi from:
{
"sparkApplicationId": "spark-18ae0a4e328e4065b5e3290ec9cb456a",
"submissionID": "a7c44822-8479-4fb1-9727-02512fba27ce",
"lastSubmissionAttemptTime": "2022-01-21T18:45:22Z",
"terminationTime": "2022-01-21T18:45:46Z",
"driverInfo": {
"podName": "spark-pi-driver"
},
"applicationState": {
"state": "COMPLETED"
},
"executorState": {
"spark-pi-b3ca227e7df53ff3-exec-1": "RUNNING"
},
"executionAttempts": 1,
"submissionAttempts": 1
}
to:
{
"sparkApplicationId": "spark-18ae0a4e328e4065b5e3290ec9cb456a",
"submissionID": "a7c44822-8479-4fb1-9727-02512fba27ce",
"lastSubmissionAttemptTime": "2022-01-21T18:45:22Z",
"terminationTime": "2022-01-21T18:45:46Z",
"driverInfo": {
"podName": "spark-pi-driver"
},
"applicationState": {
"state": "COMPLETED"
},
"executorState": {
"spark-pi-b3ca227e7df53ff3-exec-1": "FAILED"
},
"executionAttempts": 1,
"submissionAttempts": 1
}
I0121 18:45:47.468668 9 event.go:282] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"env-z83u9r", Name:"spark-pi", UID:"3fbbb2df-ac7d-414c-b788-9ce9aa79b5d7", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"41402250", FieldPath:""}): type: 'Warning' reason: 'SparkExecutorFailed' Executor [spark-pi-b3ca227e7df53ff3-exec-1 %!s(int=-1) Unknown (Container not Found)] failed with ExitCode: %!d(MISSING), Reason: %!s(MISSING)
I0121 18:45:47.495473 9 controller.go:223] SparkApplication env-z83u9r/spark-pi was updated, enqueuing it
I0121 18:45:47.501062 9 sparkapp_metrics.go:287] Exporting Metrics for Executor spark-pi-b3ca227e7df53ff3-exec-1. OldState: RUNNING NewState: FAILED
I0121 18:45:47.501123 9 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 0
I0121 18:45:47.501157 9 controller.go:270] Ending processing key: "env-z83u9r/spark-pi"
I0121 18:45:47.501184 9 controller.go:263] Starting processing key: "env-z83u9r/spark-pi"
I0121 18:45:47.501531 9 controller.go:270] Ending processing key: "env-z83u9r/spark-pi"
I0121 18:45:54.089752 9 spark_pod_eventhandler.go:58] Pod spark-pi-b3ca227e7df53ff3-exec-1 updated in namespace env-z83u9r.
I0121 18:45:54.090161 9 spark_pod_eventhandler.go:95] Enqueuing SparkApplication env-z83u9r/spark-pi for app update processing.
I0121 18:45:54.090183 9 controller.go:263] Starting processing key: "env-z83u9r/spark-pi"
I0121 18:45:54.090245 9 controller.go:270] Ending processing key: "env-z83u9r/spark-pi"
I0121 18:45:54.090344 9 spark_pod_eventhandler.go:77] Pod spark-pi-b3ca227e7df53ff3-exec-1 deleted in namespace env-z83u9r.
I0121 18:45:54.090357 9 spark_pod_eventhandler.go:95] Enqueuing SparkApplication env-z83u9r/spark-pi for app update processing.
I0121 18:45:54.090369 9 controller.go:263] Starting processing key: "env-z83u9r/spark-pi"
I0121 18:45:54.090407 9 controller.go:270] Ending processing key: "env-z83u9r/spark-pi"
I grabbed the logs from the executor pod before it terminated:
╰$ k logs -n env-z83u9r -f spark-pi-3d89887e7e908a0c-exec-1
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP --resourceProfileId $SPARK_RESOURCE_PROFILE_ID)
+ exec /usr/bin/tini -s -- /usr/local/openjdk-11/bin/java -Dspark.driver.blockManager.port=7079 -Dspark.driver.port=7078 -Xms512m -Xmx512m -cp '/opt/spark/conf::/opt/spark/jars/*:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@spark-pi-86f4997e7e905879-driver-svc.env-z83u9r.svc:7078 --executor-id 1 --cores 1 --app-id spark-ce8ab7c4233c46fca4c0e1038e317d49 --hostname 10.0.50.105 --resourceProfileId 0
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/01/21 21:35:16 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 15@spark-pi-3d89887e7e908a0c-exec-1
22/01/21 21:35:16 INFO SignalUtils: Registering signal handler for TERM
22/01/21 21:35:16 INFO SignalUtils: Registering signal handler for HUP
22/01/21 21:35:16 INFO SignalUtils: Registering signal handler for INT
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/01/21 21:35:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/21 21:35:17 INFO SecurityManager: Changing view acls to: 185,root
22/01/21 21:35:17 INFO SecurityManager: Changing modify acls to: 185,root
22/01/21 21:35:17 INFO SecurityManager: Changing view acls groups to:
22/01/21 21:35:17 INFO SecurityManager: Changing modify acls groups to:
22/01/21 21:35:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185, root); groups with view permissions: Set(); users with modify permissions: Set(185, root); groups with modify permissions: Set()
22/01/21 21:35:18 INFO TransportClientFactory: Successfully created connection to spark-pi-86f4997e7e905879-driver-svc.env-z83u9r.svc/10.0.50.110:7078 after 186 ms (0 ms spent in bootstraps)
22/01/21 21:35:18 INFO SecurityManager: Changing view acls to: 185,root
22/01/21 21:35:18 INFO SecurityManager: Changing modify acls to: 185,root
22/01/21 21:35:18 INFO SecurityManager: Changing view acls groups to:
22/01/21 21:35:18 INFO SecurityManager: Changing modify acls groups to:
22/01/21 21:35:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185, root); groups with view permissions: Set(); users with modify permissions: Set(185, root); groups with modify permissions: Set()
22/01/21 21:35:18 INFO TransportClientFactory: Successfully created connection to spark-pi-86f4997e7e905879-driver-svc.env-z83u9r.svc/10.0.50.110:7078 after 6 ms (0 ms spent in bootstraps)
22/01/21 21:35:19 INFO DiskBlockManager: Created local directory at /var/data/spark-66a6906c-47a7-4545-9327-47c47548b91a/blockmgr-9fcc69cb-702e-4da9-9c05-392007a4a938
22/01/21 21:35:19 INFO MemoryStore: MemoryStore started with capacity 117.0 MiB
22/01/21 21:35:19 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@spark-pi-86f4997e7e905879-driver-svc.env-z83u9r.svc:7078
22/01/21 21:35:19 INFO ResourceUtils: ==============================================================
22/01/21 21:35:19 INFO ResourceUtils: No custom resources configured for spark.executor.
22/01/21 21:35:19 INFO ResourceUtils: ==============================================================
22/01/21 21:35:19 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
22/01/21 21:35:19 INFO Executor: Starting executor ID 1 on host 10.0.50.105
22/01/21 21:35:19 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43781.
22/01/21 21:35:19 INFO NettyBlockTransferService: Server created on 10.0.50.105:43781
22/01/21 21:35:19 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/01/21 21:35:19 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, 10.0.50.105, 43781, None)
22/01/21 21:35:19 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, 10.0.50.105, 43781, None)
22/01/21 21:35:19 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, 10.0.50.105, 43781, None)
22/01/21 21:35:19 INFO Executor: Fetching file:/opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar with timestamp 1642800906140
22/01/21 21:35:19 INFO Utils: Copying /opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar to /var/data/spark-66a6906c-47a7-4545-9327-47c47548b91a/spark-f063f5c7-3790-49be-bbfc-033aa5ebf1b6/15615854841642800906140_cache
22/01/21 21:35:19 INFO Utils: Copying /var/data/spark-66a6906c-47a7-4545-9327-47c47548b91a/spark-f063f5c7-3790-49be-bbfc-033aa5ebf1b6/15615854841642800906140_cache to /opt/spark/work-dir/./spark-examples_2.12-3.1.1.jar
22/01/21 21:35:19 INFO Executor: Adding file:/opt/spark/work-dir/./spark-examples_2.12-3.1.1.jar to class loader
22/01/21 21:35:21 INFO CoarseGrainedExecutorBackend: Got assigned task 0
22/01/21 21:35:21 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/01/21 21:35:21 INFO TorrentBroadcast: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
22/01/21 21:35:21 INFO TransportClientFactory: Successfully created connection to spark-pi-86f4997e7e905879-driver-svc.env-z83u9r.svc/10.0.50.110:7079 after 9 ms (0 ms spent in bootstraps)
22/01/21 21:35:21 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1816.0 B, free 117.0 MiB)
22/01/21 21:35:21 INFO TorrentBroadcast: Reading broadcast variable 0 took 213 ms
22/01/21 21:35:21 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.1 KiB, free 117.0 MiB)
22/01/21 21:35:22 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1000 bytes result sent to driver
22/01/21 21:35:22 INFO CoarseGrainedExecutorBackend: Got assigned task 1
22/01/21 21:35:22 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
22/01/21 21:35:22 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 957 bytes result sent to driver
22/01/21 21:35:22 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
22/01/21 21:35:22 INFO MemoryStore: MemoryStore cleared
22/01/21 21:35:22 INFO BlockManager: BlockManager stopped
22/01/21 21:35:22 INFO ShutdownHookManager: Shutdown hook called
22/01/21 21:35:22 INFO ShutdownHookManager: Deleting directory /var/data/spark-66a6906c-47a7-4545-9327-47c47548b91a/spark-f063f5c7-3790-49be-bbfc-033aa5ebf1b6
rpc error: code = NotFound desc = could not find container "bcf109cfdf10a107224e316cd6effc792049be6a3c5c67daac91db76c45e9fa8": container with ID starting with bcf109cfdf10a107224e316cd6effc792049be6a3c5c67daac91db76c45e9fa8 not found: ID does not exist
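For what it's worth, before the executor pod object disappears, its terminal status and recent events can still be pulled with something like this (pod and namespace names are from my run; substitute your own):

```shell
NS=env-z83u9r
POD=spark-pi-3d89887e7e908a0c-exec-1

# Show the container's current and last recorded state
# (exit code and termination reason, if the kubelet kept them).
kubectl get pod "$POD" -n "$NS" \
  -o jsonpath='{.status.containerStatuses[0].state}{"\n"}{.status.containerStatuses[0].lastState}{"\n"}'

# Recent events for the pod (scheduling, kills, probe failures).
kubectl get events -n "$NS" \
  --field-selector "involvedObject.name=$POD" \
  --sort-by=.lastTimestamp
```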
@kevinlmadison hey, were you able to resolve this issue?
@kevinlmadison Did you resolve the issue? What was the fix?
@kevinlmadison I had the same problem.
You can check if your namespace has istio-injection enabled with the following command:
kubectl describe ns/<your namespace>
If you see "istio-injection=enabled", then you need to explicitly disable istio injection in your SparkApplication YAML file.
For example:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: <your namespace>
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.1.1"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    annotations:
      sidecar.istio.io/inject: "false"
    serviceAccount: <your sa>
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.1.1
    annotations:
      sidecar.istio.io/inject: "false"
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
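To confirm the annotation actually took effect, you can list the containers in each pod; if istio-proxy still shows up next to the spark container, injection is still happening (namespace is a placeholder):

```shell
NS=<your namespace>
# Print each pod followed by the names of its containers.
kubectl get pods -n "$NS" -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.name}{" "}{end}{"\n"}{end}'
```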
Unfortunately this didn't work for me, because istio-injection is not set on my namespace. I keep getting the same error, and I have no clue where to look or what to troubleshoot.
Has anyone found a root cause or another possible solution?