
Spark Executor is failing Unknown (Container not Found)

Open · kevinlmadison opened this issue 3 years ago · 4 comments

I've got a pretty vanilla version of the example that I'm trying to run, but I'm getting this strange failure. I'm not even sure where to look to debug this.

My deployment

#
# Copyright 2017 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: env-z83u9r
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.1.1"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

My RBAC

#
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
apiVersion: v1
kind: Namespace
metadata:
  name: env-z83u9r
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: env-z83u9r
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: env-z83u9r
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: env-z83u9r
subjects:
- kind: ServiceAccount
  name: spark
  namespace: env-z83u9r
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
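
For reference, a minimal sketch of how the two manifests above can be applied and the RBAC sanity-checked; the file names are assumptions, not something from this thread:

# Assumed file names for the RBAC and SparkApplication manifests above.
kubectl apply -f spark-rbac.yaml
kubectl apply -f spark-pi.yaml

# Check that the spark service account can actually manage the resources
# the Role grants (each command should print "yes"):
kubectl auth can-i create pods --as=system:serviceaccount:env-z83u9r:spark -n env-z83u9r
kubectl auth can-i create services --as=system:serviceaccount:env-z83u9r:spark -n env-z83u9r
kubectl auth can-i create configmaps --as=system:serviceaccount:env-z83u9r:spark -n env-z83u9r

# Watch the application and its pods:
kubectl get sparkapplication spark-pi -n env-z83u9r -w
kubectl get pods -n env-z83u9r -w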

Output from the operator

I0121 18:45:47.466658       9 controller.go:263] Starting processing key: "env-z83u9r/spark-pi"
I0121 18:45:47.466829       9 controller.go:822] Update the status of SparkApplication env-z83u9r/spark-pi from:
{
  "sparkApplicationId": "spark-18ae0a4e328e4065b5e3290ec9cb456a",
  "submissionID": "a7c44822-8479-4fb1-9727-02512fba27ce",
  "lastSubmissionAttemptTime": "2022-01-21T18:45:22Z",
  "terminationTime": "2022-01-21T18:45:46Z",
  "driverInfo": {
    "podName": "spark-pi-driver"
  },
  "applicationState": {
    "state": "COMPLETED"
  },
  "executorState": {
    "spark-pi-b3ca227e7df53ff3-exec-1": "RUNNING"
  },
  "executionAttempts": 1,
  "submissionAttempts": 1
}
to:
{
  "sparkApplicationId": "spark-18ae0a4e328e4065b5e3290ec9cb456a",
  "submissionID": "a7c44822-8479-4fb1-9727-02512fba27ce",
  "lastSubmissionAttemptTime": "2022-01-21T18:45:22Z",
  "terminationTime": "2022-01-21T18:45:46Z",
  "driverInfo": {
    "podName": "spark-pi-driver"
  },
  "applicationState": {
    "state": "COMPLETED"
  },
  "executorState": {
    "spark-pi-b3ca227e7df53ff3-exec-1": "FAILED"
  },
  "executionAttempts": 1,
  "submissionAttempts": 1
}
I0121 18:45:47.468668       9 event.go:282] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"env-z83u9r", Name:"spark-pi", UID:"3fbbb2df-ac7d-414c-b788-9ce9aa79b5d7", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"41402250", FieldPath:""}): type: 'Warning' reason: 'SparkExecutorFailed' Executor [spark-pi-b3ca227e7df53ff3-exec-1 %!s(int=-1) Unknown (Container not Found)] failed with ExitCode: %!d(MISSING), Reason: %!s(MISSING)
I0121 18:45:47.495473       9 controller.go:223] SparkApplication env-z83u9r/spark-pi was updated, enqueuing it
I0121 18:45:47.501062       9 sparkapp_metrics.go:287] Exporting Metrics for Executor spark-pi-b3ca227e7df53ff3-exec-1. OldState: RUNNING NewState: FAILED
I0121 18:45:47.501123       9 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 0
I0121 18:45:47.501157       9 controller.go:270] Ending processing key: "env-z83u9r/spark-pi"
I0121 18:45:47.501184       9 controller.go:263] Starting processing key: "env-z83u9r/spark-pi"
I0121 18:45:47.501531       9 controller.go:270] Ending processing key: "env-z83u9r/spark-pi"
I0121 18:45:54.089752       9 spark_pod_eventhandler.go:58] Pod spark-pi-b3ca227e7df53ff3-exec-1 updated in namespace env-z83u9r.
I0121 18:45:54.090161       9 spark_pod_eventhandler.go:95] Enqueuing SparkApplication env-z83u9r/spark-pi for app update processing.
I0121 18:45:54.090183       9 controller.go:263] Starting processing key: "env-z83u9r/spark-pi"
I0121 18:45:54.090245       9 controller.go:270] Ending processing key: "env-z83u9r/spark-pi"
I0121 18:45:54.090344       9 spark_pod_eventhandler.go:77] Pod spark-pi-b3ca227e7df53ff3-exec-1 deleted in namespace env-z83u9r.
I0121 18:45:54.090357       9 spark_pod_eventhandler.go:95] Enqueuing SparkApplication env-z83u9r/spark-pi for app update processing.
I0121 18:45:54.090369       9 controller.go:263] Starting processing key: "env-z83u9r/spark-pi"
I0121 18:45:54.090407       9 controller.go:270] Ending processing key: "env-z83u9r/spark-pi"

kevinlmadison · Jan 21 '22 19:01

I grabbed the logs from the executor pod before it terminated.

╰$ k logs -n env-z83u9r -f spark-pi-3d89887e7e908a0c-exec-1 
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP --resourceProfileId $SPARK_RESOURCE_PROFILE_ID)
+ exec /usr/bin/tini -s -- /usr/local/openjdk-11/bin/java -Dspark.driver.blockManager.port=7079 -Dspark.driver.port=7078 -Xms512m -Xmx512m -cp '/opt/spark/conf::/opt/spark/jars/*:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@spark-pi-86f4997e7e905879-driver-svc.env-z83u9r.svc:7078 --executor-id 1 --cores 1 --app-id spark-ce8ab7c4233c46fca4c0e1038e317d49 --hostname 10.0.50.105 --resourceProfileId 0
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/01/21 21:35:16 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 15@spark-pi-3d89887e7e908a0c-exec-1
22/01/21 21:35:16 INFO SignalUtils: Registering signal handler for TERM
22/01/21 21:35:16 INFO SignalUtils: Registering signal handler for HUP
22/01/21 21:35:16 INFO SignalUtils: Registering signal handler for INT
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/01/21 21:35:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/21 21:35:17 INFO SecurityManager: Changing view acls to: 185,root
22/01/21 21:35:17 INFO SecurityManager: Changing modify acls to: 185,root
22/01/21 21:35:17 INFO SecurityManager: Changing view acls groups to: 
22/01/21 21:35:17 INFO SecurityManager: Changing modify acls groups to: 
22/01/21 21:35:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(185, root); groups with view permissions: Set(); users  with modify permissions: Set(185, root); groups with modify permissions: Set()
22/01/21 21:35:18 INFO TransportClientFactory: Successfully created connection to spark-pi-86f4997e7e905879-driver-svc.env-z83u9r.svc/10.0.50.110:7078 after 186 ms (0 ms spent in bootstraps)
22/01/21 21:35:18 INFO SecurityManager: Changing view acls to: 185,root
22/01/21 21:35:18 INFO SecurityManager: Changing modify acls to: 185,root
22/01/21 21:35:18 INFO SecurityManager: Changing view acls groups to: 
22/01/21 21:35:18 INFO SecurityManager: Changing modify acls groups to: 
22/01/21 21:35:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(185, root); groups with view permissions: Set(); users  with modify permissions: Set(185, root); groups with modify permissions: Set()
22/01/21 21:35:18 INFO TransportClientFactory: Successfully created connection to spark-pi-86f4997e7e905879-driver-svc.env-z83u9r.svc/10.0.50.110:7078 after 6 ms (0 ms spent in bootstraps)
22/01/21 21:35:19 INFO DiskBlockManager: Created local directory at /var/data/spark-66a6906c-47a7-4545-9327-47c47548b91a/blockmgr-9fcc69cb-702e-4da9-9c05-392007a4a938
22/01/21 21:35:19 INFO MemoryStore: MemoryStore started with capacity 117.0 MiB
22/01/21 21:35:19 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@spark-pi-86f4997e7e905879-driver-svc.env-z83u9r.svc:7078
22/01/21 21:35:19 INFO ResourceUtils: ==============================================================
22/01/21 21:35:19 INFO ResourceUtils: No custom resources configured for spark.executor.
22/01/21 21:35:19 INFO ResourceUtils: ==============================================================
22/01/21 21:35:19 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
22/01/21 21:35:19 INFO Executor: Starting executor ID 1 on host 10.0.50.105
22/01/21 21:35:19 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43781.
22/01/21 21:35:19 INFO NettyBlockTransferService: Server created on 10.0.50.105:43781
22/01/21 21:35:19 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/01/21 21:35:19 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, 10.0.50.105, 43781, None)
22/01/21 21:35:19 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, 10.0.50.105, 43781, None)
22/01/21 21:35:19 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, 10.0.50.105, 43781, None)
22/01/21 21:35:19 INFO Executor: Fetching file:/opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar with timestamp 1642800906140
22/01/21 21:35:19 INFO Utils: Copying /opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar to /var/data/spark-66a6906c-47a7-4545-9327-47c47548b91a/spark-f063f5c7-3790-49be-bbfc-033aa5ebf1b6/15615854841642800906140_cache
22/01/21 21:35:19 INFO Utils: Copying /var/data/spark-66a6906c-47a7-4545-9327-47c47548b91a/spark-f063f5c7-3790-49be-bbfc-033aa5ebf1b6/15615854841642800906140_cache to /opt/spark/work-dir/./spark-examples_2.12-3.1.1.jar
22/01/21 21:35:19 INFO Executor: Adding file:/opt/spark/work-dir/./spark-examples_2.12-3.1.1.jar to class loader
22/01/21 21:35:21 INFO CoarseGrainedExecutorBackend: Got assigned task 0
22/01/21 21:35:21 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/01/21 21:35:21 INFO TorrentBroadcast: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
22/01/21 21:35:21 INFO TransportClientFactory: Successfully created connection to spark-pi-86f4997e7e905879-driver-svc.env-z83u9r.svc/10.0.50.110:7079 after 9 ms (0 ms spent in bootstraps)
22/01/21 21:35:21 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1816.0 B, free 117.0 MiB)
22/01/21 21:35:21 INFO TorrentBroadcast: Reading broadcast variable 0 took 213 ms
22/01/21 21:35:21 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.1 KiB, free 117.0 MiB)
22/01/21 21:35:22 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1000 bytes result sent to driver
22/01/21 21:35:22 INFO CoarseGrainedExecutorBackend: Got assigned task 1
22/01/21 21:35:22 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
22/01/21 21:35:22 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 957 bytes result sent to driver
22/01/21 21:35:22 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
22/01/21 21:35:22 INFO MemoryStore: MemoryStore cleared
22/01/21 21:35:22 INFO BlockManager: BlockManager stopped
22/01/21 21:35:22 INFO ShutdownHookManager: Shutdown hook called
22/01/21 21:35:22 INFO ShutdownHookManager: Deleting directory /var/data/spark-66a6906c-47a7-4545-9327-47c47548b91a/spark-f063f5c7-3790-49be-bbfc-033aa5ebf1b6
rpc error: code = NotFound desc = could not find container "bcf109cfdf10a107224e316cd6effc792049be6a3c5c67daac91db76c45e9fa8": container with ID starting with bcf109cfdf10a107224e316cd6effc792049be6a3c5c67daac91db76c45e9fa8 not found: ID does not exist
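
Note that the executor log above ends with a clean shutdown ("Driver commanded a shutdown" followed by the shutdown hook), and the trailing rpc NotFound error appears to be kubectl failing to keep streaming from a container that has already been removed. A hedged debugging sketch, not a fix: the standard Spark property spark.kubernetes.executor.deleteOnTermination can be set to false via sparkConf so the executor pod survives long enough to be inspected after it exits:

spec:
  sparkConf:
    # Default is "true", which deletes executor pods as soon as they
    # terminate; "false" keeps them around for post-mortem inspection.
    "spark.kubernetes.executor.deleteOnTermination": "false"

With that in place, kubectl get pods -n env-z83u9r -l spark-role=executor should list the finished executor, and kubectl describe pod on it should show the container's real terminal state and exit code rather than Unknown.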

kevinlmadison · Jan 21 '22 21:01

@kevinlmadison hey, were you able to resolve this issue?

stym06 · Apr 08 '22 07:04

@kevinlmadison Did you resolve the issue? What was the fix?

amalkhed · Apr 29 '22 09:04

@kevinlmadison I had the same problem. You can check whether your namespace has istio-injection enabled with the following command: kubectl describe ns/<your namespace>. If you see "istio-injection=enabled", then you need to explicitly turn off istio injection in your SparkApplication YAML file. For example:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: <your namespace>
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.1.1"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    annotations:
      sidecar.istio.io/inject: "false"
    serviceAccount: <your sa>
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.1.1
    annotations:
      sidecar.istio.io/inject: "false"
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
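
If you are not sure whether injection is on, a quick check with standard kubectl (the namespace and pod names below are placeholders):

# Look for the istio-injection=enabled label on the namespace:
kubectl get namespace <your namespace> --show-labels
kubectl describe ns <your namespace> | grep istio-injection

# After adding the annotations above, confirm the driver pod really came
# up without the istio-proxy sidecar:
kubectl get pod spark-pi-driver -n <your namespace> -o jsonpath='{.spec.containers[*].name}'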

yaozhong-liu48 · Jul 16 '22 16:07

> @kevinlmadison I had the same problem. You can check whether your namespace has istio-injection enabled with the following command: kubectl describe ns/<your namespace>. [...]
Unfortunately this didn't work for me, because istio-injection is not set on my namespace. I keep getting this error thrown and have no clue where to look or what to troubleshoot.

Has anyone found a root cause or another possible solution?

BesartSulejmani · Nov 22 '22 11:11