spark-operator

Driver Pod is running, but executors do not start — Caused by: java.net.SocketTimeoutException: timeout

Open allenhaozi opened this issue 2 years ago • 2 comments

++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ '[' -z '' ']'
++ java -XshowSettings:properties -version
++ grep java.home
++ awk '{print $3}'
+ JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
+ SPARK_CLASSPATH=':/usr/local/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/usr/local/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /usr/local/spark/bin/spark-submit --conf spark.driver.bindAddress=10.42.3.233 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///home/ailake/work/src/LoadCustomer.py --assign_time 2022-06-15
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.2.1-bin-hadoop3.2/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/06/28 08:01:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/06/28 08:01:50 INFO SparkContext: Running Spark version 3.2.1
22/06/28 08:01:51 INFO ResourceUtils: ==============================================================
22/06/28 08:01:51 INFO ResourceUtils: No custom resources configured for spark.driver.
22/06/28 08:01:51 INFO ResourceUtils: ==============================================================
22/06/28 08:01:51 INFO SparkContext: Submitted application: MyApp
22/06/28 08:01:51 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 8, script: , vendor: , memory -> name: memory, amount: 4000, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/06/28 08:01:51 INFO ResourceProfile: Limiting resource is cpus at 8 tasks per executor
22/06/28 08:01:51 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/06/28 08:01:51 INFO SecurityManager: Changing view acls to: root,ailake
22/06/28 08:01:51 INFO SecurityManager: Changing modify acls to: root,ailake
22/06/28 08:01:51 INFO SecurityManager: Changing view acls groups to: 
22/06/28 08:01:51 INFO SecurityManager: Changing modify acls groups to: 
22/06/28 08:01:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root, ailake); groups with view permissions: Set(); users  with modify permissions: Set(root, ailake); groups with modify permissions: Set()
22/06/28 08:01:51 INFO Utils: Successfully started service 'sparkDriver' on port 7078.
22/06/28 08:01:51 INFO SparkEnv: Registering MapOutputTracker
22/06/28 08:01:51 INFO SparkEnv: Registering BlockManagerMaster
22/06/28 08:01:51 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/06/28 08:01:51 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/06/28 08:01:51 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/06/28 08:01:51 INFO DiskBlockManager: Created local directory at /var/data/spark-eb48620c-3ce4-485c-8b76-8efff63a2e7d/blockmgr-d9067949-8e03-49c1-86c3-4ad0596da84c
22/06/28 08:01:51 INFO MemoryStore: MemoryStore started with capacity 1020.0 MiB
22/06/28 08:01:51 INFO SparkEnv: Registering OutputCommitCoordinator
22/06/28 08:01:51 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/06/28 08:01:52 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc:4040
22/06/28 08:01:52 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
22/06/28 08:01:53 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 3, known: 0, sharedSlotFromPendingPods: 2147483647.
22/06/28 08:01:53 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/06/28 08:01:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
22/06/28 08:01:53 INFO NettyBlockTransferService: Server created on load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc:7079
22/06/28 08:01:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/06/28 08:01:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc, 7079, None)
22/06/28 08:01:53 INFO BlockManagerMasterEndpoint: Registering block manager load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc:7079 with 1020.0 MiB RAM, BlockManagerId(driver, load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc, 7079, None)
22/06/28 08:01:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc, 7079, None)
22/06/28 08:01:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc, 7079, None)
22/06/28 08:02:03 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [Pod]  with name: [null]  in namespace: [airflow]  failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:400)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:117)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocator.scala:117)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorPodsSnapshotsStoreImpl.scala:138)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorPodsSnapshotsStoreImpl.scala:126)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:81)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.SocketTimeoutException: timeout
        at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:672)
        at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:680)
        at okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.java:153)
        at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:131)
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:133)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.TokenRefreshInterceptor.intercept(TokenRefreshInterceptor.java:42)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createApplicableInterceptors$6(HttpClientUtils.java:284)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
        at okhttp3.RealCall.execute(RealCall.java:93)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:541)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:504)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:292)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:893)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:372)
        ... 21 more
22/06/28 08:02:04 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 3, known: 0, sharedSlotFromPendingPods: 2147483647.
22/06/28 08:02:04 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/06/28 08:02:14 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.

allenhaozi avatar Jun 28 '22 13:06 allenhaozi

It looks like the driver's connection to the API server timed out, so check the API server logs.
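To narrow this down, it may also help to look at Kubernetes events in the driver's namespace and at the API server itself. A minimal sketch, assuming the `airflow` namespace from the log above and a kubeadm-style cluster where the API server runs as a labeled pod in `kube-system`; `<driver-pod>` is a placeholder for the actual driver pod name:

```shell
# Recent events in the namespace where the driver runs; failed executor
# pod creations usually surface here as Warning events.
kubectl -n airflow get events --sort-by=.lastTimestamp | tail -n 20

# Inspect the driver pod for warnings (<driver-pod> is a placeholder).
kubectl -n airflow describe pod <driver-pod>

# API server logs; the label selector assumes a kubeadm-style control plane.
kubectl -n kube-system logs -l component=kube-apiserver --tail=100
```

If the API server shows slow or dropped requests around 08:02, that points at control-plane load or network trouble rather than the Spark job itself.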

dcoliversun avatar Jul 04 '22 01:07 dcoliversun

Thank you @dcoliversun. If I kill the driver Pod manually and let it restart, it works — it seems to be an intermittent event.
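For intermittent API-server timeouts like this, one knob worth trying is the Kubernetes client timeouts that Spark 3.x exposes — the stack trace shows the driver's fabric8 client timing out on the executor-pod create call, and both the connect and request timeouts default to 10000 ms. A hedged sketch; the 30000 ms values are illustrative, not a recommendation:

```shell
# Raise the Kubernetes client timeouts (Spark 3.x confs, defaults 10000 ms).
# The driver.* settings cover the driver's requests to create executor pods;
# the submission.* settings cover spark-submit itself.
spark-submit \
  --conf spark.kubernetes.driver.connectionTimeout=30000 \
  --conf spark.kubernetes.driver.requestTimeout=30000 \
  --conf spark.kubernetes.submission.connectionTimeout=30000 \
  --conf spark.kubernetes.submission.requestTimeout=30000 \
  ...
```

This only papers over a slow API server, but it can turn a hard failure into a successful retry when the slowness is transient.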

allenhaozi avatar Jul 04 '22 07:07 allenhaozi