spark-operator
KubernetesClientException: too old resource version
Image used: gcr.io/spark-operator/spark:v3.1.1
kubernetes client jar: kubernetes-client-4.12.0.jar
We are getting this issue intermittently.
Relevant Logs:
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:258)
at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Any pointers on how to fix this?
Relevant discussion - https://issues.apache.org/jira/browse/SPARK-33349
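For reference, if I read SPARK-33349 correctly, the idea there is to re-establish the executor-pod watch when the API server answers HTTP 410 Gone (which is what "too old resource version" means) rather than letting the watch go quiet. A minimal fabric8 sketch of that pattern, assuming kubernetes-client 5.x (the Watcher/WatcherException signatures differ in 4.x); the class name and namespace are hypothetical:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class RestartingPodWatcher implements Watcher<Pod> {
  private final KubernetesClient client;
  private final String namespace;

  RestartingPodWatcher(KubernetesClient client, String namespace) {
    this.client = client;
    this.namespace = namespace;
  }

  @Override
  public void eventReceived(Action action, Pod pod) {
    System.out.printf("%s %s%n", action, pod.getMetadata().getName());
  }

  @Override
  public void onClose(WatcherException cause) {
    // "too old resource version" arrives here as HTTP 410 Gone;
    // re-open the watch instead of letting pod tracking silently stop.
    if (cause != null && cause.isHttpGone()) {
      client.pods().inNamespace(namespace).watch(this);
    }
  }

  public static void main(String[] args) {
    KubernetesClient client = new DefaultKubernetesClient();
    client.pods().inNamespace("default").watch(new RestartingPodWatcher(client, "default"));
  }
}
```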
We're seeing the exact same issue with PySpark v3.2.1. The streaming jobs just stall instead of the driver quitting so the job can restart.
22/03/28 19:57:19 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 1499049025 (1499196141)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 1499049025 (1499196141)
... 11 more
Update: this message usually appears 10-20 minutes after the log item right before it, so it may be a red herring. Unfortunately, there are no errors in the log before it, so there's nothing else obvious to share.
@cmoad what are your k8s client and server versions?
Server version: v1.21.9
Client version: kubernetes-client-5.4.1.jar
Should be good based on the compatibility matrix: https://github.com/fabric8io/kubernetes-client#kubernetes-compatibility-matrix
Faced the same issue. Spark hangs forever right after the write-to-Parquet stage ends.
kubernetes-client-5.4.1 Server Version: version.Info{
Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9"
Fwiw, we now believe this error was a red herring. We found the true cause to be a quiet OOM on an executor; this error seemed to appear a while later in the driver logs. Anyone else seeing this should look carefully for errors happening ~20-30 seconds earlier.
Yep, I had OOM issues that preceded the error above. Bumping the kubernetes-client version to 5.5.0 helped catch it: no app hanging, just the usual OOM crash.
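For anyone chasing the same quiet executor OOM, here is a minimal sketch of giving executors more headroom and keeping failed pods around for inspection. spark.executor.memoryOverhead and spark.kubernetes.executor.deleteOnTermination are standard Spark-on-Kubernetes settings, but the app name and memory sizes below are hypothetical; tune them to your workload:

```java
import org.apache.spark.sql.SparkSession;

public class ExecutorMemoryHeadroom {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("parquet-writer")                      // hypothetical app name
        .config("spark.executor.memory", "6g")          // example sizes; tune to your workload
        .config("spark.executor.memoryOverhead", "1g")  // off-heap headroom, often behind "quiet" OOMKilled pods
        // Keep terminated executor pods so `kubectl describe pod` can show the OOMKilled reason.
        .config("spark.kubernetes.executor.deleteOnTermination", "false")
        .getOrCreate();

    spark.stop();
  }
}
```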
Server version: v1.20.15-eks-84b4fe6
Client version: kubernetes-client-5.4.1.jar
We are getting this issue intermittently.
2022/08/28 19:51:20 INFO SparkTBinaryFrontendService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
2022/08/28 19:51:20 INFO SparkSQLSessionManager: Opening session for [email protected]
2022/08/28 19:51:20 WARN SparkSessionImpl: Cannot modify the value of a Spark config: spark.driver.memory
2022/08/28 19:51:20 INFO SparkSQLSessionManager: hive's session with SessionHandle [4465a33f-8ac6-4f0b-bccd-dd4702ee0b7b] is opened, current opening sessions 1
2022/08/28 20:42:34 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 61967434 (62009935)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 61967434 (62009935)
... 11 more
Have you set the allowWatchBookmarks param in the watch options? https://kubernetes.io/docs/reference/using-api/api-concepts/
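For anyone unfamiliar with that option: bookmarks let the API server periodically send the latest resourceVersion so a restarted watch does not fall too far behind. Below is a rough fabric8 sketch of opening a bookmark-enabled watch, assuming kubernetes-client 5.x and that ListOptionsBuilder exposes withAllowWatchBookmarks; the namespace and println output are illustrative only. Note that in Spark the watch is opened inside ExecutorPodsWatchSnapshotSource, so as far as I know this isn't something you can toggle from a Spark config:

```java
import io.fabric8.kubernetes.api.model.ListOptionsBuilder;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class BookmarkWatchExample {
  public static void main(String[] args) {
    KubernetesClient client = new DefaultKubernetesClient();
    // Ask the API server to send periodic BOOKMARK events carrying a fresh resourceVersion.
    client.pods().inNamespace("default").watch(
        new ListOptionsBuilder().withAllowWatchBookmarks(true).build(),
        new Watcher<Pod>() {
          @Override
          public void eventReceived(Action action, Pod pod) {
            System.out.printf("%s %s%n", action, pod.getMetadata().getName());
          }

          @Override
          public void onClose(WatcherException cause) {
            System.out.println("watch closed: " + cause);
          }
        });
  }
}
```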
Hello,
I have a similar situation with my spark application, here are the relevant logs:
22/10/20 11:10:32 INFO BlockManagerMasterEndpoint: Registering block manager 10.144.4.100:46847 with 2.2 GiB RAM, BlockManagerId(1, 10.144.4.100, 46847, None)
22/10/20 11:10:33 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/10/20 11:10:33 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
22/10/20 11:44:46 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 58557205 (58564779)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 58557205 (58564779)
... 11 more
22/10/20 12:01:37 INFO CodeGenerator: Code generated in 516.484966 ms
I don't have any other exception or error message in my logs, driver or executors. Is there something I can try?
Thank you.
Hello all, I am also facing the same issue. Does anyone have a workaround for it?
I would recommend we close this issue. Several of us have found this error message to be downstream of the true, critical failure.
@cmoad what is the solution for this?
If you are using GCS: after upgrading to 3.3.0 I no longer see the "too old resource version" error, and I found that the "hanging" behavior was actually Spark repairing a bunch of directories in my bucket. https://groups.google.com/g/cloud-dataproc-discuss/c/JKcimdnskJc recommends setting "fs.gs.implicit.dir.repair.enable" to false.
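If it helps anyone, this is roughly how that connector property can be passed through from Spark; the spark.hadoop. prefix forwards it to the Hadoop/GCS connector configuration. The app name and bucket path below are made up:

```java
import org.apache.spark.sql.SparkSession;

public class GcsImplicitDirRepairOff {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("gcs-reader")  // hypothetical app name
        // The spark.hadoop. prefix hands the property to the Hadoop/GCS connector configuration.
        .config("spark.hadoop.fs.gs.implicit.dir.repair.enable", "false")
        .getOrCreate();

    // Hypothetical bucket/path, purely illustrative.
    spark.read().parquet("gs://my-bucket/some/path").show();
  }
}
```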