KubernetesClientException: too old resource version

devender-yadav opened this issue 2 years ago • 9 comments

Image used: gcr.io/spark-operator/spark:v3.1.1
Kubernetes client jar: kubernetes-client-4.12.0.jar

We are getting this issue intermittently.

Relevant Logs:

 io.fabric8.kubernetes.client.KubernetesClientException: too old resource version
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:258)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

Any pointers on how to fix this?

devender-yadav commented on Mar 23 '22

Relevant discussion - https://issues.apache.org/jira/browse/SPARK-33349

devender-yadav commented on Mar 24 '22
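
As background for the linked discussion: the standard recovery for a 410 "too old resource version" is to re-list (the list response carries a fresh resourceVersion) and then open a new watch from that version. Below is a minimal sketch of that pattern with the fabric8 client, assuming kubernetes-client 5.x on the classpath; the namespace is hypothetical, and this is an illustration of the pattern, not Spark's actual implementation:

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodList;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class RelistAndRewatch {
    public static void main(String[] args) throws InterruptedException {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            startWatch(client);
            Thread.sleep(60000);  // keep the demo process alive so events can arrive
        }
    }

    static void startWatch(KubernetesClient client) {
        // Re-list first: the list response carries a fresh resourceVersion that
        // is still inside the API server's retained history window.
        PodList snapshot = client.pods().inNamespace("spark-jobs").list();  // hypothetical namespace
        String fresh = snapshot.getMetadata().getResourceVersion();

        client.pods().inNamespace("spark-jobs").watch(fresh, new Watcher<Pod>() {
            @Override
            public void eventReceived(Action action, Pod pod) {
                System.out.println(action + " " + pod.getMetadata().getName());
            }

            @Override
            public void onClose(WatcherException cause) {
                if (cause != null && cause.isHttpGone()) {
                    // 410 Gone: our resourceVersion expired; re-list and re-watch.
                    startWatch(client);
                }
            }
        });
    }
}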

We're seeing the exact same issue with PySpark v3.2.1. The streaming jobs just stall instead of the driver exiting so the job can restart.

22/03/28 19:57:19 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 1499049025 (1499196141)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 1499049025 (1499196141)
	... 11 more

Update: This message usually appears 10-20 minutes after the log item right before it, so it may be a red herring. Unfortunately, there are no errors in the log before it, so there's nothing else obvious to share.

cmoad commented on Mar 28 '22

@cmoad what are your k8s client and server versions?

devender-yadav commented on Mar 30 '22

Server version: v1.21.9
Client version: kubernetes-client-5.4.1.jar

Should be good based on the compatibility matrix: https://github.com/fabric8io/kubernetes-client#kubernetes-compatibility-matrix

cmoad commented on Mar 30 '22

Faced the same issue. Spark hangs forever right after the write-to-Parquet stage ends.

Client: kubernetes-client-5.4.1
Server version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9"}

devrivne commented on May 07 '22

Fwiw, we now believe this error was a red herring. We found the true cause to be a quiet OOM on an executor, and this error seemed to appear a while later in the driver logs. Anyone else seeing this should look closely for errors occurring ~20-30 seconds beforehand.

cmoad commented on May 09 '22

Yep, I had OOM issues that preceded the above error. Bumping the client version up to 5.5.0 helped to catch it. No app hanging, just the usual OOM crash.

devrivne commented on May 10 '22
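
For anyone in the same situation: on Kubernetes the kubelet kills the whole pod when a container exceeds its memory limit, so the usual first mitigation is to give executors more heap plus more off-heap overhead. A minimal sketch using Spark's Java API; the values are hypothetical starting points, not recommendations:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class ExecutorMemoryTuning {
    public static void main(String[] args) {
        // More JVM heap per executor, plus extra non-heap headroom so the pod
        // is not OOM-killed before the JVM itself can log anything useful.
        SparkConf conf = new SparkConf()
                .set("spark.executor.memory", "4g")           // hypothetical heap size
                .set("spark.executor.memoryOverhead", "1g");  // hypothetical off-heap headroom

        SparkSession spark = SparkSession.builder()
                .appName("executor-memory-tuning")
                .config(conf)
                .getOrCreate();

        spark.range(1000000).count();  // placeholder workload
        spark.stop();
    }
}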

Server version: v1.20.15-eks-84b4fe6
Client version: kubernetes-client-5.4.1.jar

We are getting this issue intermittently.


2022/08/28 19:51:20 INFO SparkTBinaryFrontendService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
2022/08/28 19:51:20 INFO SparkSQLSessionManager: Opening session for [email protected]
2022/08/28 19:51:20 WARN SparkSessionImpl: Cannot modify the value of a Spark config: spark.driver.memory
2022/08/28 19:51:20 INFO SparkSQLSessionManager: hive's session with SessionHandle [4465a33f-8ac6-4f0b-bccd-dd4702ee0b7b] is opened, current opening sessions 1
2022/08/28 20:42:34 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 61967434 (62009935)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 61967434 (62009935)
	... 11 more

qshian commented on Aug 29 '22

Have you set the allowWatchBookmarks param in your watch options? https://kubernetes.io/docs/reference/using-api/api-concepts/

wang007 commented on Sep 16 '22
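
To illustrate the question above: with the fabric8 client, bookmarks are requested through ListOptions when opening a watch, so the watcher keeps receiving fresh resourceVersions even when no real events arrive. A minimal sketch assuming kubernetes-client 5.x; the namespace is hypothetical, and this shows the raw client API rather than a knob Spark itself exposes:

import io.fabric8.kubernetes.api.model.ListOptions;
import io.fabric8.kubernetes.api.model.ListOptionsBuilder;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class BookmarkWatch {
    public static void main(String[] args) throws InterruptedException {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            // Ask the API server to send periodic BOOKMARK events so the watch's
            // resourceVersion stays current and is less likely to fall out of
            // etcd's history window ("too old resource version").
            ListOptions options = new ListOptionsBuilder()
                    .withAllowWatchBookmarks(true)
                    .build();

            client.pods().inNamespace("spark-jobs").watch(options, new Watcher<Pod>() {  // hypothetical namespace
                @Override
                public void eventReceived(Action action, Pod pod) {
                    System.out.println(action + " " + pod.getMetadata().getName());
                }

                @Override
                public void onClose(WatcherException cause) {
                    System.err.println("watch closed: " + cause);
                }
            });

            Thread.sleep(60000);  // keep the demo alive so events can arrive
        }
    }
}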

Hello,

I have a similar situation with my spark application, here are the relevant logs:

22/10/20 11:10:32 INFO BlockManagerMasterEndpoint: Registering block manager 10.144.4.100:46847 with 2.2 GiB RAM, BlockManagerId(1, 10.144.4.100, 46847, None)
22/10/20 11:10:33 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/10/20 11:10:33 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
22/10/20 11:44:46 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 58557205 (58564779)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 58557205 (58564779)
	... 11 more
22/10/20 12:01:37 INFO CodeGenerator: Code generated in 516.484966 ms

I don't have any other exception or error message in my logs, driver or executors. Is there something I can try?

Thank you.

pedro93 commented on Oct 20 '22

Hello all, I am also facing the same issue. Does anyone have a workaround for it?

sbbagal13 commented on Mar 01 '23

I would recommend we close this issue. Several of us have reported this error message as downstream of the true, critical failure.

cmoad commented on Mar 01 '23

@cmoad what is the solution for this?

sbbagal13 commented on Mar 01 '23

If you are using GCS: after upgrading to Spark 3.3.0 I no longer see the "too old resource version" error, and I found that the "hanging" behavior was actually Spark repairing a bunch of directories in my bucket. https://groups.google.com/g/cloud-dataproc-discuss/c/JKcimdnskJc recommends setting "fs.gs.implicit.dir.repair.enable" to false.

noahshpak commented on Aug 03 '23
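
To make that workaround concrete: GCS connector options can be forwarded through Spark's "spark.hadoop." config prefix. A minimal sketch in Java; the bucket path is hypothetical, and whether disabling implicit directory repair is safe depends on how the bucket is used by other tools:

import org.apache.spark.sql.SparkSession;

public class DisableImplicitDirRepair {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("gcs-dir-repair-off")
                // Hadoop options are forwarded with the "spark.hadoop." prefix;
                // this disables the GCS connector's implicit directory repair,
                // which can look like a hang after a large write.
                .config("spark.hadoop.fs.gs.implicit.dir.repair.enable", "false")
                .getOrCreate();

        spark.range(10).write().mode("overwrite").parquet("gs://my-bucket/example-output");  // hypothetical bucket
        spark.stop();
    }
}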