Informer ReflectorRunnable doesn't recover from "Too large resource version"
Describe the bug
Informer ReflectorRunnable doesn't recover from "Too large resource version".
ApiException.getResponseBody() returns something like:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Timeout: Too large resource version: 5399771, current: 5399139","reason":"Timeout","details":{"causes":[{"reason":"ResourceVersionTooLarge","message":"Too large resource version"}],"retryAfterSeconds":1},"code":504}
Based on my understanding, this can happen on a list call when the resourceVersion parameter is too large for that API server / etcd instance.
In this case we should set isLastSyncResourceVersionUnavailable = true;.
See go client: https://github.com/kubernetes/client-go/commit/ec46b97af413cab12270fccbc09bcc69c63e372e
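A minimal sketch of how that cause could be detected from the ApiException, assuming the client's generated JSON helper and V1Status model; the class and helper name below are illustrative, not existing informer code:

```java
import io.kubernetes.client.openapi.ApiException;
import io.kubernetes.client.openapi.JSON;
import io.kubernetes.client.openapi.models.V1Status;
import io.kubernetes.client.openapi.models.V1StatusCause;
import java.util.List;

public class ResourceVersionTooLargeDetector {

  // Hypothetical helper (not part of the client): returns true if the
  // ApiException carries a "ResourceVersionTooLarge" cause, as in the
  // response body shown above.
  static boolean isResourceVersionTooLarge(ApiException e) {
    String body = e.getResponseBody();
    if (body == null) {
      return false;
    }
    try {
      V1Status status = new JSON().deserialize(body, V1Status.class);
      if (status == null
          || status.getDetails() == null
          || status.getDetails().getCauses() == null) {
        return false;
      }
      List<V1StatusCause> causes = status.getDetails().getCauses();
      return causes.stream().anyMatch(c -> "ResourceVersionTooLarge".equals(c.getReason()));
    } catch (RuntimeException parseFailure) {
      // The body is not a structured Status object; treat it as some other failure.
      return false;
    }
  }
}
```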
Client Version
All versions, including 13 and 14.
Kubernetes Version
1.21
Java Version n/a
To Reproduce
Run an informer and make the API server return ResourceVersionTooLarge.
Expected behavior
Set isLastSyncResourceVersionUnavailable and redo a consistent read from etcd (see the relist sketch below).
KubeConfig n/a
Server (please complete the following information): n/a
Additional context n/a
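For reference, a minimal sketch of the relist decision, mirroring client-go's relistResourceVersion(); the class, field, and method names below are illustrative and assume the fix described above, not the actual ReflectorRunnable implementation:

```java
public class RelistSketch {

  // Illustrative state, mirroring what the reflector tracks.
  private volatile boolean isLastSyncResourceVersionUnavailable = false;
  private volatile String lastSyncResourceVersion = "";

  // Decides which resourceVersion the next list call should use.
  String relistResourceVersion() {
    if (isLastSyncResourceVersionUnavailable) {
      // resourceVersion="" forces a consistent (quorum) read from etcd, so the
      // informer recovers even when its cached resource version is unusable.
      return "";
    }
    if (lastSyncResourceVersion == null || lastSyncResourceVersion.isEmpty()) {
      // resourceVersion="0" lets the API server serve the initial list from its
      // watch cache, which may be slightly stale.
      return "0";
    }
    return lastSyncResourceVersion;
  }
}
```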
Why is this exception returning 504 (timeout) instead of 410 (gone)?
The current code is only looking at the status code: https://github.com/kubernetes-client/java/blob/master/util/src/main/java/io/kubernetes/client/informer/cache/ReflectorRunnable.java#L172
While we could do more here, I'm confused about why this is a timeout HTTP status code.
@brendandburns Interestingly, the expected status code is indeed 504 here: https://github.com/kubernetes/kubernetes/blob/f14d1c9b1ef2b3b332d6b83d10da27fe3855acad/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L402. I will add a check for handling this kind of error.
https://github.com/kubernetes/kubernetes/pull/94316 is the golang version of the fix we're looking for.
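A hedged sketch of what the analogous check could look like on the Java side, reusing the hypothetical isResourceVersionTooLarge helper from above; this illustrates the approach of the golang fix, it is not the actual ReflectorRunnable code:

```java
import io.kubernetes.client.openapi.ApiException;
import java.net.HttpURLConnection;

public class RelistErrorHandling {

  private volatile boolean isLastSyncResourceVersionUnavailable = false;

  // Returns true if the reflector should mark its last-sync resource version
  // as unavailable and relist with a fresh (quorum-read) resource version.
  boolean shouldRelistWithFreshResourceVersion(ApiException e) {
    // 410 Gone: the requested resource version is too old and has been compacted away.
    boolean expired = e.getCode() == HttpURLConnection.HTTP_GONE;
    // 504 with a ResourceVersionTooLarge cause: the requested resource version is
    // ahead of what this API server / etcd instance has seen.
    boolean tooLarge = ResourceVersionTooLargeDetector.isResourceVersionTooLarge(e);
    if (expired || tooLarge) {
      isLastSyncResourceVersionUnavailable = true;
      return true;
    }
    return false;
  }
}
```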
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@yue9944882 What's the current status of this? It looks like it got closed without being fixed?
BTW, in answer to the question from @brendandburns above:
Why is this exception returning 504 (timeout) instead of 410 (gone)
I can only guess, but I think when the requested RV is larger than what the API server is currently aware of, returning "gone" was probably considered wrong, since "gone" implies "it was here, but it's not here any more." That's different from this situation, where we need a response that says "it's not here yet."
Still relevant @yue9944882
Is there a solution to this problem? It seems to have been closed without being addressed.