Informer ReflectorRunnable doesn't recover from "Too large resource version"
Describe the bug
Informer ReflectorRunnable doesn't recover from "Too large resource version".
ApiException.getResponseBody() returns something like:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Timeout: Too large resource version: 5399771, current: 5399139","reason":"Timeout","details":{"causes":[{"reason":"ResourceVersionTooLarge","message":"Too large resource version"}],"retryAfterSeconds":1},"code":504}
Based on my understanding, this can happen on a list call when the resourceVersion parameter is too large for that API server / etcd instance.
In this case we should set isLastSyncResourceVersionUnavailable = true;.
See go client: https://github.com/kubernetes/client-go/commit/ec46b97af413cab12270fccbc09bcc69c63e372e
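A minimal sketch of how that cause could be detected from the ApiException, assuming the client's generated JSON helper and V1Status model; the class and helper name below are illustrative, not existing informer code:

```java
import io.kubernetes.client.openapi.ApiException;
import io.kubernetes.client.openapi.JSON;
import io.kubernetes.client.openapi.models.V1Status;
import io.kubernetes.client.openapi.models.V1StatusCause;
import java.util.List;

public class ResourceVersionTooLargeDetector {

  // Hypothetical helper (not part of the client): returns true if the
  // ApiException carries a "ResourceVersionTooLarge" cause, as in the
  // response body shown above.
  static boolean isResourceVersionTooLarge(ApiException e) {
    String body = e.getResponseBody();
    if (body == null) {
      return false;
    }
    try {
      V1Status status = new JSON().deserialize(body, V1Status.class);
      if (status == null
          || status.getDetails() == null
          || status.getDetails().getCauses() == null) {
        return false;
      }
      List<V1StatusCause> causes = status.getDetails().getCauses();
      return causes.stream().anyMatch(c -> "ResourceVersionTooLarge".equals(c.getReason()));
    } catch (RuntimeException parseFailure) {
      // The body is not a structured Status object; treat it as some other failure.
      return false;
    }
  }
}
```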
Client Version
All versions, including 13 and 14.
Kubernetes Version
1.21
Java Version n/a
To Reproduce
Run an informer and make the API server return ResourceVersionTooLarge.
Expected behavior
Set isLastSyncResourceVersionUnavailable and redo a consistent read from etcd (see the relist sketch below).
KubeConfig n/a
Server (please complete the following information): n/a
Additional context n/a
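For reference, a minimal sketch of the relist decision, mirroring client-go's relistResourceVersion(); the class, field, and method names below are illustrative and assume the fix described above, not the actual ReflectorRunnable implementation:

```java
public class RelistSketch {

  // Illustrative state, mirroring what the reflector tracks.
  private volatile boolean isLastSyncResourceVersionUnavailable = false;
  private volatile String lastSyncResourceVersion = "";

  // Decides which resourceVersion the next list call should use.
  String relistResourceVersion() {
    if (isLastSyncResourceVersionUnavailable) {
      // resourceVersion="" forces a consistent (quorum) read from etcd, so the
      // informer recovers even when its cached resource version is unusable.
      return "";
    }
    if (lastSyncResourceVersion == null || lastSyncResourceVersion.isEmpty()) {
      // resourceVersion="0" lets the API server serve the initial list from its
      // watch cache, which may be slightly stale.
      return "0";
    }
    return lastSyncResourceVersion;
  }
}
```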
Why is this exception returning 504 (timeout) instead of 410 (gone)?
The current code is only looking at the status code: https://github.com/kubernetes-client/java/blob/master/util/src/main/java/io/kubernetes/client/informer/cache/ReflectorRunnable.java#L172
While we could do more here, I'm confused about why this is a timeout HTTP status code.
@brendandburns Interestingly, the expected status code is indeed 504 here: https://github.com/kubernetes/kubernetes/blob/f14d1c9b1ef2b3b332d6b83d10da27fe3855acad/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L402. I will add a check for handling this kind of error.
https://github.com/kubernetes/kubernetes/pull/94316 is the golang version of the fix we're looking for.
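A hedged sketch of what the analogous check could look like on the Java side, reusing the hypothetical isResourceVersionTooLarge helper from above; this illustrates the approach of the golang fix, it is not the actual ReflectorRunnable code:

```java
import io.kubernetes.client.openapi.ApiException;
import java.net.HttpURLConnection;

public class RelistErrorHandling {

  private volatile boolean isLastSyncResourceVersionUnavailable = false;

  // Returns true if the reflector should mark its last-sync resource version
  // as unavailable and relist with a fresh (quorum-read) resource version.
  boolean shouldRelistWithFreshResourceVersion(ApiException e) {
    // 410 Gone: the requested resource version is too old and has been compacted away.
    boolean expired = e.getCode() == HttpURLConnection.HTTP_GONE;
    // 504 with a ResourceVersionTooLarge cause: the requested resource version is
    // ahead of what this API server / etcd instance has seen.
    boolean tooLarge = ResourceVersionTooLargeDetector.isResourceVersionTooLarge(e);
    if (expired || tooLarge) {
      isLastSyncResourceVersionUnavailable = true;
      return true;
    }
    return false;
  }
}
```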
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@yue9944882 What's the current status of this? It looks like it got closed without being fixed?
BTW, in answer to the question from @brendandburns above:
Why is this exception returning 504 (timeout) instead of 410 (gone)
I can only guess, but I think when the requested RV is larger than what the API server is currently aware of, returning "gone" was probably considered wrong, since "gone" implies "it was here, but it's not here any more." That's different from this situation, where we need a response that says "it's not here yet."
Still relevant @yue9944882
Is there a solution to this problem? It seems to have been closed without being addressed.