Informer ReflectorRunnable doesn't recover from "Too large resource version"

Open haoming-db opened this issue 3 years ago • 4 comments

Describe the bug Informer ReflectorRunnable doesn't recover from "Too large resource version". ApiException.getResponseBody() returns a body like:

{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Timeout: Too large resource version: 5399771, current: 5399139","reason":"Timeout","details":{"causes":[{"reason":"ResourceVersionTooLarge","message":"Too large resource version"}],"retryAfterSeconds":1},"code":504}

Based on my understanding, this can happen on a list call when the resource version parameter is larger than the latest version known to that API server / etcd instance. In this case we should set isLastSyncResourceVersionUnavailable = true. See the Go client: https://github.com/kubernetes/client-go/commit/ec46b97af413cab12270fccbc09bcc69c63e372e
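
As a rough illustration of the check being asked for, here is a minimal sketch that recognises this error from the status code plus the response body. The class and method names are hypothetical and are not part of the kubernetes-client/java API. It uses a plain regex on the raw body to stay dependency-free; a real fix would more likely deserialize the Status object and inspect details.causes.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of kubernetes-client/java.
final class TooLargeResourceVersionCheck {
    // Matches the cause entry seen in the response body quoted above.
    // Assumption: the cause is serialized as "reason":"ResourceVersionTooLarge".
    private static final Pattern CAUSE =
        Pattern.compile("\"reason\"\\s*:\\s*\"ResourceVersionTooLarge\"");

    private TooLargeResourceVersionCheck() {}

    // Returns true only for a 504 whose body carries the ResourceVersionTooLarge cause.
    static boolean isTooLargeResourceVersion(int httpCode, String responseBody) {
        if (httpCode != 504 || responseBody == null) {
            return false;
        }
        return CAUSE.matcher(responseBody).find();
    }
}
```

A caller would pass ApiException.getCode() and ApiException.getResponseBody() to this method. A plain 504 without the cause entry is deliberately not matched, so ordinary gateway timeouts are not treated as a stale resource version.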

Client Version All versions including 13 and 14.

Kubernetes Version 1.21

Java Version n/a

To Reproduce Run an informer and make the API server return ResourceVersionTooLarge.

Expected behavior Set isLastSyncResourceVersionUnavailable so that the next list does a consistent read from etcd.
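
To make the expected behavior concrete, here is a minimal sketch of the relist state handling, modeled on how the Go client's reflector picks the resource version for the next list. The class and method names are hypothetical and this is not the actual ReflectorRunnable code; the "" (consistent read) and "0" (any cached version) conventions are the standard resourceVersion semantics.

```java
// Hypothetical model of the reflector's relist state; not the real ReflectorRunnable.
final class RelistState {
    private String lastSyncResourceVersion = "";
    private boolean lastSyncResourceVersionUnavailable = false;

    // A successful list records the new version and clears the unavailable flag.
    void onListSuccess(String listResourceVersion) {
        lastSyncResourceVersion = listResourceVersion;
        lastSyncResourceVersionUnavailable = false;
    }

    // Marks the last synced version unusable on 410 (expired) or on
    // 504 carrying the ResourceVersionTooLarge cause.
    void onListFailure(int httpCode, boolean hasTooLargeCause) {
        boolean expired = httpCode == 410;
        boolean tooLarge = httpCode == 504 && hasTooLargeCause;
        if (expired || tooLarge) {
            lastSyncResourceVersionUnavailable = true;
        }
    }

    // "" asks for a consistent read, "0" accepts any cached version.
    String relistResourceVersion() {
        if (lastSyncResourceVersionUnavailable) {
            return "";
        }
        if (lastSyncResourceVersion.isEmpty()) {
            return "0";
        }
        return lastSyncResourceVersion;
    }
}
```

With this shape, a list that fails with the "Too large resource version" error is retried with an empty resource version, and the next successful list puts the reflector back on the normal path.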

KubeConfig n/a

Server (please complete the following information): n/a

Additional context n/a

haoming-db avatar Mar 29 '22 21:03 haoming-db

Why is this exception returning 504 (timeout) instead of 410 (gone)?

The current code is only looking at the status code: https://github.com/kubernetes-client/java/blob/master/util/src/main/java/io/kubernetes/client/informer/cache/ReflectorRunnable.java#L172

While we could do more here, I'm confused about why this is a timeout HTTP status code.

brendandburns avatar Mar 29 '22 22:03 brendandburns

@brendandburns Interestingly, the expected status code is indeed 504 here: https://github.com/kubernetes/kubernetes/blob/f14d1c9b1ef2b3b332d6b83d10da27fe3855acad/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L402. I will add a check for handling this kind of error.

yue9944882 avatar Mar 30 '22 18:03 yue9944882

This is the Go version of the fix we're looking for: https://github.com/kubernetes/kubernetes/pull/94316

yue9944882 avatar Mar 30 '22 18:03 yue9944882

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 18 '22 17:07 k8s-triage-robot

/lifecycle rotten

k8s-triage-robot avatar Aug 17 '22 18:08 k8s-triage-robot

/close

k8s-triage-robot avatar Sep 16 '22 18:09 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

k8s-ci-robot avatar Sep 16 '22 18:09 k8s-ci-robot

@yue9944882 What's the current status of this? It looks like it got closed without being fixed?

JohnRusk avatar Jan 30 '23 07:01 JohnRusk

BTW, in answer to the question from brendandburns above:

Why is this exception returning 504 (timeout) instead of 410 (gone)?

I can only guess, but I think that when the requested RV is larger than what the API server is currently aware of, returning "gone" was probably considered wrong, since "gone" implies "it was here, but it's not here any more". That's different from this situation, where we need a response that says "it's not here yet".

JohnRusk avatar Jan 30 '23 07:01 JohnRusk

Still relevant @yue9944882

vitality411 avatar Feb 27 '23 06:02 vitality411

Is there a solution to this problem? It seems to have been closed without being addressed.

weihubeats avatar May 05 '23 03:05 weihubeats