kubernetes icon indicating copy to clipboard operation
kubernetes copied to clipboard

Throttled response from API server should not include misleading cause information

Open theunrepentantgeek opened this issue 4 years ago • 11 comments

What happened?

When the Kubernetes API server returns a 429 (throttled), the response may include misleading cause information.

We observed that a pod eviction API call received a response that included both of the following fragments:

"responseStatus": {
  "metadata": {},
  "status": "Failure",
  "reason": "TooManyRequests",
  "code": 429
},

and

"responseObject": {
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Cannot evict pod as it would violate the pod's disruption budget.",
  "reason": "TooManyRequests",
  "details": {
    "causes": [
      {
        "reason": "DisruptionBudget",
        "message": "The disruption budget [elided] needs 11 healthy pods and has 12 currently"
      }
    ]
  },
  "code": 429
},

Note that the message given above is confusing, as there are more currently healthy pods than the required number.

What did you expect to happen?

When a request is rejected due to throttling (429 HTTP response), either

  • no other error information should be included; or
  • the included error information should indicate throttling

How can we reproduce it (as minimally and precisely as possible)?

Provoke a 429 HTTP response and inspect the response.

Anything else we need to know?

Issue #88535 seems to be similar - it also describes a 429 response where inclusion of a cause resulted in confusion.

Kubernetes version

Version: 1.20.9

Cloud provider

Azure Kubernetes Service

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

theunrepentantgeek avatar Nov 10 '21 01:11 theunrepentantgeek

@theunrepentantgeek: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 10 '21 01:11 k8s-ci-robot

/sig api-machinery

theunrepentantgeek avatar Nov 10 '21 01:11 theunrepentantgeek

/remove-sig api-machinery /sig apps for PDB

liggitt avatar Nov 12 '21 21:11 liggitt

@liggitt , are you sure?

As far as I've been able to work out, the 429 HTTP response is correct and the PDB error is a red herring.

The PDB error we observed doesn't even make sense, given the number of healthy pods (12) was more than the needed (11):

The disruption budget [elided] needs 11 healthy pods and has 12 currently

(My experience is that I spent a bunch of time digging into possible causes of the PDB error before realizing that I'd been ignoring the HTTP response code - and once I started paying attention to that, everything started to make sense. Of course, you understand this area far better, I just don't want you to fall into the same trap that I did!)

theunrepentantgeek avatar Nov 12 '21 22:11 theunrepentantgeek

any message coming back with PDB language in it belongs to sig-apps

liggitt avatar Nov 12 '21 22:11 liggitt

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 10 '22 23:02 k8s-triage-robot

/remove-lifecycle stale

theunrepentantgeek avatar Feb 11 '22 06:02 theunrepentantgeek

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 12 '22 07:05 k8s-triage-robot

/remove-lifecycle stale

theunrepentantgeek avatar May 12 '22 09:05 theunrepentantgeek

/remove-lifecycle stale

theunrepentantgeek avatar May 12 '22 21:05 theunrepentantgeek

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 10 '22 22:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 09 '22 22:09 k8s-triage-robot

/remove-lifecycle rotten

theunrepentantgeek avatar Sep 09 '22 23:09 theunrepentantgeek

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 08 '22 23:12 k8s-triage-robot

/remove-lifecycle stale

theunrepentantgeek avatar Dec 12 '22 01:12 theunrepentantgeek

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 12 '23 02:03 k8s-triage-robot

/remove-lifecycle stale

theunrepentantgeek avatar Mar 12 '23 09:03 theunrepentantgeek

/triage accepted /priority important-longterm

It looks like eviction has some creative uses of TooManyRequests errors. Why is this conflict error converted to TooManyRequests?

https://github.com/kubernetes/kubernetes/blob/786316f0b6bb9078cd564ebf5401bb2e9ac7f2a2/pkg/registry/core/pod/storage/eviction.go#L294-L305

Also this observed generation mismatch (stale cache?):

https://github.com/kubernetes/kubernetes/blob/786316f0b6bb9078cd564ebf5401bb2e9ac7f2a2/pkg/registry/core/pod/storage/eviction.go#L419-L422

Violating the disruption budget is more debatable, but I still don't think a 429 error is right:

https://github.com/kubernetes/kubernetes/blob/786316f0b6bb9078cd564ebf5401bb2e9ac7f2a2/pkg/registry/core/pod/storage/eviction.go#L429-L433

tallclair avatar Mar 21 '23 21:03 tallclair

It looks like eviction has some creative uses of TooManyRequests errors. Why is this conflict error converted to TooManyRequests?

that was added in https://github.com/kubernetes/kubernetes/pull/94381/files#r507982781 with some discussion there

I think the short answer is we wanted to drive an automatic retry (which the server returning TooManyRequests with a RetryAfter does), and returning a conflict when the user did not set a precondition on the request did not seem correct

liggitt avatar Mar 21 '23 21:03 liggitt

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Mar 20 '24 22:03 k8s-triage-robot