kubernetes/kubernetes
Throttled response from API server should not include misleading cause information
What happened?
When the Kubernetes API server returns a 429 (throttled), the response may include misleading cause information.
We observed that a pod eviction API call received a response that included both of the following fragments:
"responseStatus": {
  "metadata": {},
  "status": "Failure",
  "reason": "TooManyRequests",
  "code": 429
},
and
"responseObject": {
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Cannot evict pod as it would violate the pod's disruption budget.",
  "reason": "TooManyRequests",
  "details": {
    "causes": [
      {
        "reason": "DisruptionBudget",
        "message": "The disruption budget [elided] needs 11 healthy pods and has 12 currently"
      }
    ]
  },
  "code": 429
},
Note that the message given above is confusing: there are more currently healthy pods (12) than the required number (11).
What did you expect to happen?
When a request is rejected due to throttling (429 HTTP response), either
- no other error information should be included; or
- the included error information should indicate throttling
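For illustration, the second option is roughly what the stock apimachinery helper already produces: a Status whose reason and message both point at throttling, carrying a retry hint and no budget-specific causes. (A hedged Go sketch; the message text and retry seconds are placeholders.)

package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

func main() {
	// A plain throttling Status: code 429, reason TooManyRequests,
	// a RetryAfterSeconds hint in the details, and no causes attached.
	err := apierrors.NewTooManyRequests("the request was throttled, try again later", 10)
	fmt.Printf("code=%d reason=%s details=%+v\n",
		err.ErrStatus.Code, err.ErrStatus.Reason, err.ErrStatus.Details)
}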
How can we reproduce it (as minimally and precisely as possible)?
Provoke a 429 HTTP response and inspect the response.
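For example, something along the lines of the following Go sketch exercises the eviction path and prints the Status attached to the 429. (Assumptions: a kubeconfig-based client, and a pod named example-pod in the default namespace protected by a PodDisruptionBudget that currently disallows disruption; those names are placeholders, not the workload we observed.)

package main

import (
	"context"
	"fmt"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Attempt to evict a pod whose PDB currently disallows disruption.
	eviction := &policyv1beta1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: "example-pod", Namespace: "default"},
	}
	err = clientset.CoreV1().Pods("default").Evict(context.TODO(), eviction)

	// When the server answers 429, the attached Status carries the cause text
	// that shows up in the audit log fragments above.
	if apierrors.IsTooManyRequests(err) {
		if statusErr, ok := err.(*apierrors.StatusError); ok {
			fmt.Printf("code=%d reason=%s message=%q\n",
				statusErr.ErrStatus.Code,
				statusErr.ErrStatus.Reason,
				statusErr.ErrStatus.Message)
		}
	}
}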
Anything else we need to know?
Issue #88535 seems to be similar - it also describes a 429 response where inclusion of a cause resulted in confusion.
Kubernetes version
Version: 1.20.9
Cloud provider
Azure Kubernetes Service
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
@theunrepentantgeek: This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig api-machinery
/remove-sig api-machinery /sig apps for PDB
@liggitt, are you sure?
As far as I've been able to work out, the 429 HTTP response is correct and the PDB error is a red herring.
The PDB error we observed doesn't even make sense, given the number of healthy pods (12) was more than the needed (11):
The disruption budget [elided] needs 11 healthy pods and has 12 currently
(My experience is that I spent a bunch of time digging into possible causes of the PDB error before realizing that I'd been ignoring the HTTP response code - and once I started paying attention to that, everything started to make sense. Of course, you understand this area far better; I just don't want you to fall into the same trap that I did!)
any message coming back with PDB language in it belongs to sig-apps
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/triage accepted /priority important-longterm
It looks like eviction has some creative uses of TooManyRequests errors. Why is this conflict error converted to TooManyRequests?
https://github.com/kubernetes/kubernetes/blob/786316f0b6bb9078cd564ebf5401bb2e9ac7f2a2/pkg/registry/core/pod/storage/eviction.go#L294-L305
Also this observed generation mismatch (stale cache?):
https://github.com/kubernetes/kubernetes/blob/786316f0b6bb9078cd564ebf5401bb2e9ac7f2a2/pkg/registry/core/pod/storage/eviction.go#L419-L422
Violating the disruption budget is more debatable, but I still don't think a 429 error is right:
https://github.com/kubernetes/kubernetes/blob/786316f0b6bb9078cd564ebf5401bb2e9ac7f2a2/pkg/registry/core/pod/storage/eviction.go#L429-L433
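For context, a Status shaped like the one in the audit log above can be assembled with the apimachinery error helpers roughly as follows (an illustrative sketch only, not the linked eviction.go source; buildEvictionRejection, the PDB name, and the pod counts are placeholders taken from the observed message):

package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func buildEvictionRejection(pdbName string, desiredHealthy, currentHealthy int32) error {
	// Start from a standard 429: code 429, reason TooManyRequests, RetryAfterSeconds set.
	err := apierrors.NewTooManyRequests(
		"Cannot evict pod as it would violate the pod's disruption budget.", 10)

	// Defensive: make sure Details exists before attaching causes.
	if err.ErrStatus.Details == nil {
		err.ErrStatus.Details = &metav1.StatusDetails{}
	}

	// The PDB-specific explanation rides along as a cause, which is what surfaces
	// in responseObject.details.causes in the audit log fragment above.
	err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{
		Type: metav1.CauseType("DisruptionBudget"),
		Message: fmt.Sprintf("The disruption budget %s needs %d healthy pods and has %d currently",
			pdbName, desiredHealthy, currentHealthy),
	})
	return err
}

func main() {
	fmt.Println(buildEvictionRejection("example-pdb", 11, 12))
}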
It looks like eviction has some creative uses of TooManyRequests errors. Why is this conflict error converted to TooManyRequests?
that was added in https://github.com/kubernetes/kubernetes/pull/94381/files#r507982781 with some discussion there
I think the short answer is we wanted to drive an automatic retry (which the server returning TooManyRequests with a RetryAfter does), and returning a conflict when the user did not set a precondition on the request did not seem correct
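For illustration, the caller-side behavior being described looks roughly like this (a hedged sketch; evictOnce stands in for whatever issues the eviction request): apierrors.SuggestsClientDelay surfaces the RetryAfterSeconds carried by the 429 Status, so the client backs off and retries rather than treating the response as a hard failure.

package main

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// evictWithRetry retries only on 429/TooManyRequests, honoring the server's retry hint.
func evictWithRetry(evictOnce func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = evictOnce()
		if err == nil || !apierrors.IsTooManyRequests(err) {
			return err
		}
		// SuggestsClientDelay reads Details.RetryAfterSeconds from the Status, if present.
		if seconds, ok := apierrors.SuggestsClientDelay(err); ok {
			time.Sleep(time.Duration(seconds) * time.Second)
		} else {
			time.Sleep(time.Second)
		}
	}
	return err
}

func main() {
	// Stand-in evictOnce that always succeeds, just to show the call shape.
	_ = evictWithRetry(func() error { return nil }, 3)
}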
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted