
Kubernetes client rate-limiting


What happened?

I was running the Kubernetes provider in a debugger and attaching to it with PULUMI_DEBUG_PROVIDERS. I reused the same provider process for numerous deployments, and eventually it transitioned to a failure state, apparently due to client-side rate limiting. Restarting the provider process fixed the problem.

I decided to file an issue because, though my specific case is exotic, there might be a deeper scalability problem in the provider related to rate-limiting in the kube client. See https://github.com/kubernetes/kubernetes/issues/111880 for more background.
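
As background: client-go throttles every request through a client-side token-bucket limiter whose ceiling comes from the QPS and Burst fields on rest.Config. Here's a minimal sketch of where those knobs live (illustrative values, not the provider's actual configuration; it assumes a kubeconfig at the default location):

package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// Client-side throttling is governed by these two fields; when left at
	// zero, client-go falls back to its defaults (5 QPS, burst 10), which a
	// long-lived process serving many deployments can exhaust.
	config.QPS = 50
	config.Burst = 100

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	_ = clientset
	fmt.Printf("configured client with QPS=%v burst=%v\n", config.QPS, config.Burst)
}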

Diagnostics:
  kubernetes:apps/v1:Deployment (deployment):
    error: update of resource "urn:pulumi:dev::issue-xyz::kubernetes:apps/v1:Deployment::deployment" failed 
    because the Kubernetes API server reported that it failed to fully initialize or become live: 
    client rate limiter Wait returned an error: context canceled

  pulumi:pulumi:Stack (issue-xyz-dev):
    error: update failed

Here's the update made just prior to the first rate-limit error. I'd deliberately used an invalid image nginxfoo.

Diagnostics:
  kubernetes:apps/v1:Deployment (deployment):
    warning: Refreshed resource is in an unhealthy state:
    * Resource 'mydeployment' was created but failed to initialize
    * Minimum number of Pods to consider the application live was not attained
    * [Pod eron/mydeployment-65df56c569-dnqzh]: containers with unready status: [nginx]
    error: update of resource "urn:pulumi:dev::issue-2455::kubernetes:apps/v1:Deployment::deployment" failed because the Kubernetes API server reported that it failed to fully initialize or become live: Resource operation was cancelled for "mydeployment"

Example

name: issue-2942
runtime: yaml
description: A minimal Kubernetes Pulumi YAML program
config:
  pulumi:tags:
    value:
      pulumi:template: kubernetes-yaml
outputs:
  name: ${deployment.metadata.name}
resources:
  deployment:
    properties:
      metadata:
        name: mydeployment
      spec:
        replicas: 1
        selector:
          matchLabels: ${appLabels}
        template:
          metadata:
            labels: ${appLabels}
          spec:
            containers:
            - image: nginx
              name: nginx
              env:
              - name: DEMO_GREETING
                value: "16"
    type: kubernetes:apps/v1:Deployment
variables:
  appLabels:
    app: nginx


Output of pulumi about

CLI          
Version      3.108.1
Go Version   go1.22.0
Go Compiler  gc

Plugins
NAME        VERSION
kubernetes  unknown
yaml        unknown

Host     
OS       darwin
Version  14.4.1
Arch     arm64

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

EronWright · Apr 11 '24 18:04

Here's what happened in my case: the provider was sent a Cancel RPC, which canceled the provider's internal context. On subsequent requests, the kube client's rate limiter is the first code path to hit the canceled context.
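
A minimal demonstration of that failure mode, using the same token-bucket limiter type that client-go uses for client-side throttling (illustrative QPS/burst values):

package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// The token-bucket limiter client-go uses for client-side throttling.
	limiter := flowcontrol.NewTokenBucketRateLimiter(5, 10)

	// Simulate the provider's internal context having been canceled by a
	// Cancel RPC in a long-lived, reused provider process.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	// Every subsequent request blocks on Wait first, which fails once the
	// context is canceled; client-go wraps that error as
	// "client rate limiter Wait returned an error: ...", matching the
	// diagnostics above.
	if err := limiter.Wait(ctx); err != nil {
		fmt.Println("Wait error:", err)
	}
}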

Two possible follow-ups:

  1. Double-check the QPS settings.
  2. Teach the provider to reset the cancellation signal when it receives a Configure RPC (a rough sketch follows the list).
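
Purely as illustration of option 2 (this is hypothetical, not the provider's actual code, Configure/Cancel signatures, or field names): the provider could mint a fresh cancelable context on every Configure, so a prior Cancel RPC does not poison later operations in a reused process.

package main

import "context"

type kubeProvider struct {
	ctx    context.Context    // context used for Kubernetes API calls
	cancel context.CancelFunc // invoked by the Cancel RPC
}

// Configure resets the cancellation signal so a previously delivered Cancel
// no longer applies to the newly configured provider instance.
func (p *kubeProvider) Configure() error {
	if p.cancel != nil {
		p.cancel() // release any prior context
	}
	p.ctx, p.cancel = context.WithCancel(context.Background())
	return nil
}

// Cancel cancels all in-flight operations derived from p.ctx.
func (p *kubeProvider) Cancel() error {
	if p.cancel != nil {
		p.cancel()
	}
	return nil
}

func main() {
	p := &kubeProvider{}
	_ = p.Configure() // initial configure
	_ = p.Cancel()    // debug session sends Cancel
	_ = p.Configure() // a later Configure restores a usable context
	_ = p.ctx.Err()   // nil again after reconfiguration
}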

The low-level throttling code is here:

https://github.com/kubernetes/client-go/blob/46588f2726fa3e25b1704d6418190f424f95a990/rest/request.go#L986-L991

EronWright · Apr 11 '24 19:04

Is there another alternative where we generously bump the QPS ceiling if running under debug? A quick workaround like that might be prudent if this is impacting the debug loop but not end-users.
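
Something like the sketch below, maybe, though whether the provider process can reliably detect "running under debug" (e.g. via the PULUMI_DEBUG_PROVIDERS environment variable, which may only be set on the engine side) is an open question; the helper name and numbers are placeholders, not proposed defaults.

package main

import (
	"fmt"
	"os"

	"k8s.io/client-go/rest"
)

// bumpQPSForDebug is a hypothetical helper: when the process looks like it is
// part of a debug session, raise the client-side ceiling so a long-lived,
// reused provider does not trip the rate limiter during the debug loop.
func bumpQPSForDebug(config *rest.Config) {
	if os.Getenv("PULUMI_DEBUG_PROVIDERS") != "" {
		config.QPS = 500
		config.Burst = 1000
	}
}

func main() {
	config := &rest.Config{}
	bumpQPSForDebug(config)
	fmt.Printf("QPS=%v burst=%v\n", config.QPS, config.Burst)
}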

Related: https://github.com/pulumi/pulumi-kubernetes/pull/1748

blampe · Apr 12 '24 15:04