capi: fix error when listing infra machines

Open jackfrancis opened this issue 1 year ago • 10 comments

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR affects the CAPI provider of cluster-autoscaler only.

Here we check that we have the proper permissions to list an infra object before we bootstrap the informer factory. Doing the latter has error handling side-effects that we don't want:

  • https://github.com/kubernetes/client-go/blob/v0.29.0-alpha.3/tools/cache/reflector.go#L146-L148

Specifically, the error short-circuits the process and causes a pod restart, which ultimately deadlocks the cluster-autoscaler reconciler.
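
To illustrate the ordering described above, here is a minimal, stdlib-only sketch of the "probe first, then start the informer" idea. It is not the code in this PR and does not use the real client-go API: `resourceLister`, `fakeLister`, `informerRegistry`, `maybeStartInformer`, and the resource names are all hypothetical stand-ins chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden is a hypothetical stand-in for an RBAC "forbidden" response.
var errForbidden = errors.New("forbidden")

// resourceLister is a hypothetical abstraction over a client that can list
// objects of one infra resource kind. It is not a client-go interface.
type resourceLister interface {
	List(resource string) ([]string, error)
}

// fakeLister simulates an API server that only permits listing some resources.
type fakeLister struct {
	allowed map[string]bool
}

func (f fakeLister) List(resource string) ([]string, error) {
	if !f.allowed[resource] {
		return nil, fmt.Errorf("list %s: %w", resource, errForbidden)
	}
	return []string{resource + "-0"}, nil
}

// informerRegistry records the resources for which an informer was started.
// In a real controller this role would be played by an informer factory.
type informerRegistry struct {
	started []string
}

// maybeStartInformer issues a single probing List call and registers an
// informer only when the probe succeeds. On failure the error is returned to
// the caller, so it is handled here rather than inside a background watch loop.
func maybeStartInformer(l resourceLister, reg *informerRegistry, resource string) error {
	if _, err := l.List(resource); err != nil {
		return fmt.Errorf("skipping informer for %q: %w", resource, err)
	}
	reg.started = append(reg.started, resource)
	return nil
}

func main() {
	lister := fakeLister{allowed: map[string]bool{"machinetemplates": true}}
	reg := &informerRegistry{}

	for _, resource := range []string{"machinetemplates", "restrictedinfra"} {
		if err := maybeStartInformer(lister, reg, resource); err != nil {
			fmt.Println("warning:", err)
		}
	}
	fmt.Println("informers started for:", reg.started)
}
```

The design choice sketched here is that a failed probe becomes an ordinary returned error the caller can log or ignore, while the informer for that resource is simply never started.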

Which issue(s) this PR fixes:

Fixes #6490

Special notes for your reviewer:

Does this PR introduce a user-facing change?

capi cluster-autoscaler: fix error when listing infra machines

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


jackfrancis avatar Feb 02 '24 00:02 jackfrancis

this makes good sense to me. as to your question on slack about caching, i don't think this would have an impact as if the conditions changed to allow the autoscaler to read the infrastructure object, then it would start caching.

elmiko avatar Feb 02 '24 01:02 elmiko

i still like the change here, but something odd is happening with the tests

elmiko avatar Feb 02 '24 17:02 elmiko

@elmiko - looks like you're reviewing this one already, so assigning to you so it doesn't show up as unattended.

/assign @elmiko

x13n avatar Feb 06 '24 10:02 x13n

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 06 '24 22:05 k8s-triage-robot

@jackfrancis should we keep this open?

/remove-lifecycle stale

elmiko avatar May 07 '24 14:05 elmiko

just adding a comment, i'm trying to take a look into the unit test failures.

elmiko avatar Jun 27 '24 12:06 elmiko

i'm baffled by the behavior here, going to need more time to understand how the clientset interaction is affecting this.

elmiko avatar Jun 27 '24 12:06 elmiko

@elmiko thanks for taking a look at this while I've been away :)

lemme know if you discover anything, happy to hop onto a zoom and pair this out as well if that helps

jackfrancis avatar Jun 27 '24 23:06 jackfrancis

i'll keep poking at it and ask some colleagues, but i will be off on vacation next week so i probably won't get back to this until the second week of july.

elmiko avatar Jun 28 '24 08:06 elmiko

/lifecycle stale

k8s-triage-robot avatar Sep 26 '24 08:09 k8s-triage-robot

apologies @jackfrancis , this fell off my radar

/remove-lifecycle stale

elmiko avatar Sep 26 '24 13:09 elmiko

/lifecycle stale

k8s-triage-robot avatar Dec 30 '24 20:12 k8s-triage-robot

@jackfrancis i have completely lost track of this issue. is this something we should continue to play kick-the-can with?

fwiw, i was never able to get answers to the questions i had internally.

elmiko avatar Jan 09 '25 14:01 elmiko

@elmiko I rebased this as an excuse to put this back on my radar, like you I haven't thought about this in many months :/

jackfrancis avatar Jan 14 '25 00:01 jackfrancis

/remove-lifecycle stale

Shubham82 avatar Jan 23 '25 10:01 Shubham82

/lifecycle stale

k8s-triage-robot avatar Jul 02 '25 18:07 k8s-triage-robot

/remove-lifecycle stale

@jackfrancis i think we still want this?

elmiko avatar Jul 02 '25 19:07 elmiko

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Jul 03 '25 20:07 k8s-ci-robot

@elmiko thanks for the nudge, now let's see if we can figure out the UT situation before it goes stale again :)

jackfrancis avatar Jul 03 '25 22:07 jackfrancis

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 12 '25 23:08 k8s-ci-robot

/lifecycle stale

k8s-triage-robot avatar Nov 11 '25 00:11 k8s-triage-robot