cluster-api-provider-packet
All machines in Failed status
What steps did you take and what happened:
While working on removing reserved hardware from a MachineDeployment, I was deleting machines while keeping the replica count the same. At some point I was apparently rate-limited and started getting 403s back from the API, which caused all of the machines to show as Failed. I seem to be unable to get the machines back into a healthy state because the Packet provider is skipping reconciliation for them.
As a side note, there is no way to reduce the replica count without other machines being deleted, which makes removing specific reserved hardware difficult. The MachineDeployment selects machines to delete based only on the Random, Oldest, or Newest strategies, regardless of whether unprovisioned machines could be removed instead.
What did you expect to happen: I expected to be able to delete the machines while cluster-api at least remained stable.
Anything else you would like to add: I assume this happened because we have hundreds of reserved hardware IDs and the provider was making API requests for them. Perhaps storing the reservation ID in the Status of each PacketMachine and checking that first before making an API call would reduce requests.
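A minimal sketch of that suggestion, assuming a cached reservation field on the status (the PacketMachineStatus stand-in and its HardwareReservationID field below are hypothetical, not the current CAPP API):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// PacketMachineStatus is a trimmed, hypothetical stand-in for the CAPP status
// struct; the real PacketMachine status does not currently record the chosen
// hardware reservation.
type PacketMachineStatus struct {
	HardwareReservationID string
}

// resolveReservation returns the cached reservation ID when present and only
// falls back to the (expensive) API lookup when the cache is empty.
func resolveReservation(ctx context.Context, status *PacketMachineStatus,
	lookup func(context.Context) (string, error)) (string, error) {
	if status.HardwareReservationID != "" {
		return status.HardwareReservationID, nil // no API call needed
	}
	id, err := lookup(ctx)
	if err != nil {
		return "", err
	}
	// In a real controller this would be persisted with a status patch.
	status.HardwareReservationID = id
	return id, nil
}

func main() {
	status := &PacketMachineStatus{}
	calls := 0
	// Stand-in for the Equinix Metal lookup across the candidate reservation IDs.
	lookup := func(ctx context.Context) (string, error) {
		calls++
		if calls > 1 {
			return "", errors.New("unexpected repeat API call")
		}
		return "08faf1be-03ba-4d4f-801d-7e28576db73f", nil
	}
	for i := 0; i < 3; i++ {
		id, _ := resolveReservation(context.Background(), status, lookup)
		fmt.Printf("reservation=%s apiCalls=%d\n", id, calls) // lookup runs only once
	}
}
```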
I0914 19:18:41.250461 1 packetmachine_controller.go:227] controller/packetmachine "msg"="Reconciling PacketMachine" "cluster"="live-ewr2-mine-k8s-game" "machine"="k8s-game-cp-plxbw" "name"="k8s-game-cp-f66c50-dmr79" "namespace"="live-pkt-ewr2-mine-k8s-game" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="PacketMachine"
I0914 19:18:41.250491 1 packetmachine_controller.go:232] controller/packetmachine "msg"="Error state detected, skipping reconciliation" "cluster"="live-ewr2-mine-k8s-game" "machine"="k8s-game-cp-plxbw" "name"="k8s-game-cp-f66c50-dmr79" "namespace"="live-pkt-ewr2-mine-k8s-game" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="PacketMachine"
$ k get machines,packetmachines -o wide
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
machine.cluster.x-k8s.io/k8s-game-cp-2g5mv live-ewr2-mine-k8s-game k8s-game-cp-f66c50-q6hsc equinixmetal://0c47f32c-d588-4467-ad1b-9e92ec053a29 Failed 5d1h v1.22.13
machine.cluster.x-k8s.io/k8s-game-cp-gvdfs live-ewr2-mine-k8s-game k8s-game-cp-f66c50-7vx8v equinixmetal://98961a29-be25-4b65-9083-0bbc06fd1e91 Failed 5d2h v1.22.13
machine.cluster.x-k8s.io/k8s-game-cp-plxbw live-ewr2-mine-k8s-game k8s-game-cp-f66c50-dmr79 equinixmetal://5b92e444-e399-453d-b342-3ac015724765 Failed 5d1h v1.22.13
machine.cluster.x-k8s.io/k8s-game-cp-sxpcb live-ewr2-mine-k8s-game k8s-game-cp-f66c50-6h8q4 equinixmetal://5e4a8a41-c2aa-40c2-92dc-d409294b8785 Failed 5d v1.22.13
machine.cluster.x-k8s.io/k8s-game-cp-vmgqq live-ewr2-mine-k8s-game k8s-game-cp-f66c50-qwjt8 equinixmetal://2724700d-800c-448c-931b-88f58d874e99 Failed 5d2h v1.22.13
...
NAME CLUSTER STATE READY INSTANCEID MACHINE
packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-6h8q4 live-ewr2-mine-k8s-game true equinixmetal://5e4a8a41-c2aa-40c2-92dc-d409294b8785 k8s-game-cp-sxpcb
packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-7vx8v live-ewr2-mine-k8s-game true equinixmetal://98961a29-be25-4b65-9083-0bbc06fd1e91 k8s-game-cp-gvdfs
packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-dmr79 live-ewr2-mine-k8s-game true equinixmetal://5b92e444-e399-453d-b342-3ac015724765 k8s-game-cp-plxbw
packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-q6hsc live-ewr2-mine-k8s-game true equinixmetal://0c47f32c-d588-4467-ad1b-9e92ec053a29 k8s-game-cp-2g5mv
packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-qwjt8 live-ewr2-mine-k8s-game true equinixmetal://2724700d-800c-448c-931b-88f58d874e99 k8s-game-cp-vmgqq
...
$ k describe packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-6h8q4
Name: k8s-game-cp-f66c50-6h8q4
Namespace: live-pkt-ewr2-mine-k8s-game
Labels: cluster.x-k8s.io/cluster-name=live-ewr2-mine-k8s-game
cluster.x-k8s.io/control-plane=
Annotations: cluster.x-k8s.io/cloned-from-groupkind: PacketMachineTemplate.infrastructure.cluster.x-k8s.io
cluster.x-k8s.io/cloned-from-name: k8s-game-cp-f66c50
API Version: infrastructure.cluster.x-k8s.io/v1beta1
Kind: PacketMachine
Metadata:
Creation Timestamp: 2022-09-09T15:38:32Z
Finalizers:
packetmachine.infrastructure.cluster.x-k8s.io
Generation: 2
Owner References:
API Version: controlplane.cluster.x-k8s.io/v1beta1
Kind: KubeadmControlPlane
Name: k8s-game-cp
UID: cb220baa-3174-4a9f-b48c-1dcb1e4fb1c6
API Version: cluster.x-k8s.io/v1beta1
Block Owner Deletion: true
Controller: true
Kind: Machine
Name: k8s-game-cp-sxpcb
UID: 757367c7-8089-4e28-8fca-f5e860d156bf
Resource Version: 132976289
UID: 2c85f92b-6718-48fc-9bd1-d6cb29b98a25
Spec:
Billing Cycle: hourly
Hardware Reservation ID: 08faf1be-03ba-4d4f-801d-7e28576db73f,13e84d39-63ac-4d8b-b062-90393d1681ed,3b20e302-33cd-47f9-8199-0d1d2ffab63b,73f673ce-6b6c-4c5a-b5fb-f9690111b35a,7801d3f6-090a-428c-aca2-3d68b41fcccd,bba0b60a-7618-4e4d-aa3e-d49e92440e7c
Machine Type: m3.large.x86
Os: ubuntu_20_04
Provider ID: equinixmetal://5e4a8a41-c2aa-40c2-92dc-d409294b8785
Status:
Addresses:
Address: xxx
Type: ExternalIP
Address: xxx
Type: ExternalIP
Address: xxx
Type: InternalIP
Conditions:
Last Transition Time: 2022-09-13T17:11:47Z
Message: 0 of 1 completed
Reason: InstanceProvisionFailed
Severity: Error
Status: False
Type: Ready
Last Transition Time: 2022-09-13T17:11:47Z
Message: GET https://api.equinix.com/metal/v1/devices/5e4a8a41-c2aa-40c2-92dc-d409294b8785?include=facility: 403 Unexpected Content-Type text/html with status 403 Forbidden
Reason: InstanceProvisionFailed
Severity: Error
Status: False
Type: InstanceReady
Failure Message: device failed to provision: GET https://api.equinix.com/metal/v1/devices/5e4a8a41-c2aa-40c2-92dc-d409294b8785?include=facility: 403 Unexpected Content-Type text/html with status 403 Forbidden
Failure Reason: UpdateError
Instance Status: active
Ready: true
Events: <none>
Environment:
- cluster-api-provider-packet version: 1.15.0, 1.16.0
- Kubernetes version (use kubectl version): 1.22.13
- OS (e.g. from /etc/os-release): Ubuntu 22.04
A rate-limiting 403 (as opposed to a 429) would create other problems: https://github.com/kubernetes-sigs/cluster-api-provider-packet/blob/a6d36083511981e576639920930c011273a9eb37/controllers/packetmachine_controller.go#L276-L281
CAPP would consider the machine to be deleted (since a 403 from the Equinix Metal /devices API would indicate a failed provision).
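A hedged sketch of the distinction being argued for (the deviceGone helper is illustrative, not existing CAPP code): only an explicit 404 is read as "the device is gone", while a 403 or 5xx surfaces as a transient error rather than a failed or deleted machine.

```go
package main

import (
	"fmt"
	"net/http"
)

// deviceGone reports whether an Equinix Metal /devices response should be
// interpreted as "the device no longer exists". Only a 404 qualifies; other
// error statuses (403 from rate limiting, 429, 5xx) should be retried.
func deviceGone(resp *http.Response, err error) bool {
	if err == nil {
		return false
	}
	return resp != nil && resp.StatusCode == http.StatusNotFound
}

func main() {
	err := fmt.Errorf("request failed")
	forbidden := &http.Response{StatusCode: http.StatusForbidden}
	notFound := &http.Response{StatusCode: http.StatusNotFound}
	fmt.Println(deviceGone(forbidden, err)) // false: retry, do not mark the machine Failed
	fmt.Println(deviceGone(notFound, err))  // true: the device really is gone
}
```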
To avoid the rate-limiting scenario, error-based reconcile retries should use a calmer approach. The following posts discuss this pattern for controller-runtime controllers such as CAPP; see the sketch after the links. Any new parameters should be exposed as CAPP configuration options (as many of the other configuration parameters are today).
- https://stuartleeks.com/posts/error-back-off-with-controller-runtime/
- https://danielmangum.com/posts/controller-runtime-client-go-rate-limiting/
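For illustration, a work-queue rate limiter along the lines those posts describe could be built like this; the numbers are placeholders that would ideally become CAPP configuration options, and in a controller-runtime controller it would be wired in via controller.Options{RateLimiter: ...} in SetupWithManager.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// newReconcileRateLimiter combines per-item exponential backoff (so a machine
// that keeps failing is retried less and less often) with an overall token
// bucket that caps how fast reconciles can be requeued in aggregate.
func newReconcileRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		// First failure waits 5s, then 10s, 20s, ... up to a 5m cap.
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 5*time.Minute),
		// At most 10 requeues per second overall, with a burst of 100.
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

func main() {
	rl := newReconcileRateLimiter()
	item := "live-pkt-ewr2-mine-k8s-game/k8s-game-cp-f66c50-dmr79"
	for i := 0; i < 4; i++ {
		// Delay grows on repeated failures for the same item: 5s, 10s, 20s, 40s.
		fmt.Println(rl.When(item))
	}
}
```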
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
/remove-lifecycle rotten
@cprivitere: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten