cluster-api-provider-packet
All machines in Failed status
What steps did you take and what happened:
While working on removing reserved hardware from a MachineDeployment, I was deleting machines while keeping the replica count the same. At some point I was apparently rate-limited and started getting 403s back from the API, which caused all of the machines to show as Failed. I seem to be unable to get the machines back into a healthy state because the Packet provider is skipping reconciliation for them.
As a side note, there is no way to reduce the replica count without other machines being deleted, which makes removing specific reserved hardware difficult. The MachineDeployment selects machines to delete based only on the Random, Oldest, or Newest strategies, regardless of whether unprovisioned machines could be removed instead.
What did you expect to happen: I expected to be able to delete the machines while cluster-api at least remained stable.
Anything else you would like to add: I assume this happened because we have hundreds of reserved hardware IDs and the provider was making API requests for them. Perhaps storing the reservation ID in the Status of each PacketMachine and checking that first before making an API call would reduce requests.
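A minimal sketch of that suggestion, assuming a cached reservation field on the status (the PacketMachineStatus stand-in and its HardwareReservationID field below are hypothetical, not the current CAPP API):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// PacketMachineStatus is a trimmed, hypothetical stand-in for the CAPP status
// struct; the real PacketMachine status does not currently record the chosen
// hardware reservation.
type PacketMachineStatus struct {
	HardwareReservationID string
}

// resolveReservation returns the cached reservation ID when present and only
// falls back to the (expensive) API lookup when the cache is empty.
func resolveReservation(ctx context.Context, status *PacketMachineStatus,
	lookup func(context.Context) (string, error)) (string, error) {
	if status.HardwareReservationID != "" {
		return status.HardwareReservationID, nil // no API call needed
	}
	id, err := lookup(ctx)
	if err != nil {
		return "", err
	}
	// In a real controller this would be persisted with a status patch.
	status.HardwareReservationID = id
	return id, nil
}

func main() {
	status := &PacketMachineStatus{}
	calls := 0
	// Stand-in for the Equinix Metal lookup across the candidate reservation IDs.
	lookup := func(ctx context.Context) (string, error) {
		calls++
		if calls > 1 {
			return "", errors.New("unexpected repeat API call")
		}
		return "08faf1be-03ba-4d4f-801d-7e28576db73f", nil
	}
	for i := 0; i < 3; i++ {
		id, _ := resolveReservation(context.Background(), status, lookup)
		fmt.Printf("reservation=%s apiCalls=%d\n", id, calls) // lookup runs only once
	}
}
```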
I0914 19:18:41.250461 1 packetmachine_controller.go:227] controller/packetmachine "msg"="Reconciling PacketMachine" "cluster"="live-ewr2-mine-k8s-game" "machine"="k8s-game-cp-plxbw" "name"="k8s-game-cp-f66c50-dmr79" "namespace"="live-pkt-ewr2-mine-k8s-game" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="PacketMachine"
I0914 19:18:41.250491 1 packetmachine_controller.go:232] controller/packetmachine "msg"="Error state detected, skipping reconciliation" "cluster"="live-ewr2-mine-k8s-game" "machine"="k8s-game-cp-plxbw" "name"="k8s-game-cp-f66c50-dmr79" "namespace"="live-pkt-ewr2-mine-k8s-game" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="PacketMachine"
$ k get machines,packetmachines -o wide
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
machine.cluster.x-k8s.io/k8s-game-cp-2g5mv live-ewr2-mine-k8s-game k8s-game-cp-f66c50-q6hsc equinixmetal://0c47f32c-d588-4467-ad1b-9e92ec053a29 Failed 5d1h v1.22.13
machine.cluster.x-k8s.io/k8s-game-cp-gvdfs live-ewr2-mine-k8s-game k8s-game-cp-f66c50-7vx8v equinixmetal://98961a29-be25-4b65-9083-0bbc06fd1e91 Failed 5d2h v1.22.13
machine.cluster.x-k8s.io/k8s-game-cp-plxbw live-ewr2-mine-k8s-game k8s-game-cp-f66c50-dmr79 equinixmetal://5b92e444-e399-453d-b342-3ac015724765 Failed 5d1h v1.22.13
machine.cluster.x-k8s.io/k8s-game-cp-sxpcb live-ewr2-mine-k8s-game k8s-game-cp-f66c50-6h8q4 equinixmetal://5e4a8a41-c2aa-40c2-92dc-d409294b8785 Failed 5d v1.22.13
machine.cluster.x-k8s.io/k8s-game-cp-vmgqq live-ewr2-mine-k8s-game k8s-game-cp-f66c50-qwjt8 equinixmetal://2724700d-800c-448c-931b-88f58d874e99 Failed 5d2h v1.22.13
...
NAME CLUSTER STATE READY INSTANCEID MACHINE
packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-6h8q4 live-ewr2-mine-k8s-game true equinixmetal://5e4a8a41-c2aa-40c2-92dc-d409294b8785 k8s-game-cp-sxpcb
packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-7vx8v live-ewr2-mine-k8s-game true equinixmetal://98961a29-be25-4b65-9083-0bbc06fd1e91 k8s-game-cp-gvdfs
packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-dmr79 live-ewr2-mine-k8s-game true equinixmetal://5b92e444-e399-453d-b342-3ac015724765 k8s-game-cp-plxbw
packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-q6hsc live-ewr2-mine-k8s-game true equinixmetal://0c47f32c-d588-4467-ad1b-9e92ec053a29 k8s-game-cp-2g5mv
packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-qwjt8 live-ewr2-mine-k8s-game true equinixmetal://2724700d-800c-448c-931b-88f58d874e99 k8s-game-cp-vmgqq
...
$ k describe packetmachine.infrastructure.cluster.x-k8s.io/k8s-game-cp-f66c50-6h8q4
Name: k8s-game-cp-f66c50-6h8q4
Namespace: live-pkt-ewr2-mine-k8s-game
Labels: cluster.x-k8s.io/cluster-name=live-ewr2-mine-k8s-game
cluster.x-k8s.io/control-plane=
Annotations: cluster.x-k8s.io/cloned-from-groupkind: PacketMachineTemplate.infrastructure.cluster.x-k8s.io
cluster.x-k8s.io/cloned-from-name: k8s-game-cp-f66c50
API Version: infrastructure.cluster.x-k8s.io/v1beta1
Kind: PacketMachine
Metadata:
Creation Timestamp: 2022-09-09T15:38:32Z
Finalizers:
packetmachine.infrastructure.cluster.x-k8s.io
Generation: 2
Owner References:
API Version: controlplane.cluster.x-k8s.io/v1beta1
Kind: KubeadmControlPlane
Name: k8s-game-cp
UID: cb220baa-3174-4a9f-b48c-1dcb1e4fb1c6
API Version: cluster.x-k8s.io/v1beta1
Block Owner Deletion: true
Controller: true
Kind: Machine
Name: k8s-game-cp-sxpcb
UID: 757367c7-8089-4e28-8fca-f5e860d156bf
Resource Version: 132976289
UID: 2c85f92b-6718-48fc-9bd1-d6cb29b98a25
Spec:
Billing Cycle: hourly
Hardware Reservation ID: 08faf1be-03ba-4d4f-801d-7e28576db73f,13e84d39-63ac-4d8b-b062-90393d1681ed,3b20e302-33cd-47f9-8199-0d1d2ffab63b,73f673ce-6b6c-4c5a-b5fb-f9690111b35a,7801d3f6-090a-428c-aca2-3d68b41fcccd,bba0b60a-7618-4e4d-aa3e-d49e92440e7c
Machine Type: m3.large.x86
Os: ubuntu_20_04
Provider ID: equinixmetal://5e4a8a41-c2aa-40c2-92dc-d409294b8785
Status:
Addresses:
Address: xxx
Type: ExternalIP
Address: xxx
Type: ExternalIP
Address: xxx
Type: InternalIP
Conditions:
Last Transition Time: 2022-09-13T17:11:47Z
Message: 0 of 1 completed
Reason: InstanceProvisionFailed
Severity: Error
Status: False
Type: Ready
Last Transition Time: 2022-09-13T17:11:47Z
Message: GET https://api.equinix.com/metal/v1/devices/5e4a8a41-c2aa-40c2-92dc-d409294b8785?include=facility: 403 Unexpected Content-Type text/html with status 403 Forbidden
Reason: InstanceProvisionFailed
Severity: Error
Status: False
Type: InstanceReady
Failure Message: device failed to provision: GET https://api.equinix.com/metal/v1/devices/5e4a8a41-c2aa-40c2-92dc-d409294b8785?include=facility: 403 Unexpected Content-Type text/html with status 403 Forbidden
Failure Reason: UpdateError
Instance Status: active
Ready: true
Events: <none>
Environment:
- cluster-api-provider-packet version: 1.15.0, 1.16.0
- Kubernetes version (use kubectl version): 1.22.13
- OS (e.g. from /etc/os-release): Ubuntu 22.04
A rate-limiting 403 (as opposed to a 429) would create other problems: https://github.com/kubernetes-sigs/cluster-api-provider-packet/blob/a6d36083511981e576639920930c011273a9eb37/controllers/packetmachine_controller.go#L276-L281
CAPP would consider the machine to be deleted (since a 403 from the Equinix Metal /devices API would indicate a failed provision).
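A hedged sketch of the distinction being argued for (the deviceGone helper is illustrative, not existing CAPP code): only an explicit 404 is read as "the device is gone", while a 403 or 5xx surfaces as a transient error rather than a failed or deleted machine.

```go
package main

import (
	"fmt"
	"net/http"
)

// deviceGone reports whether an Equinix Metal /devices response should be
// interpreted as "the device no longer exists". Only a 404 qualifies; other
// error statuses (403 from rate limiting, 429, 5xx) should be retried.
func deviceGone(resp *http.Response, err error) bool {
	if err == nil {
		return false
	}
	return resp != nil && resp.StatusCode == http.StatusNotFound
}

func main() {
	err := fmt.Errorf("request failed")
	forbidden := &http.Response{StatusCode: http.StatusForbidden}
	notFound := &http.Response{StatusCode: http.StatusNotFound}
	fmt.Println(deviceGone(forbidden, err)) // false: retry, do not mark the machine Failed
	fmt.Println(deviceGone(notFound, err))  // true: the device really is gone
}
```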
To avoid the rate-limiting scenario, error-based reconcile retries should use a calmer approach. The following posts discuss this pattern for controller-runtime controllers such as CAPP; see the sketch after the links. Any new parameters should be exposed as CAPP configuration options (as many of the other configuration parameters are today).
- https://stuartleeks.com/posts/error-back-off-with-controller-runtime/
- https://danielmangum.com/posts/controller-runtime-client-go-rate-limiting/
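For illustration, a work-queue rate limiter along the lines those posts describe could be built like this; the numbers are placeholders that would ideally become CAPP configuration options, and in a controller-runtime controller it would be wired in via controller.Options{RateLimiter: ...} in SetupWithManager.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// newReconcileRateLimiter combines per-item exponential backoff (so a machine
// that keeps failing is retried less and less often) with an overall token
// bucket that caps how fast reconciles can be requeued in aggregate.
func newReconcileRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		// First failure waits 5s, then 10s, 20s, ... up to a 5m cap.
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 5*time.Minute),
		// At most 10 requeues per second overall, with a burst of 100.
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

func main() {
	rl := newReconcileRateLimiter()
	item := "live-pkt-ewr2-mine-k8s-game/k8s-game-cp-f66c50-dmr79"
	for i := 0; i < 4; i++ {
		// Delay grows on repeated failures for the same item: 5s, 10s, 20s, 40s.
		fmt.Println(rl.When(item))
	}
}
```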
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
/remove-lifecycle rotten
@cprivitere: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten