cluster-api-provider-aws
AWSMachineTemplate with spot instance generates a false error log
/kind bug
What steps did you take and what happened: When I declare spot instances (spotMarketOptions) on a new AWSMachineTemplate, like this:
--- a/clusters/infrastructure.cluster.x-k8s.io-v1beta1.AWSMachineTemplate-xxx-v1.yaml
+++ b/clusters/infrastructure.cluster.x-k8s.io-v1beta1.AWSMachineTemplate-xxx-v2.yaml
@@ -1,11 +1,13 @@
 apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
 kind: AWSMachineTemplate
 metadata:
-  name: xxx-v1
+  name: xxx-v2
   namespace: flux-system
 spec:
   template:
     spec:
       iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
       instanceType: xxx.medium
+      spotMarketOptions:
+        maxPrice: ""
       sshKeyName: xxx
Everything works fine. The MachineDeployment using this template provisions the new nodes, etc. They are seen by the management cluster and by the workload cluster.
But it generates a false error log on capi-controller-manager:
E0411 12:54:02.362276 1 controller.go:317] controller/machine "msg"="Reconciler error" "error"="machines.cluster.x-k8s.io \"XXXX\" not found" "name"="XXXX" "namespace"="flux-system" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
Even though the nodes and the machines are present, consistent, up and running.
What did you expect to happen:
To not get a false error log message when I use spot instances.
Anything else you would like to add:
When I remove spotMarketOptions, I don't get this false error log.
Environment:
- Cluster-api version: 1.1.3
- Cluster-api-provider-aws version: 1.4.0
- Kubernetes version (use kubectl version): v1.22.6-eks-7d68063
- OS (e.g. from /etc/os-release): EKS
Hey @mloiseleur 👋. Thanks for this issue. I don't think this is really an issue, because while reconciling the machines, the machine controller tries to fetch the machine object from etcd, and if it is not found, it is reconciled again. Until the machine is created, you'll see this error message, and it's not just with spot instances. Even with on-demand instances, you'll see this error.
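For reference, this is roughly the controller-runtime pattern being described; a minimal sketch, not the actual CAPI machine controller code, and the MachineReconciler type below is illustrative. Any error returned from Reconcile is logged by controller-runtime as "Reconciler error" and the request is requeued with backoff.

// Hedged sketch of the reconcile/not-found behaviour described above.
// Type and package names are illustrative, not CAPI's real controller.
package controllers

import (
    "context"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

type MachineReconciler struct {
    client.Client
}

func (r *MachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    machine := &clusterv1.Machine{}
    if err := r.Get(ctx, req.NamespacedName, machine); err != nil {
        if apierrors.IsNotFound(err) {
            // Object not visible (yet): nothing to do for now.
            return ctrl.Result{}, nil
        }
        // Any error returned here is logged by controller-runtime as
        // "Reconciler error" and the request is requeued with backoff,
        // which is the log line quoted in this issue.
        return ctrl.Result{}, err
    }
    // ...normal reconciliation of the Machine continues here...
    return ctrl.Result{}, nil
}

Under this pattern, the message should stop appearing once the object becomes readable; if the lookup keeps failing and the error keeps being surfaced, the same line repeats indefinitely.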
Hey @Ankitasw 👋,
This error message does not appear only until the machine is created.
It never stops filling the logs, even after the new nodes/machines are created, pods are moved onto them and everything is stable.
Without spot instances, this error message ceases when the new node/machine is up and running.
/triage accepted /milestone v1.4.1 /priority important-soon
@sedefsavas: The provided milestone is not valid for this repository. Milestones in this repository: [Backlog, V1.4.1, v0.6.10, v0.7.4, v1.5.0, v1.x, v2.x]
Use /milestone clear to clear the milestone.
In response to this:
/triage accepted /milestone v1.4.1 /priority important-soon
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @Madhur97
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@mloiseleur please do not remove lifecycle/stale. That is the responsibility of the maintainers. Is this issue still reproducible?
@Madhur97 is there a chance you are still working on this? If not, please unassign yourself.
@Ankitasw Yes, this issue is still reproducible.
/unassign @Madhur97
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The issue has been marked as an important bug and triaged. Such issues are automatically marked as frozen when hitting the rotten state to avoid missing important bugs.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle frozen
We are also seeing this error in the (CAPI) metrics: controller_runtime_reconcile_errors_total{controller="machine", namespace="capi-system"}
It's still present in CAPA v2.0.2 and CAPI v1.3.0.
We have spot instance tests running in our CI, which don't flood the logs with this error; you could check out the logs here.
Here is the cluster template used. Could you please check if you are configuring things differently?
Here are the differences: we're also using AWSManagedControlPlane (EKS), arm architecture and a gp3 rootVolume:
--- e2e.yaml 2023-01-06 09:11:39.092292421 +0100
+++ traefik.yaml 2023-01-06 09:13:12.056007510 +0100
@@ -8,7 +8,14 @@
   template:
     spec:
       iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
-      instanceType: t3.large
+      instanceType: t4g.medium
       spotMarketOptions:
         maxPrice: ""
       sshKeyName: cluster-api-provider-aws-sigs-k8s-io
+      additionalTags:
+        Cluster: xxxx
+      ami:
+        id: ami-xxxx
+      rootVolume:
+        size: 20
+        type: gp3
Current log on CAPI controller:
E0106 08:07:57.476618 1 controller.go:326] "Reconciler error"
err="nodes \"ip-10-1-6-141.eu-west-2.compute.internal\" not found"
controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine"
machine="flux-system/s00-euw2-eks-hub-eu-west-2a-md0-7667d9cf59-ctbhl"
namespace="flux-system" name="s00-euw2-eks-hub-eu-west-2a-md0-7667d9cf59-ctbhl"
reconcileID=d6320174-f4dc-4447-999b-12907f03c310
Despite the fact that the AWSMachine is here:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachine
metadata:
  [...]
  name: s00-euw2-eks-hub-479483661-lmwxp
  namespace: flux-system
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Machine
    name: s00-euw2-eks-hub-eu-west-2a-md0-7667d9cf59-ctbhl
    uid: 75706a6e-f96f-4ac6-96de-347f1e2ebe32
  resourceVersion: "104466243"
  uid: 884b566d-6b1d-4bba-abfe-a54471e06d24
spec:
  additionalTags:
    Cluster: s00-euw2-eks-hub
  ami:
    id: ami-0ea2dd5fe1bdc0f29
  cloudInit:
    secureSecretsBackend: secrets-manager
  iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
  instanceID: i-0205dc51f93d0545f
  instanceType: t4g.medium
  providerID: aws:///eu-west-2a/i-0205dc51f93d0545f
  rootVolume:
    size: 20
    type: gp3
  spotMarketOptions:
    maxPrice: ""
  sshKeyName: containous
status:
  addresses:
  - address: ip-10-1-6-141.eu-west-2.compute.internal
    type: InternalDNS
  - address: 10.1.6.141
    type: InternalIP
  conditions:
  - lastTransitionTime: "2023-01-06T05:18:38Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-01-06T05:18:38Z"
    status: "True"
    type: InstanceReady
  - lastTransitionTime: "2023-01-06T05:18:38Z"
    status: "True"
    type: SecurityGroupsReady
  instanceState: running
  interruptible: true
  ready: true
And the Machine also:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: s00-euw2-eks-hub-eu-west-2a-md0-7667d9cf59-ctbhl
  namespace: flux-system
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: s00-euw2-eks-hub-eu-west-2a-md0-7667d9cf59
  [...]
spec:
  [...]
  failureDomain: eu-west-2a
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSMachine
    name: s00-euw2-eks-hub-479483661-lmwxp
    namespace: flux-system
  nodeDeletionTimeout: 10s
  providerID: aws:///eu-west-2a/i-0205dc51f93d0545f
  version: v1.24.7
status:
  addresses:
  - address: ip-10-1-6-141.eu-west-2.compute.internal
    type: InternalDNS
  - address: 10.1.6.141
    type: InternalIP
  bootstrapReady: true
  conditions:
  - lastTransitionTime: "2023-01-06T05:18:39Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-01-06T05:18:20Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2023-01-06T05:19:57Z"
    status: "True"
    type: HealthCheckSucceeded
  - lastTransitionTime: "2023-01-06T05:18:39Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2023-01-06T05:19:57Z"
    status: "True"
    type: NodeHealthy
  infrastructureReady: true
  lastUpdated: "2023-01-06T05:19:16Z"
  nodeInfo:
    architecture: arm64
    containerRuntimeVersion: containerd://1.6.6
    kernelVersion: 5.4.219-126.411.amzn2.aarch64
    kubeProxyVersion: v1.24.7-eks-fb459a0
    kubeletVersion: v1.24.7-eks-fb459a0
    machineID: ec2d0a7a0e62ed08f85acdee7c2dbccd
    operatingSystem: linux
    osImage: Amazon Linux 2
  nodeRef:
    apiVersion: v1
    kind: Node
    name: ip-10-1-6-141.eu-west-2.compute.internal
  observedGeneration: 3
  phase: Running
And the Node ip-10-1-6-141.eu-west-2.compute.internal is here on the workload cluster:
apiVersion: v1
kind: Node
metadata:
  [...]
spec:
  providerID: aws:///eu-west-2a/i-0205dc51f93d0545f
status:
  addresses:
  - address: 10.1.6.141
    type: InternalIP
  - address: ip-10-1-6-141.eu-west-2.compute.internal
    type: Hostname
  - address: ip-10-1-6-141.eu-west-2.compute.internal
    type: InternalDNS
  allocatable:
    attachable-volumes-aws-ebs: "39"
    cpu: 1930m
    ephemeral-storage: "18233774458"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 3384668Ki
    pods: "110"
  capacity:
    attachable-volumes-aws-ebs: "39"
    cpu: "2"
    ephemeral-storage: 20949996Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 3939676Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2023-01-06T09:02:17Z"
    lastTransitionTime: "2023-01-06T05:19:16Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2023-01-06T09:02:17Z"
    lastTransitionTime: "2023-01-06T05:19:16Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2023-01-06T09:02:17Z"
    lastTransitionTime: "2023-01-06T05:19:16Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2023-01-06T09:02:17Z"
    lastTransitionTime: "2023-01-06T05:19:56Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  [...]
  nodeInfo:
    architecture: arm64
    containerRuntimeVersion: containerd://1.6.6
    kernelVersion: 5.4.219-126.411.amzn2.aarch64
    kubeProxyVersion: v1.24.7-eks-fb459a0
    kubeletVersion: v1.24.7-eks-fb459a0
    machineID: ec2d0a7a0e62ed08f85acdee7c2dbccd
    operatingSystem: linux
    osImage: Amazon Linux 2
We've run into the same issue. Any idea when this might be prioritized? I'd be happy to try and contribute a fix if nobody else has already.
@daveortiz-ctct please go ahead and assign it to yourself if you are willing to contribute.
We've not found any way to fix it, but with:
- Cluster API: v1.4.1
- AWS Provider: v2.0.2
this issue has vanished.