cluster-api-provider-aws

AWSMachineTemplate with spot instance generates false error log

Status: Open. mloiseleur opened this issue 2 years ago • 11 comments

/kind bug

What steps did you take and what happened: When I declare spot instances (spotMarketOptions) on a new AWSMachineTemplate, like this:

--- a/clusters/infrastructure.cluster.x-k8s.io-v1beta1.AWSMachineTemplate-xxx-v1.yaml
+++ b/clusters/infrastructure.cluster.x-k8s.io-v1beta1.AWSMachineTemplate-xxx-v2.yaml
@@ -1,11 +1,13 @@
 apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
 kind: AWSMachineTemplate
 metadata:
-  name: xxx-v1
+  name: xxx-v2
   namespace: flux-system
 spec:
   template:
     spec:
       iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
       instanceType: xxx.medium
+      spotMarketOptions:
+        maxPrice: ""
       sshKeyName: xxx

Everything works fine. The MachineDeployment using this template provisions the new nodes, etc. They are seen by both the management cluster and the workload cluster.

But it generates a false error log on the capi-controller-manager:

E0411 12:54:02.362276       1 controller.go:317] controller/machine "msg"="Reconciler error" "error"="machines.cluster.x-k8s.io \"XXXX\" not found" "name"="XXXX" "namespace"="flux-system" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"

Even though the nodes and the machines are there, consistent, and up and running.

What did you expect to happen:

Not to get a false error log message when I use spot instances.

Anything else you would like to add:

When I remove spotMarketOptions, I don't get this false error log.

Environment:

  • Cluster-api version: 1.1.3
  • Cluster-api-provider-aws version: 1.4.0
  • Kubernetes version (kubectl version): v1.22.6-eks-7d68063
  • OS (e.g. from /etc/os-release): EKS

mloiseleur avatar Apr 11 '22 13:04 mloiseleur

Hey @mloiseleur 👋 . Thanks for this issue. I don't think this is really an issue, because while reconciling machines, the machine controller tries to fetch the Machine object from etcd, and if it is not found, the request is reconciled again. Until the machine is created, you'll see this error message, and it's not just with spot instances; you'll see it with on-demand instances as well.
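
Roughly, the mechanism is the standard controller-runtime one: any error returned from Reconcile is logged as "Reconciler error" and the request is requeued with backoff. Here is a simplified sketch of that pattern (not the actual CAPI machine controller code; the type and package names are illustrative):

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// MachineReconciler is a stand-in for the real machine controller.
type MachineReconciler struct {
	client.Client
}

func (r *MachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	machine := &clusterv1.Machine{}
	if err := r.Get(ctx, req.NamespacedName, machine); err != nil {
		// Until the Machine object exists in etcd, this Get returns a
		// NotFound error. If that error is returned from Reconcile (rather
		// than swallowed with an IsNotFound check), controller-runtime logs
		// it as "Reconciler error ... not found" and requeues the request
		// with backoff, which is the log line shown above.
		return ctrl.Result{}, err
	}

	// ... normal reconciliation of the Machine and its infrastructure ...
	return ctrl.Result{}, nil
}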

Ankitasw avatar Apr 11 '22 13:04 Ankitasw

Hey @Ankitasw 👋,

This error message does not stop once the machine is created.

It keeps filling the logs even after the new nodes/machines are created, pods have been moved onto them, and everything is stable.

Without spot instances, this error message ceases once the new node/machine is up and running.

mloiseleur avatar Apr 11 '22 14:04 mloiseleur

/triage accepted /milestone v1.4.1 /priority important-soon

sedefsavas avatar Apr 12 '22 04:04 sedefsavas

@sedefsavas: The provided milestone is not valid for this repository. Milestones in this repository: [Backlog, V1.4.1, v0.6.10, v0.7.4, v1.5.0, v1.x, v2.x]

Use /milestone clear to clear the milestone.

In response to this:

/triage accepted /milestone v1.4.1 /priority important-soon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 12 '22 04:04 k8s-ci-robot

/assign @Madhur97

sedefsavas avatar Apr 12 '22 04:04 sedefsavas

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 11 '22 05:07 k8s-triage-robot

/remove-lifecycle stale

mloiseleur avatar Jul 25 '22 12:07 mloiseleur

@mloiseleur please do not remove lifecycle/stale. That is the responsibility of maintainers. Is this issue still reproducible?

Ankitasw avatar Jul 25 '22 12:07 Ankitasw

@Madhur97 is there a chance you are still working on this? If not, please unassign yourself.

Ankitasw avatar Jul 25 '22 12:07 Ankitasw

@Ankitasw Yes, this issue is still reproducible.

mloiseleur avatar Jul 25 '22 14:07 mloiseleur

/unassign @Madhur97

Ankitasw avatar Jul 26 '22 11:07 Ankitasw

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 24 '22 12:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 23 '22 13:11 k8s-triage-robot

The issue has been marked as an important bug and triaged. Such issues are automatically marked as frozen when hitting the rotten state to avoid missing important bugs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle frozen

k8s-triage-robot avatar Nov 23 '22 15:11 k8s-triage-robot

We are also seeing this error in the (CAPI) metrics: controller_runtime_reconcile_errors_total{controller="machine", namespace="capi-system"}. It's still present in v2.0.2 of CAPA and v1.3.0 of CAPI.

mloiseleur avatar Dec 30 '22 15:12 mloiseleur

We have spot instance tests running in our CI, which don't flood the logs with this error; you can check out the logs here.

Here is the cluster template used. Could you please check whether your configuration differs?

Ankitasw avatar Jan 06 '23 07:01 Ankitasw

Here are the differences: we're also using AWSManagedControlPlane (EKS), the arm64 architecture, and a gp3 rootVolume:

--- e2e.yaml	2023-01-06 09:11:39.092292421 +0100
+++ traefik.yaml	2023-01-06 09:13:12.056007510 +0100
@@ -8,7 +8,14 @@
   template:
     spec:
       iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
-      instanceType: t3.large
+      instanceType: t4g.medium
       spotMarketOptions:
         maxPrice: ""
       sshKeyName: cluster-api-provider-aws-sigs-k8s-io
+      additionalTags:
+        Cluster: xxxx
+      ami:
+        id: ami-xxxx
+      rootVolume:
+        size: 20
+        type: gp3

Current log on CAPI controller:

E0106 08:07:57.476618       1 controller.go:326] "Reconciler error" 
err="nodes \"ip-10-1-6-141.eu-west-2.compute.internal\" not found" 
controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" 
machine="flux-system/s00-euw2-eks-hub-eu-west-2a-md0-7667d9cf59-ctbhl" 
namespace="flux-system" name="s00-euw2-eks-hub-eu-west-2a-md0-7667d9cf59-ctbhl" 
reconcileID=d6320174-f4dc-4447-999b-12907f03c310

Even though the AWSMachine exists:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachine
metadata:
  [...]
  name: s00-euw2-eks-hub-479483661-lmwxp
  namespace: flux-system
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Machine
    name: s00-euw2-eks-hub-eu-west-2a-md0-7667d9cf59-ctbhl
    uid: 75706a6e-f96f-4ac6-96de-347f1e2ebe32
  resourceVersion: "104466243"
  uid: 884b566d-6b1d-4bba-abfe-a54471e06d24
spec:
  additionalTags:
    Cluster: s00-euw2-eks-hub
  ami:
    id: ami-0ea2dd5fe1bdc0f29
  cloudInit:
    secureSecretsBackend: secrets-manager
  iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
  instanceID: i-0205dc51f93d0545f
  instanceType: t4g.medium
  providerID: aws:///eu-west-2a/i-0205dc51f93d0545f
  rootVolume:
    size: 20
    type: gp3
  spotMarketOptions:
    maxPrice: ""
  sshKeyName: containous
status:
  addresses:
  - address: ip-10-1-6-141.eu-west-2.compute.internal
    type: InternalDNS
  - address: 10.1.6.141
    type: InternalIP
  conditions:
  - lastTransitionTime: "2023-01-06T05:18:38Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-01-06T05:18:38Z"
    status: "True"
    type: InstanceReady
  - lastTransitionTime: "2023-01-06T05:18:38Z"
    status: "True"
    type: SecurityGroupsReady
  instanceState: running
  interruptible: true
  ready: true

And so does the Machine:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: s00-euw2-eks-hub-eu-west-2a-md0-7667d9cf59-ctbhl
  namespace: flux-system
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: s00-euw2-eks-hub-eu-west-2a-md0-7667d9cf59
   [...]
spec:
[...]
  failureDomain: eu-west-2a
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSMachine
    name: s00-euw2-eks-hub-479483661-lmwxp
    namespace: flux-system
  nodeDeletionTimeout: 10s
  providerID: aws:///eu-west-2a/i-0205dc51f93d0545f
  version: v1.24.7
status:
  addresses:
  - address: ip-10-1-6-141.eu-west-2.compute.internal
    type: InternalDNS
  - address: 10.1.6.141
    type: InternalIP
  bootstrapReady: true
  conditions:
  - lastTransitionTime: "2023-01-06T05:18:39Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-01-06T05:18:20Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2023-01-06T05:19:57Z"
    status: "True"
    type: HealthCheckSucceeded
  - lastTransitionTime: "2023-01-06T05:18:39Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2023-01-06T05:19:57Z"
    status: "True"
    type: NodeHealthy
  infrastructureReady: true
  lastUpdated: "2023-01-06T05:19:16Z"
  nodeInfo:
    architecture: arm64
    containerRuntimeVersion: containerd://1.6.6
    kernelVersion: 5.4.219-126.411.amzn2.aarch64
    kubeProxyVersion: v1.24.7-eks-fb459a0
    kubeletVersion: v1.24.7-eks-fb459a0
    machineID: ec2d0a7a0e62ed08f85acdee7c2dbccd
    operatingSystem: linux
    osImage: Amazon Linux 2
  nodeRef:
    apiVersion: v1
    kind: Node
    name: ip-10-1-6-141.eu-west-2.compute.internal
  observedGeneration: 3
  phase: Running

And the Node ip-10-1-6-141.eu-west-2.compute.internal is present on the workload cluster:

apiVersion: v1
kind: Node
metadata:
[...]
spec:
  providerID: aws:///eu-west-2a/i-0205dc51f93d0545f
status:
  addresses:
  - address: 10.1.6.141
    type: InternalIP
  - address: ip-10-1-6-141.eu-west-2.compute.internal
    type: Hostname
  - address: ip-10-1-6-141.eu-west-2.compute.internal
    type: InternalDNS
  allocatable:
    attachable-volumes-aws-ebs: "39"
    cpu: 1930m
    ephemeral-storage: "18233774458"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 3384668Ki
    pods: "110"
  capacity:
    attachable-volumes-aws-ebs: "39"
    cpu: "2"
    ephemeral-storage: 20949996Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 3939676Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2023-01-06T09:02:17Z"
    lastTransitionTime: "2023-01-06T05:19:16Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2023-01-06T09:02:17Z"
    lastTransitionTime: "2023-01-06T05:19:16Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2023-01-06T09:02:17Z"
    lastTransitionTime: "2023-01-06T05:19:16Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2023-01-06T09:02:17Z"
    lastTransitionTime: "2023-01-06T05:19:56Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
[...]
  nodeInfo:
    architecture: arm64
    containerRuntimeVersion: containerd://1.6.6
    kernelVersion: 5.4.219-126.411.amzn2.aarch64
    kubeProxyVersion: v1.24.7-eks-fb459a0
    kubeletVersion: v1.24.7-eks-fb459a0
    machineID: ec2d0a7a0e62ed08f85acdee7c2dbccd
    operatingSystem: linux
    osImage: Amazon Linux 2

mloiseleur avatar Jan 06 '23 09:01 mloiseleur

We've run into the same issue. Any idea when this might be prioritized? I'd be happy to try to contribute a fix if nobody else has already.

daveortiz-ctct avatar Mar 03 '23 19:03 daveortiz-ctct

@daveortiz-ctct please go ahead and assign it to yourself if you are willing to contribute.

Ankitasw avatar Mar 06 '23 06:03 Ankitasw

We never found a way to fix it directly, but with:

  1. Cluster API: v1.4.1
  2. AWS Provider: v2.0.2

the issue has vanished.

mloiseleur avatar Apr 26 '23 08:04 mloiseleur