
Bug: Autoscaler stuck

bernardhalas opened this issue 11 months ago · 4 comments

Claudie 0.9.2

Current Behaviour

The autoscaler seems to be stuck after terraformer finishes; there are no signs of activity in kube-eleven.

Expected Behaviour

4 nodes are added to the cluster.

Steps To Reproduce

A simple nginx deployment was created: kubectl create deployment nginx --image=nginx

Then resources.requests of cpu: 1 and memory: 1Gi were added, and the deployment was scaled up to 6 replicas: kubectl scale deployment/nginx --replicas=6
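For reference, the resource requests described above would correspond to a deployment spec along these lines (a sketch only; the exact manifest used in the report is not shown):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 6                 # scaled up from 1 via kubectl scale
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          resources:
            requests:
              cpu: "1"        # requests that exceed existing capacity,
              memory: 1Gi     # forcing the autoscaler to add nodes
```

With these requests, the existing e2-medium nodes cannot fit all 6 replicas, which is what should trigger the autoscaled nodepool (compute-2-gcp) to grow.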

The terraformer logs show the 4 new nodes being created successfully. The builder log contains just:

2025-01-04T20:18:07Z INF Using log with the level "info" module=builder
2025-01-04T20:18:12Z WRN Waiting for all dependent services to be healthy module=builder
2025-01-04T20:18:17Z INF All dependent services are now healthy module=builder

And kube-eleven shows no indication of an up-scale being needed after the initial cluster creation:

2025-01-04T16:59:49Z INF Kubernetes cluster was successfully build cluster=gcp-cluster-ummzddq module=kube-eleven project=claudie-gcp-example-manifest

Deleting the builder and kube-eleven pods doesn't seem to change the state of things.

This was executed on the following InputManifest:

apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: gcp-example-manifest
  namespace: claudie
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  providers:
    - name: gcp-1
      providerType: gcp
      secretRef:
        name: gcp-secret-1
        namespace: claudie

  nodePools:
    dynamic:
      - name: control-gcp
        providerSpec:
          name: gcp-1
          region: europe-west1
          zone: europe-west1-c
        count: 1
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206

      - name: compute-1-gcp
        providerSpec:
          name: gcp-1
          region: europe-west3
          zone: europe-west3-a
        count: 2
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206
        storageDiskSize: 50

      - name: compute-2-gcp
        providerSpec:
          name: gcp-1
          region: europe-west2
          zone: europe-west2-a
        autoscaler:
          min: 0
          max: 5
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206
        storageDiskSize: 50

  kubernetes:
    clusters:
      - name: gcp-cluster
        version: v1.29.0
        network: 192.168.2.0/24
        pools:
          control:
            - control-gcp
          compute:
            - compute-1-gcp
            - compute-2-gcp

bernardhalas avatar Jan 04 '25 21:01 bernardhalas

@bernardhalas Did the builder restart? There is a ~3 hour difference based on the logs:

2025-01-04T20:18:07Z INF Using log with the level "info" module=builder
2025-01-04T20:18:12Z WRN Waiting for all dependent services to be healthy module=builder
2025-01-04T20:18:17Z INF All dependent services are now healthy module=builder
2025-01-04T16:59:49Z INF Kubernetes cluster was successfully build cluster=gcp-cluster-ummzddq module=kube-eleven project=claudie-gcp-example-manifest

Despire avatar Jan 06 '25 06:01 Despire

Deleting the builder and kube-eleven pods doesn't seem to change the state of things.

Yes, the pod was force-restarted, and the messages were the same.

I tried to reproduce this a few times, but couldn't. I saw similar behavior once when the autoscaler down-sized a nodepool, but that also occurred only once. I'll spend more time on this if the situation allows; otherwise we'll close this as unreproducible.

bernardhalas avatar Jan 07 '25 09:01 bernardhalas

@bernardhalas

I assume the following happened: the builder service was restarted, whether by you or by an OOM kill (https://github.com/berops/claudie/issues/1512).

When this happens, the manifest will not be rescheduled again for 2 hours, which I think is wrong. An issue was created for this a long time ago: https://github.com/berops/claudie/issues/1316

Hard to say without logs of the crashed builder pod, though.

Despire avatar Jan 07 '25 09:01 Despire

The builder was restarted intentionally, because the autoscaler was already stuck before that. Apologies for the confusion caused. So the problem occurred (the VM was created but not added to the cluster), and after ~3 hours I restarted the builder to see if it would fix the problem. It didn't help.

bernardhalas avatar Jan 07 '25 09:01 bernardhalas