Bug: Autoscaler stuck
Claudie 0.9.2
Current Behaviour
The autoscaler appears stuck after terraformer finishes; there are no signs of activity in kube-eleven.
Expected Behaviour
4 nodes are added to the cluster.
Steps To Reproduce
A simple nginx deployment was created:
kubectl create deployment nginx --image=nginx
Added resources.requests of cpu: 1 and memory: 1Gi, then scaled the deployment up to 6 replicas:
kubectl scale deployment/nginx --replicas=6
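The resources step above is described but not shown. A minimal sketch of the patch it implies (the container name and patch shape are assumptions; only the cpu/memory request values come from the report):

```yaml
# Hypothetical patch for the nginx deployment; only the request values
# (cpu: 1, memory: 1Gi) are taken from the report above.
spec:
  template:
    spec:
      containers:
        - name: nginx
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
```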
The terraformer logs show 4 new nodes being created successfully. The builder log contains only:
2025-01-04T20:18:07Z INF Using log with the level "info" module=builder
2025-01-04T20:18:12Z WRN Waiting for all dependent services to be healthy module=builder
2025-01-04T20:18:17Z INF All dependent services are now healthy module=builder
And kube-eleven shows no indication of the needed up-scale after the initial cluster creation:
2025-01-04T16:59:49Z INF Kubernetes cluster was successfully build cluster=gcp-cluster-ummzddq module=kube-eleven project=claudie-gcp-example-manifest
Deleting the builder and kube-eleven pods does not seem to change the state of things.
This was executed on the following InputManifest:
apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: gcp-example-manifest
  namespace: claudie
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  providers:
    - name: gcp-1
      providerType: gcp
      secretRef:
        name: gcp-secret-1
        namespace: claudie
  nodePools:
    dynamic:
      - name: control-gcp
        providerSpec:
          name: gcp-1
          region: europe-west1
          zone: europe-west1-c
        count: 1
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206
      - name: compute-1-gcp
        providerSpec:
          name: gcp-1
          region: europe-west3
          zone: europe-west3-a
        count: 2
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206
        storageDiskSize: 50
      - name: compute-2-gcp
        providerSpec:
          name: gcp-1
          region: europe-west2
          zone: europe-west2-a
        autoscaler:
          min: 0
          max: 5
        serverType: e2-medium
        image: ubuntu-os-cloud/ubuntu-2204-jammy-v20221206
        storageDiskSize: 50
  kubernetes:
    clusters:
      - name: gcp-cluster
        version: v1.29.0
        network: 192.168.2.0/24
        pools:
          control:
            - control-gcp
          compute:
            - compute-1-gcp
            - compute-2-gcp
@bernardhalas Did builder restart? There is a ~3-hour difference based on the logs:
2025-01-04T20:18:07Z INF Using log with the level "info" module=builder
2025-01-04T20:18:12Z WRN Waiting for all dependent services to be healthy module=builder
2025-01-04T20:18:17Z INF All dependent services are now healthy module=builder
2025-01-04T16:59:49Z INF Kubernetes cluster was successfully build cluster=gcp-cluster-ummzddq module=kube-eleven project=claudie-gcp-example-manifest
Deleting the builder and kube-eleven pods does not seem to change the state of things.
Yes, the pod has been force-restarted. And the messages were the same.
I tried to reproduce this a few times, but I couldn't. I saw similar behavior once when the autoscaler was down-sizing the nodepool, but that also occurred just once. I'll spend more time on this if the situation allows; otherwise we'll close this as unreproducible.
@bernardhalas
I assume the following happened: the builder service was restarted, either by you or by the OOM killer (https://github.com/berops/claudie/issues/1512).
When this happens, the manifest will not be rescheduled again for 2 hours, which I think is wrong; an issue was created for this a long time ago (https://github.com/berops/claudie/issues/1316).
Hard to say without the logs of the crashed builder pod, though.
The builder was restarted intentionally, as the autoscaler was already stuck before. Apologies for the confusion caused. So the problem occurred (the VM was created but not added to the cluster), and after ~3 hours I restarted the builder to see if it could fix the problem. It didn't help.