cluster-api-provider-azure
Cluster-Autoscaler test
We should test the cluster-autoscaler + CAPZ integration by running periodic tests that create a CAPZ cluster, install the autoscaler, and validate that the autoscaler scales the cluster machines up and down as expected when workloads are deployed and deleted.
Enabling autoscaler for cluster-api: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/clusterapi#enabling-autoscaling
Some prior art: https://github.com/Azure/aks-engine/blob/master/test/e2e/kubernetes/kubernetes_test.go#L2559
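A rough sketch of what the scale-up / scale-down validation could look like (everything here is a placeholder, not an existing test; it assumes the workload cluster kubeconfig is at ./kubeconfig and that the autoscaler is already installed):
# record the starting node count, then create pressure so pods go Pending
export KUBECONFIG=./kubeconfig
BEFORE=$(kubectl get nodes --no-headers | wc -l)
kubectl create deployment busybox --image busybox
kubectl scale deploy busybox --replicas 300
# wait for the node count to grow past the starting count (scale up)
until [ "$(kubectl get nodes --no-headers | wc -l)" -gt "$BEFORE" ]; do sleep 30; done
# remove the pressure and wait for the cluster to scale back down
kubectl delete deployment busybox
until [ "$(kubectl get nodes --no-headers | wc -l)" -le "$BEFORE" ]; do sleep 60; done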
@jackfrancis this might be an interesting one to you
/help
@CecileRobertMichon: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign
Notes: the out-of-the-box spec provided here isn't working yet:
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/clusterapi
I'm stuck on this error message in the cluster-autoscaler pod logs:
W0401 20:12:38.794974 1 static_autoscaler.go:798] Couldn't find template for node group MachineDeployment/default/francis-default2-md-0
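A few commands that might help narrow this down (the namespace and MachineDeployment name are taken from the error message above; the ./kubeconfig path for the workload cluster is an assumption):
# confirm the min/max node group size annotations are actually on the MachineDeployment
kubectl get machinedeployment francis-default2-md-0 -n default -o yaml | grep cluster-api-autoscaler-node-group
# check that the machines exist and their nodes have registered in the workload cluster
kubectl get machines -n default
kubectl --kubeconfig=./kubeconfig get nodes -o wide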
Finally got around to testing this, here are my observations:
My setup was a kind management cluster and one CAPZ workload cluster in the default namespace.
Steps:
- create a secret with the workload cluster kubeconfig (assuming the workload cluster kubeconfig is at ./kubeconfig):
kubectl create secret generic kubeconfig --from-file=kubeconfig -n kube-system
- create a file cluster-autoscaler.yaml. NOTE: this was taken directly from https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/examples/deployment.yaml, with only the args and volume mount modified for my workload cluster kubeconfig:
args:
- --cloud-provider=clusterapi
- --kubeconfig=/mnt/kubeconfig
- --clusterapi-cloud-config-authoritative
volumeMounts:
- mountPath: "/mnt"
name: kubeconfig
readOnly: true
- on the management cluster:
kubectl annotate md --all cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size='2'
kubectl annotate md --all cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size='6'
export AUTOSCALER_NS=kube-system
export AUTOSCALER_IMAGE=us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.19.1
envsubst < cluster-autoscaler.yaml | kubectl apply -f-
NOTE: I used cluster-autoscaler 1.19 because my workload cluster is on Kubernetes 1.19, but one thing that wasn't clear to me is whether the workload cluster or the management cluster Kubernetes version should match the CAS version.
made sure the cluster-autoscaler pod was running
- on the workload cluster:
kubectl create deployment busybox --image busybox
kubectl scale deploy busybox --replicas 300
watch a bunch of busybox pods get created and eventually some get stuck in pending
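Some commands that are handy for watching this happen (the deployment name cluster-autoscaler comes from the example manifest, so treat it as an assumption):
# on the workload cluster: watch pods pile up in Pending
kubectl get pods --field-selector=status.phase=Pending -w
# on the management cluster: watch the MachineDeployment replica count change
kubectl get machinedeployments -w
# and list the Machines being created
kubectl get machines
# follow the autoscaler logs on the management cluster
kubectl -n kube-system logs deploy/cluster-autoscaler -f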
What happened:
- CAS did scale up my machine deployment as expected:
I0406 03:41:35.029169 1 scale_up.go:663] Scale-up: setting group MachineDeployment/default/test-cluster-md-0 size to 3
I0406 03:41:48.656485 1 scale_up.go:663] Scale-up: setting group MachineDeployment/default/test-cluster-md-0 size to 4
- As new machines are getting created, a bunch of
W0406 03:42:00.431937 1 clusterapi_controller.go:454] Machine "test-cluster-arc-md-0-644b6b79b7-ln75c" has no providerID
type messages are logged: this is normal. That's because the new machine is still coming up and the backing VM is not yet running, so it doesn't yet have a provider ID.
- this is what my nodes look like after ~20 minutes:
test-cluster-control-plane-8fg94 Ready master 5m22s v1.19.9
test-cluster-control-plane-dpgg5 Ready master 24m v1.19.9
test-cluster-control-plane-pss2t Ready master 14m v1.19.9
test-cluster-md-0-4hm9p Ready <none> 24m v1.19.9
test-cluster-md-0-5856m Ready <none> 7h19m v1.19.9
test-cluster-md-0-np7d7 Ready <none> 24m v1.19.9
test-cluster-md-0-rlgj6 Ready <none> 31m v1.19.9
(the cluster originally had one worker node and the control plane nodes were rolled because of a KCP update)
- after a while, cluster-autoscaler logs just look like this:
I0406 04:06:30.401179 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0406 04:06:30.427592 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 22.4348ms
I0406 04:08:30.303103 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0406 04:08:30.465262 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 153.8689ms
I0406 04:10:30.350449 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0406 04:10:30.383659 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 33.1352ms
- I did see one crash of the cluster-autoscaler pod that ended with these logs:
I0406 03:53:36.954309 1 scale_down.go:930] Scale-down: removing node test-cluster-md-0-np7d7, utilization: {0.125 0 0 cpu 0.125}, pods to reschedule: default/busybox-6cf8756958-xdm8z,default/busybox-6cf8756958-xnnff,default/busybox-6cf8756958-26zq6,default/httpd-757fb56c8d-dgjlh,default/httpd-757fb56c8d-xljpz,default/httpd-757fb56c8d-8tnqt,default/busybox-6cf8756958-972vc,default/busybox-6cf8756958-kfqwc,default/httpd-757fb56c8d-rdp28,default/httpd-757fb56c8d-6g9k9,default/busybox-6cf8756958-qsglp,default/httpd-757fb56c8d-8pcgb,default/httpd-757fb56c8d-b9mbk,default/busybox-6cf8756958-vf5mc,default/httpd-757fb56c8d-ljhf6,default/httpd-757fb56c8d-zjkql,default/busybox-6cf8756958-nlxpr,default/httpd-757fb56c8d-76v5h,default/httpd-757fb56c8d-l7h9w,default/httpd-757fb56c8d-hp5wx,default/httpd-757fb56c8d-vhg56,default/httpd-757fb56c8d-l5gqh,default/httpd-757fb56c8d-xr4nl,default/busybox-6cf8756958-vfrbs,default/httpd-757fb56c8d-lhvzd,default/httpd-757fb56c8d-95v82,default/busybox-6cf8756958-gwmqs,default/busybox-6cf8756958-pm4mk,default/busybox-6cf8756958-wm2gl,default/httpd-757fb56c8d-vfgt2,default/busybox-6cf8756958-w5glz,default/httpd-757fb56c8d-9l8l4,default/httpd-757fb56c8d-vfbbc,default/httpd-757fb56c8d-lhkph,default/httpd-757fb56c8d-f424m,default/busybox-6cf8756958-ggh54,default/busybox-6cf8756958-qzf6q,default/httpd-757fb56c8d-29bcj,default/httpd-757fb56c8d-d5q28,default/httpd-757fb56c8d-vx48w
I0406 03:53:38.392811 1 request.go:645] Throttling request took 1.1394141s, request: POST:https://test-cluster-3b683832.southcentralus.cloudapp.azure.com:6443/api/v1/namespaces/default/pods/httpd-757fb56c8d-6g9k9/eviction
E0406 03:53:43.674899 1 scale_down.go:1238] Not deleted yet default/busybox-6cf8756958-xdm8z
E0406 03:53:49.027730 1 scale_down.go:1238] Not deleted yet default/httpd-757fb56c8d-dgjlh
E0406 03:53:54.331126 1 scale_down.go:1238] Not deleted yet default/httpd-757fb56c8d-dgjlh
E0406 03:54:07.003522 1 leaderelection.go:321] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://test-cluster-3b683832.southcentralus.cloudapp.azure.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": context deadline exceeded
I0406 03:54:07.003691 1 leaderelection.go:278] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0406 03:54:07.011543 1 main.go:435] lost master
goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.stacks(0xc00000e001, 0xc0007e2ee0, 0x37, 0xd9)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:996 +0xb8
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.(*loggingT).output(0x58b8780, 0xc000000003, 0x0, 0x0, 0xc002927c70, 0x57c7755, 0x7, 0x1b3, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:945 +0x19d
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.(*loggingT).printf(0x58b8780, 0x3, 0x0, 0x0, 0x359029c, 0xb, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:733 +0x17b
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.Fatalf(...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1456
main.main.func3()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:435 +0x73
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000548900)
Not sure exactly what happened there, but it looks like it might have been affected by my KCP update, which was causing the control plane nodes to do a rolling update in the background. It did restart and recover. Need to investigate some more.
- I did not see any logs like "Removing unregistered node" anywhere
Eventually, I deleted the busybox deployment and my extra nodes got deleted:
I0406 04:39:44.993338 1 scale_down.go:1053] Scale-down: removing empty node test-cluster-arc-md-0-rlgj6
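On the management cluster side, the scale-down can be double-checked by confirming the MachineDeployment replica count settles back at the annotated minimum (a sketch; test-cluster-md-0 is the MachineDeployment name from this setup):
# should end up back at the min-size annotation value (2 in this setup)
kubectl get machinedeployment test-cluster-md-0 -o jsonpath='{.spec.replicas}{"\n"}'
# show the min/max size annotations that were set earlier
kubectl get machinedeployment test-cluster-md-0 -o yaml | grep cluster-api-autoscaler-node-group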
Update: I did run into the unregistered node issue eventually, and I think I've narrowed it down to the provider ID of the node not matching the provider ID of the Machine.
The cloud provider sets the ID by doing azure + :// + ID:
https://github.com/kubernetes-sigs/cloud-provider-azure/blob/28f04006af77503360c42c5c3930ea707b897108/pkg/node/node.go#L63
Whereas CAPZ sets the ID with 3 slashes: azure + :/// + ID:
https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/azure/services/virtualmachines/virtualmachines.go#L97
The ID itself starts with a /, so it ends up being 3 vs 4 slashes.
There was an attempt to fix this a while ago in https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/655 but I believe it wasn't fixed properly because it didn't account for the leading slash in the ID.
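A quick way to see the mismatch side by side (the node name is one from the output above; ./kubeconfig is assumed to point at the workload cluster):
# provider ID as set by the Azure cloud provider on the Node (workload cluster)
kubectl --kubeconfig=./kubeconfig get node test-cluster-md-0-4hm9p -o jsonpath='{.spec.providerID}{"\n"}'
# provider ID as set by CAPZ on the Machine objects (management cluster)
kubectl get machines -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'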
thanks for the pointer here @CecileRobertMichon. i'm curious to see how this turns out and if there is any way i can help from the autoscaler side, please don't hesitate to reach out =)
edit: also, i really would like to see this PR become the future direction of autoscaler e2e testing. but, we are a little ways away from that still. this gist covers some of my early thinking about how to improve.
Status:
https://github.com/jackfrancis/cluster-api-provider-azure/blob/cluster-autoscaler-test/scripts/cluster-autoscaler-test.sh
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Status:
https://github.com/jackfrancis/cluster-api-provider-azure/blob/cluster-autoscaler-test/scripts/cluster-autoscaler-test.sh
Nice work @jackfrancis. Do you have an ETA for a PR?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@CecileRobertMichon happy to help out here if i can =)
/remove-lifecycle stale
+1 myself as well, I've gathered enough scars in capz E2E land over the past several months that I definitely feel qualified to make quick progress on this
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.