
Cluster-Autoscaler test

Open · CecileRobertMichon opened this issue 3 years ago · 19 comments

We should test the cluster-autoscaler + CAPZ integration by running periodic tests that create a CAPZ cluster, install the autoscaler, and validate that the autoscaler scales the cluster machines up and down as expected when workloads are deployed / deleted.

Enabling autoscaler for cluster-api: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/clusterapi#enabling-autoscaling

some prior art: https://github.com/Azure/aks-engine/blob/master/test/e2e/kubernetes/kubernetes_test.go#L2559
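
To make the scope concrete, here is a rough sketch of what the periodic job could do (all names, sizes, and timeouts below are placeholders, not an existing job; the job would also need to create the CAPZ cluster and install cluster-autoscaler per the doc linked above):

# 1. mark the MachineDeployment as autoscalable on the management cluster
kubectl annotate machinedeployment --all cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size='1'
kubectl annotate machinedeployment --all cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size='5'

# 2. generate load on the workload cluster and expect the node count to grow
before=$(kubectl --kubeconfig ./kubeconfig get nodes --no-headers | wc -l)
kubectl --kubeconfig ./kubeconfig create deployment busybox --image busybox
kubectl --kubeconfig ./kubeconfig scale deployment busybox --replicas 300
# ... poll until the node count exceeds $before, failing after a timeout ...

# 3. delete the workload and expect the extra nodes to be scaled back down
kubectl --kubeconfig ./kubeconfig delete deployment busybox
# ... poll until the node count returns to $before, failing after a timeout ...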

CecileRobertMichon avatar Mar 29 '21 23:03 CecileRobertMichon

@jackfrancis this might be an interesting one for you

CecileRobertMichon avatar Mar 29 '21 23:03 CecileRobertMichon

/help

CecileRobertMichon avatar Mar 29 '21 23:03 CecileRobertMichon

@CecileRobertMichon: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 29 '21 23:03 k8s-ci-robot

/assign

jackfrancis avatar Mar 29 '21 23:03 jackfrancis

Notes: the out-of-the-box spec provided here isn't working yet:

https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/clusterapi

I'm stuck on this error message in the cluster-autoscaler pod logs:

W0401 20:12:38.794974 1 static_autoscaler.go:798] Couldn't find template for node group MachineDeployment/default/francis-default2-md-0
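
A couple of things that may be worth inspecting when that warning shows up (a debugging sketch, not a known fix; the MachineDeployment name is taken from the log line above):

# check that the MachineDeployment carries the autoscaler min/max annotations
kubectl -n default get machinedeployment francis-default2-md-0 -o jsonpath='{.metadata.annotations}{"\n"}'

# check which infrastructure template the MachineDeployment points at
kubectl -n default get machinedeployment francis-default2-md-0 -o jsonpath='{.spec.template.spec.infrastructureRef.kind}{"/"}{.spec.template.spec.infrastructureRef.name}{"\n"}'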

jackfrancis avatar Apr 01 '21 20:04 jackfrancis

Finally got around to testing this, here are my observations:

My setup was a kind management cluster and one CAPZ workload cluster in the default namespace.

Steps:

  1. (assuming the workload cluster kubeconfig is at ./kubeconfig) kubectl create secret generic kubeconfig --from-file=kubeconfig -n kube-system

  2. create a file cluster-autoscaler.yaml. NOTE: this was taken directly from https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/examples/deployment.yaml, with only the args and the volume mount modified to point at my workload cluster kubeconfig:

        args:
        - --cloud-provider=clusterapi
        - --kubeconfig=/mnt/kubeconfig
        - --clusterapi-cloud-config-authoritative
        volumeMounts:   
        - mountPath: "/mnt"
          name: kubeconfig
          readOnly: true
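        # NOTE (an assumption based on the upstream example manifest, added here
        # for completeness): the volumeMounts above need a matching volume at the
        # pod spec level (alongside containers) that references the "kubeconfig"
        # secret created in step 1, roughly:
        #
        #   volumes:
        #   - name: kubeconfig
        #     secret:
        #       secretName: kubeconfig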
  3. on the management cluster:
 kubectl annotate md --all cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size='2'
 kubectl annotate md --all cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size='6'
 export AUTOSCALER_NS=kube-system
 export AUTOSCALER_IMAGE=us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.19.1
 envsubst < cluster-autoscaler.yaml | kubectl apply -f-

NOTE: I used cluster-autoscaler 1.19 because my workload cluster is on k8s 1.19, but one thing that wasn't clear to me is whether the CAS version should match the workload cluster k8s version or the management cluster k8s version.

made sure the cluster-autoscaler pod was running (a quick check is sketched after these steps)

  4. on the workload cluster:
kubectl create deployment busybox --image busybox
kubectl scale deploy busybox --replicas 300

watch a bunch of busybox pods get created and eventually some get stuck in pending
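
The quick check mentioned above, to confirm cluster-autoscaler is up before generating load (a sketch that assumes the Deployment from the upstream example manifest is named cluster-autoscaler and was applied into $AUTOSCALER_NS):

kubectl -n "$AUTOSCALER_NS" get deploy cluster-autoscaler
kubectl -n "$AUTOSCALER_NS" logs deploy/cluster-autoscaler --tail=20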

What happened:

  • CAS did scale up my machine deployment as expected:
I0406 03:41:35.029169       1 scale_up.go:663] Scale-up: setting group MachineDeployment/default/test-cluster-md-0 size to 3
I0406 03:41:48.656485       1 scale_up.go:663] Scale-up: setting group MachineDeployment/default/test-cluster-md-0 size to 4
  • As new machines are getting created, a bunch of W0406 03:42:00.431937 1 clusterapi_controller.go:454] Machine "test-cluster-arc-md-0-644b6b79b7-ln75c" has no providerID messages are logged: this is normal. That's because the new machine is still coming up and the VM backing it is not yet running, so it doesn't yet have a provider ID.
  • this is what my nodes looked like after ~20 minutes:
test-cluster-control-plane-8fg94   Ready    master   5m22s   v1.19.9
test-cluster-control-plane-dpgg5   Ready    master   24m     v1.19.9
test-cluster-control-plane-pss2t   Ready    master   14m     v1.19.9
test-cluster-md-0-4hm9p            Ready    <none>   24m     v1.19.9
test-cluster-md-0-5856m            Ready    <none>   7h19m   v1.19.9
test-cluster-md-0-np7d7            Ready    <none>   24m     v1.19.9
test-cluster-md-0-rlgj6            Ready    <none>   31m     v1.19.9

(the cluster originally had one worker node; the control plane nodes were rolled because of a KCP update)

  • after a while, cluster-autoscaler logs just look like this:
I0406 04:06:30.401179       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0406 04:06:30.427592       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 22.4348ms
I0406 04:08:30.303103       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0406 04:08:30.465262       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 153.8689ms
I0406 04:10:30.350449       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0406 04:10:30.383659       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 33.1352ms
  • I did see one crash of the cluster-autoscaler pod that ended with these logs:
I0406 03:53:36.954309       1 scale_down.go:930] Scale-down: removing node test-cluster-md-0-np7d7, utilization: {0.125 0 0 cpu 0.125}, pods to reschedule: default/busybox-6cf8756958-xdm8z,default/busybox-6cf8756958-xnnff,default/busybox-6cf8756958-26zq6,default/httpd-757fb56c8d-dgjlh,default/httpd-757fb56c8d-xljpz,default/httpd-757fb56c8d-8tnqt,default/busybox-6cf8756958-972vc,default/busybox-6cf8756958-kfqwc,default/httpd-757fb56c8d-rdp28,default/httpd-757fb56c8d-6g9k9,default/busybox-6cf8756958-qsglp,default/httpd-757fb56c8d-8pcgb,default/httpd-757fb56c8d-b9mbk,default/busybox-6cf8756958-vf5mc,default/httpd-757fb56c8d-ljhf6,default/httpd-757fb56c8d-zjkql,default/busybox-6cf8756958-nlxpr,default/httpd-757fb56c8d-76v5h,default/httpd-757fb56c8d-l7h9w,default/httpd-757fb56c8d-hp5wx,default/httpd-757fb56c8d-vhg56,default/httpd-757fb56c8d-l5gqh,default/httpd-757fb56c8d-xr4nl,default/busybox-6cf8756958-vfrbs,default/httpd-757fb56c8d-lhvzd,default/httpd-757fb56c8d-95v82,default/busybox-6cf8756958-gwmqs,default/busybox-6cf8756958-pm4mk,default/busybox-6cf8756958-wm2gl,default/httpd-757fb56c8d-vfgt2,default/busybox-6cf8756958-w5glz,default/httpd-757fb56c8d-9l8l4,default/httpd-757fb56c8d-vfbbc,default/httpd-757fb56c8d-lhkph,default/httpd-757fb56c8d-f424m,default/busybox-6cf8756958-ggh54,default/busybox-6cf8756958-qzf6q,default/httpd-757fb56c8d-29bcj,default/httpd-757fb56c8d-d5q28,default/httpd-757fb56c8d-vx48w
I0406 03:53:38.392811       1 request.go:645] Throttling request took 1.1394141s, request: POST:https://test-cluster-3b683832.southcentralus.cloudapp.azure.com:6443/api/v1/namespaces/default/pods/httpd-757fb56c8d-6g9k9/eviction
E0406 03:53:43.674899       1 scale_down.go:1238] Not deleted yet default/busybox-6cf8756958-xdm8z
E0406 03:53:49.027730       1 scale_down.go:1238] Not deleted yet default/httpd-757fb56c8d-dgjlh
E0406 03:53:54.331126       1 scale_down.go:1238] Not deleted yet default/httpd-757fb56c8d-dgjlh
E0406 03:54:07.003522       1 leaderelection.go:321] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://test-cluster-3b683832.southcentralus.cloudapp.azure.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": context deadline exceeded
I0406 03:54:07.003691       1 leaderelection.go:278] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0406 03:54:07.011543       1 main.go:435] lost master
goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.stacks(0xc00000e001, 0xc0007e2ee0, 0x37, 0xd9)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:996 +0xb8
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.(*loggingT).output(0x58b8780, 0xc000000003, 0x0, 0x0, 0xc002927c70, 0x57c7755, 0x7, 0x1b3, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:945 +0x19d
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.(*loggingT).printf(0x58b8780, 0x3, 0x0, 0x0, 0x359029c, 0xb, 0x0, 0x0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:733 +0x17b
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2.Fatalf(...)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1456
main.main.func3()
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:435 +0x73
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000548900)

Not sure exactly what happened there, but it looks like it might have been affected by my KCP update, which was causing the control plane nodes to do a rolling update in the background. It did restart and recover. Need to investigate some more.

  • I did not see any logs like Removing unregistered node anywhere

Eventually, I deleted the busybox deployment and my extra nodes got deleted: I0406 04:39:44.993338 1 scale_down.go:1053] Scale-down: removing empty node test-cluster-arc-md-0-rlgj6

CecileRobertMichon avatar Apr 06 '21 04:04 CecileRobertMichon

Update: I did run into the unregistered node issue eventually, and I think I've narrowed it down to the provider ID of the node not matching the provider ID of the Machine.

The cloud provider builds the ID as azure + :// + ID:

https://github.com/kubernetes-sigs/cloud-provider-azure/blob/28f04006af77503360c42c5c3930ea707b897108/pkg/node/node.go#L63

Whereas CAPZ builds the ID with three slashes, azure + :/// + ID:

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/azure/services/virtualmachines/virtualmachines.go#L97

The ID itself starts with a /, so the resulting provider IDs end up with 3 vs. 4 slashes.
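
One way to see the mismatch directly (a sketch; the first command runs against the workload cluster, the second against the management cluster):

# providerID the Azure cloud provider set on each Node, e.g. azure:///subscriptions/... (3 slashes)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'

# providerID CAPZ set on each Machine, which at the time ended up as azure:////subscriptions/... (4 slashes)
kubectl get machines -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'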

There was an attempt to fix this a while ago in https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/655 but I believe it wasn't fixed properly because it didn't account for the leading slash in the ID.

CecileRobertMichon avatar Apr 06 '21 05:04 CecileRobertMichon

thanks for the pointer here @CecileRobertMichon. i'm curious to see how this turns out and if there is any way i can help from the autoscaler side, please don't hesitate to reach out =)

edit: also, i really would like to see this PR become the future direction of autoscaler e2e testing. but, we are a little ways away from that still. this gist covers some of my early thinking about how to improve.

elmiko avatar Apr 14 '21 19:04 elmiko

Status:

https://github.com/jackfrancis/cluster-api-provider-azure/blob/cluster-autoscaler-test/scripts/cluster-autoscaler-test.sh

jackfrancis avatar Apr 20 '21 19:04 jackfrancis

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Jul 19 '21 19:07 fejta-bot

/remove-lifecycle stale

devigned avatar Jul 19 '21 20:07 devigned

Status:

https://github.com/jackfrancis/cluster-api-provider-azure/blob/cluster-autoscaler-test/scripts/cluster-autoscaler-test.sh

Nice work @jackfrancis. Do you have an ETA for a PR?

fiunchinho avatar Sep 21 '21 10:09 fiunchinho

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 20 '21 11:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 19 '22 11:01 k8s-triage-robot

/remove-lifecycle rotten

CecileRobertMichon avatar Feb 04 '22 18:02 CecileRobertMichon

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 05 '22 18:05 k8s-triage-robot

@CecileRobertMichon happy to help out here if i can =)

/remove-lifecycle stale

elmiko avatar May 05 '22 18:05 elmiko

+1 myself as well, I've gathered enough scars in capz E2E land over the past several months that I definitely feel qualified to make quick progress on this

jackfrancis avatar May 05 '22 20:05 jackfrancis

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 03 '22 21:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 02 '22 21:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Oct 02 '22 21:10 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 02 '22 21:10 k8s-ci-robot