cloud-provider-azure Improvements to reduce rate limiting with flex scalesets

What type of PR is this?

/kind bug

What this PR does / why we need it:

The implementation of Azure flexible scalesets in the cloud controller manager causes a high rate of API rate limiting when used at scale, due to the volume of calls made to retrieve instances. This has the knock-on effect of causing instances to be deleted from Kubernetes, because the error is passed down and the instance is handled as not found: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/371c1509418bdad8cbe23ff2d093f762e9abcf60/pkg/provider/azure_vmssflex_cache.go#L146-L157

This PR implements a new NodeExistsByProviderID method on the scaleset. For the uniform and standard implementations it should behave the same as it does currently. For flex, we extract the VM name from the provider ID and get the VM from the VM cache (which will call GetVirtualMachine if the entry is uncached). This saves us pulling the scaleset cache and listing every VM when InstanceExists is invoked by the node lifecycle controller.

The PR also changes the power status and provisioning status checks to use a provider ID instead of a node name, for the same reason: in flex we can easily retrieve the desired VM just by taking the VM name from the provider ID.
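To illustrate the flex lookup described above, here is a minimal sketch and not the code in this PR. It assumes the flex provider ID has the form `azure:///subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm>`. The names `vmCache`, `getVM`, `errVMNotFound`, `vmNameFromProviderID` and `nodeExistsByProviderID` are hypothetical stand-ins for the provider's VM cache and single-VM GET.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// errVMNotFound is a hypothetical sentinel meaning "the VM does not exist".
var errVMNotFound = errors.New("vm not found")

// flexProviderIDRE matches the assumed provider ID shape of a flex VM:
// azure:///subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm>
var flexProviderIDRE = regexp.MustCompile(
	`(?i)^azure:///subscriptions/[^/]+/resourceGroups/[^/]+/providers/Microsoft\.Compute/virtualMachines/([^/]+)$`)

// vmNameFromProviderID extracts the VM name from a flex provider ID.
func vmNameFromProviderID(providerID string) (string, error) {
	m := flexProviderIDRE.FindStringSubmatch(providerID)
	if m == nil {
		return "", fmt.Errorf("not a flex VM provider ID: %q", providerID)
	}
	return m[1], nil
}

// vmCache is a hypothetical per-VM cache. getVM stands in for a single-VM GET
// and is only called when the entry is not cached yet.
type vmCache struct {
	known map[string]bool
	getVM func(name string) error
	calls int // number of simulated API calls
}

// exists returns (false, nil) only for a definite not-found; any other
// failure is returned as an error.
func (c *vmCache) exists(name string) (bool, error) {
	if c.known[name] {
		return true, nil
	}
	c.calls++
	if err := c.getVM(name); err != nil {
		if errors.Is(err, errVMNotFound) {
			return false, nil
		}
		return false, err
	}
	c.known[name] = true
	return true, nil
}

// nodeExistsByProviderID resolves one VM by name instead of listing every
// VM in every scaleset.
func nodeExistsByProviderID(c *vmCache, providerID string) (bool, error) {
	name, err := vmNameFromProviderID(providerID)
	if err != nil {
		return false, err
	}
	return c.exists(name)
}

func main() {
	cache := &vmCache{
		known: map[string]bool{},
		getVM: func(name string) error {
			if name == "vm-1" {
				return nil
			}
			return errVMNotFound
		},
	}
	id := "azure:///subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm-1"
	for i := 0; i < 3; i++ {
		ok, err := nodeExistsByProviderID(cache, id)
		fmt.Println(ok, err)
	}
	fmt.Println("API calls:", cache.calls)
}
```

In this sketch, repeated existence checks for the same node cost at most one simulated API call, regardless of how many scalesets exist.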

Which issue(s) this PR fixes:

Fixes https://github.com/kubernetes-sigs/cloud-provider-azure/issues/2880

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Aug 30 '23 18:08 gcampbell12

Hi @gcampbell12. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Aug 30 '23 18:08 k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gcampbell12

Once this PR has been reviewed and has the lgtm label, please assign andyzhangx for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Aug 30 '23 18:08 k8s-ci-robot

/ok-to-test

Aug 30 '23 20:08 odinuge

/retest

Sep 06 '23 11:09 gcampbell12

Thanks for the contribution! But I'm a little concerned about this PR, as it would impact all of the VM types.

Could you share the CCM logs for the VMSS flex nodes, especially showing why the Nodes were deleted? When a cache refresh fails, the controller should continue to refresh until it succeeds.

Sep 08 '23 05:09 feiskyer

@feiskyer I'll get some logs to you later but to explain what we are seeing:

If we look at GetNodeNameByProviderID, it calls fs.getNodeNameByVMName, which retrieves a list of every VMSS and ranges over them, calling ListVmssFlexVMsWithoutInstanceView and ListVmssFlexVMsWithOnlyInstanceView for each scaleset. If you have, say, 50 scalesets and a couple of hundred nodes, you soon get rate limited and Azure returns 429s (I believe the number of VMs returned in these responses also contributes to the rate limits being used up). This error is propagated from the vmclient back to here: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/d9bb39d2e81d766416a5cee932403d246927de32/pkg/provider/azure_vmssflex_cache.go#L84-L87

And here it just ends up being logged, and we carry on looping over the rest of the VMSSes (using up even more of the rate limit): https://github.com/kubernetes-sigs/cloud-provider-azure/blob/d9bb39d2e81d766416a5cee932403d246927de32/pkg/provider/azure_vmssflex_cache.go#L146-L149

Eventually the getter fails and is run again with a forced cache refresh, which also fails due to the volume of calls, and we return cloudprovider.InstanceNotFound: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/d9bb39d2e81d766416a5cee932403d246927de32/pkg/provider/azure_vmssflex_cache.go#L157

Here we return false to the node lifecycle controller: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/d9bb39d2e81d766416a5cee932403d246927de32/pkg/provider/azure_instances.go#L205-L207

That causes the node lifecycle controller to delete the node from k8s: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/d9bb39d2e81d766416a5cee932403d246927de32/vendor/k8s.io/cloud-provider/controllers/nodelifecycle/node_lifecycle_controller.go#L161-L178
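The chain above can be sketched in a few lines of Go. This is a simplified model, not the provider code. `listVMs`, `lookupSwallowingErrors`, `lookupPropagatingErrors` and `instanceExists` are hypothetical names. The second lookup is only one possible way to keep a throttled refresh distinguishable from a missing VM, and is not necessarily what this PR does.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errRateLimited      = errors.New("rate limited (429)")
	errInstanceNotFound = errors.New("instance not found")
)

// listVMs is a hypothetical stand-in for the per-scaleset list calls.
type listVMs func(vmss string) ([]string, error)

// lookupSwallowingErrors models the behaviour described above: a failed list
// call is only logged, so a throttled refresh ends up looking like "VM absent".
func lookupSwallowingErrors(scalesets []string, list listVMs, vm string) error {
	for _, ss := range scalesets {
		vms, err := list(ss)
		if err != nil {
			continue // logged and ignored; the loop keeps spending quota
		}
		for _, v := range vms {
			if v == vm {
				return nil
			}
		}
	}
	return errInstanceNotFound
}

// lookupPropagatingErrors remembers the first list failure and returns it
// instead of claiming the VM is gone.
func lookupPropagatingErrors(scalesets []string, list listVMs, vm string) error {
	var listErr error
	for _, ss := range scalesets {
		vms, err := list(ss)
		if err != nil {
			if listErr == nil {
				listErr = err
			}
			continue
		}
		for _, v := range vms {
			if v == vm {
				return nil
			}
		}
	}
	if listErr != nil {
		return fmt.Errorf("cannot determine whether %q exists: %w", vm, listErr)
	}
	return errInstanceNotFound
}

// instanceExists maps a lookup result onto an (exists, error) pair. Only a
// definite not-found yields (false, nil), the case in which the node is deleted.
func instanceExists(lookupErr error) (bool, error) {
	if errors.Is(lookupErr, errInstanceNotFound) {
		return false, nil
	}
	if lookupErr != nil {
		return false, lookupErr
	}
	return true, nil
}

func main() {
	scalesets := []string{"vmss-1", "vmss-2"}
	throttled := func(string) ([]string, error) { return nil, errRateLimited }

	fmt.Println(instanceExists(lookupSwallowingErrors(scalesets, throttled, "vm-1")))
	fmt.Println(instanceExists(lookupPropagatingErrors(scalesets, throttled, "vm-1")))
}
```

With every list call throttled, the first variant reports the VM as missing with no error, while the second reports an error so the caller can tell the lookup was inconclusive.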

Since I've made some improvements to the underlying cache methods in flex, I might be able to revert the change that adds new interfaces for all the scalesets and just fix the flex ones in place. I'll look at that later.

Sep 08 '23 06:09 gcampbell12

@feiskyer It's actually pretty easy to replicate this if you rate limit yourself on the client side, e.g.:

ERROR [2023-09-07T15:46:55.990557043Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-1] (pid: 10)
ERROR [2023-09-07T15:46:55.990567562Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990585285Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-2] (pid: 10)
ERROR [2023-09-07T15:46:55.990602967Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990630277Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-3] (pid: 10)
ERROR [2023-09-07T15:46:55.990645964Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990661875Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-4] (pid: 10)
ERROR [2023-09-07T15:46:55.990700825Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990714065Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-5] (pid: 10)
INFO  [2023-09-07T15:46:55.990728857Z] sigs.k8s.io/cloud-provider-azure/node_lifecycle_controller.go:164: deleting node since it is no longer present in cloud provider: [k8s-node-name] (pid: 10)
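
As a rough model of why a small client-side read budget produces a run of failures like the one above, here is a minimal sketch. It uses a deliberately simplified token bucket with no refill. `tokenBucket`, `listVMs`, `refreshAll` and `errClientRateLimited` are hypothetical names, not the provider's actual rate limiter or its configuration.

```go
package main

import (
	"errors"
	"fmt"
)

// errClientRateLimited is a hypothetical sentinel for a call rejected locally.
var errClientRateLimited = errors.New("client-side rate limited (read)")

// tokenBucket is a simplified limiter: a fixed budget with no refill, which is
// enough to show what happens once the budget is exhausted.
type tokenBucket struct{ tokens int }

// listVMs stands in for one list call against a scaleset; it spends one token.
func (b *tokenBucket) listVMs(vmss string) error {
	if b.tokens <= 0 {
		return fmt.Errorf("list VMs for %s: %w", vmss, errClientRateLimited)
	}
	b.tokens--
	return nil
}

// refreshAll issues one list call per scaleset and returns the scalesets
// whose call was throttled.
func refreshAll(b *tokenBucket, scalesets []string) []string {
	var throttled []string
	for _, ss := range scalesets {
		if err := b.listVMs(ss); errors.Is(err, errClientRateLimited) {
			throttled = append(throttled, ss)
		}
	}
	return throttled
}

func main() {
	bucket := &tokenBucket{tokens: 2}
	scalesets := []string{"vmss-1", "vmss-2", "vmss-3", "vmss-4", "vmss-5"}
	fmt.Println("throttled:", refreshAll(bucket, scalesets))
}
```

Once the budget is spent, every remaining scaleset in the loop fails in the same pass, which matches the pattern of back-to-back errors in the log.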

Sep 08 '23 07:09 gcampbell12

@gcampbell12: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cloud-provider-azure-e2e-ccm-vmss-capz	b48a0964321671b7cf9b83cd2e6ce76c638f655a	link	true	`/test pull-cloud-provider-azure-e2e-ccm-vmss-capz`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Oct 31 '23 03:10 k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jan 13 '24 08:01 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Apr 12 '24 09:04 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

May 12 '24 10:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Jun 11 '24 10:06 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jun 11 '24 10:06 k8s-ci-robot
