cloud-provider-azure
Improvements to reduce rate limiting with flex scalesets
What type of PR is this?
/kind bug
What this PR does / why we need it:
The implementation of Azure flexible scale sets in the cloud controller manager triggers heavy API rate limiting at scale because of the volume of calls made to retrieve instances. This has the knock-on effect of deleting instances from Kubernetes, because the error is passed down and the instance is handled as not found: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/371c1509418bdad8cbe23ff2d093f762e9abcf60/pkg/provider/azure_vmssflex_cache.go#L146-L157
This PR implements a new NodeExistsByProviderID method on the scale set. For the uniform and standard implementations it should behave the same as it does currently. For flex, we extract the VM name from the provider ID and get the VM from the VM cache (which calls GetVirtualMachine if the VM is not cached). This saves us pulling the scale set cache and listing every VM when InstanceExists is invoked by the node lifecycle controller. The PR also changes the power status and provisioning status checks to use a provider ID instead of a node name, for the same reason: in flex we can retrieve the desired VM just by taking the VM name from the provider ID.
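As a rough illustration of the "extract the VM name from the provider ID" step, here is a minimal sketch. It is not the code from this PR: the helper name vmNameFromProviderID is hypothetical, and the provider ID layout (azure:/// followed by the ARM resource ID of a Microsoft.Compute/virtualMachines resource) is an assumption about how flex VMs are identified.

```go
package main

import (
	"fmt"
	"regexp"
)

// vmProviderIDRE matches a provider ID that points directly at a
// Microsoft.Compute/virtualMachines resource and captures the VM name.
// The layout is an assumption for this sketch, not taken from the PR.
var vmProviderIDRE = regexp.MustCompile(
	`(?i)^azure:///subscriptions/[^/]+/resourceGroups/[^/]+/providers/Microsoft\.Compute/virtualMachines/([^/]+)$`)

// vmNameFromProviderID (hypothetical helper) returns the VM name embedded
// in the provider ID, or an error if the ID has a different shape, for
// example a uniform scale set instance ID.
func vmNameFromProviderID(providerID string) (string, error) {
	m := vmProviderIDRE.FindStringSubmatch(providerID)
	if m == nil {
		return "", fmt.Errorf("provider ID %q does not reference a virtual machine directly", providerID)
	}
	return m[1], nil
}

func main() {
	id := "azure:///subscriptions/sub-id/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/flex-node-0"
	name, err := vmNameFromProviderID(id)
	fmt.Println(name, err)
}
```

With the name in hand, a single per-VM cache lookup can replace a listing of every scale set.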
Which issue(s) this PR fixes:
Fixes https://github.com/kubernetes-sigs/cloud-provider-azure/issues/2880
Special notes for your reviewer:
Does this PR introduce a user-facing change?
NONE
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
Hi @gcampbell12. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: gcampbell12 Once this PR has been reviewed and has the lgtm label, please assign andyzhangx for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/ok-to-test
/retest
Thanks for the contribution! But I'm a little concerned about this PR, as it would impact all of the VM types.
Could you share the CCM logs for the VMSS flex nodes, especially any that show why the nodes were deleted? When a cache refresh fails, the controller should continue to refresh until it succeeds.
@feiskyer I'll get some logs to you later, but to explain what we are seeing:
GetNodeNameByProviderID calls fs.getNodeNameByVMName, which retrieves a list of every VMSS and ranges through them, calling ListVmssFlexVMsWithoutInstanceView and ListVmssFlexVMsWithOnlyInstanceView for each scale set. If you have, say, 50 scale sets and a couple of hundred nodes, you soon get rate limited and Azure returns 429s (I believe the number of VMs returned in these responses also contributes to the rate limit being used up). This error is propagated from the vmclient back to here: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/d9bb39d2e81d766416a5cee932403d246927de32/pkg/provider/azure_vmssflex_cache.go#L84-L87
Here it just ends up being logged, and we carry on looping over the rest of the VMSSs (using up even more of the rate limit):
https://github.com/kubernetes-sigs/cloud-provider-azure/blob/d9bb39d2e81d766416a5cee932403d246927de32/pkg/provider/azure_vmssflex_cache.go#L146-L149
Eventually the getter fails and is run again with a forced cache refresh, which also fails due to the volume of calls, and we return cloudprovider.InstanceNotFound:
https://github.com/kubernetes-sigs/cloud-provider-azure/blob/d9bb39d2e81d766416a5cee932403d246927de32/pkg/provider/azure_vmssflex_cache.go#L157
Here we return false to the node lifecycle controller (https://github.com/kubernetes-sigs/cloud-provider-azure/blob/d9bb39d2e81d766416a5cee932403d246927de32/pkg/provider/azure_instances.go#L205-L207), which causes the node lifecycle controller to delete the node from k8s (https://github.com/kubernetes-sigs/cloud-provider-azure/blob/d9bb39d2e81d766416a5cee932403d246927de32/vendor/k8s.io/cloud-provider/controllers/nodelifecycle/node_lifecycle_controller.go#L161-L178).
Since I've made some improvements to the underlying cache methods in flex, I might be able to revert the change that adds new interfaces for all the scale sets and just fix the flex ones in place. I'll look at that later.
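The failure mode described in this comment comes down to a transient error being reported as "not found". The sketch below shows one way an existence check could keep the two cases apart. It is an illustration only, not code from this PR: instanceExists, lookupFunc, and both sentinel errors are hypothetical stand-ins (errInstanceNotFound plays the role of cloudprovider.InstanceNotFound), and it assumes the caller skips a node when the check returns an error rather than deleting it.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for this sketch. errInstanceNotFound stands
// in for cloudprovider.InstanceNotFound; errRateLimited stands in for a
// throttling (HTTP 429) error from the VM client.
var (
	errInstanceNotFound = errors.New("instance not found")
	errRateLimited      = errors.New("rate limited")
)

// lookupFunc is a hypothetical per-VM lookup. It returns nil when the VM
// exists, errInstanceNotFound when it definitely does not, and any other
// error when the answer is unknown.
type lookupFunc func(providerID string) error

// instanceExists reports false only on a definite "not found". Any other
// error is returned to the caller so that a throttled lookup is never
// mistaken for a deleted VM.
func instanceExists(lookup lookupFunc, providerID string) (bool, error) {
	err := lookup(providerID)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errInstanceNotFound):
		return false, nil
	default:
		return false, err
	}
}

func main() {
	throttled := func(string) error {
		return fmt.Errorf("get VM: %w", errRateLimited)
	}
	exists, err := instanceExists(throttled, "azure:///example")
	fmt.Println(exists, err)
}
```

Here the throttled lookup surfaces as an error instead of a clean "does not exist" answer, so a caller can tell the two apart.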
@feiskyer It's actually pretty easy to replicate this if you apply client-side rate limiting to yourself, e.g.:
ERROR [2023-09-07T15:46:55.990557043Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-1] (pid: 10)
ERROR [2023-09-07T15:46:55.990567562Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990585285Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-2] (pid: 10)
ERROR [2023-09-07T15:46:55.990602967Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990630277Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-3] (pid: 10)
ERROR [2023-09-07T15:46:55.990645964Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990661875Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-4] (pid: 10)
ERROR [2023-09-07T15:46:55.990700825Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990714065Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-5] (pid: 10)
INFO [2023-09-07T15:46:55.990728857Z] sigs.k8s.io/cloud-provider-azure/node_lifecycle_controller.go:164: deleting node since it is no longer present in cloud provider: [k8s-node-name] (pid: 10)
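The log above shows a failed list call for one scale set after another before the node is deleted. To make the difference in call volume concrete, here is a toy model that counts API calls for a single existence check under the two approaches discussed in this thread. Everything in it is hypothetical (fakeAzure, its methods, and the naming scheme), and the figure of two list calls per scale set is taken from the description of ListVmssFlexVMsWithoutInstanceView and ListVmssFlexVMsWithOnlyInstanceView earlier in the thread.

```go
package main

import "fmt"

// fakeAzure is a hypothetical stand-in for the VM client that only counts
// how many API calls were made.
type fakeAzure struct {
	scaleSets map[string][]string // scale set name -> VM names
	calls     int
}

// newFake builds n scale sets with vmsPerSet VMs each.
func newFake(n, vmsPerSet int) *fakeAzure {
	f := &fakeAzure{scaleSets: make(map[string][]string)}
	for i := 0; i < n; i++ {
		ss := fmt.Sprintf("vmss-%d", i)
		for j := 0; j < vmsPerSet; j++ {
			f.scaleSets[ss] = append(f.scaleSets[ss], fmt.Sprintf("%s-vm-%d", ss, j))
		}
	}
	return f
}

// listVMs models one list call against a scale set.
func (f *fakeAzure) listVMs(scaleSet string) []string {
	f.calls++
	return f.scaleSets[scaleSet]
}

// getVM models one get call for a single VM by name.
func (f *fakeAzure) getVM(name string) bool {
	f.calls++
	for _, vms := range f.scaleSets {
		for _, vm := range vms {
			if vm == name {
				return true
			}
		}
	}
	return false
}

// existsByListing refreshes every scale set with two list calls each
// (one without and one with the instance view) and then searches.
func existsByListing(f *fakeAzure, name string) bool {
	found := false
	for ss := range f.scaleSets {
		vms := f.listVMs(ss)
		f.listVMs(ss)
		for _, vm := range vms {
			if vm == name {
				found = true
			}
		}
	}
	return found
}

// existsByGet issues a single get call for the named VM.
func existsByGet(f *fakeAzure, name string) bool {
	return f.getVM(name)
}

func main() {
	a, b := newFake(50, 4), newFake(50, 4)
	existsByListing(a, "vmss-7-vm-2")
	existsByGet(b, "vmss-7-vm-2")
	fmt.Printf("listing: %d calls, get: %d calls\n", a.calls, b.calls)
}
```

In this model, 50 scale sets cost 100 list calls per uncached check, against a single get call for the per-VM lookup.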
@gcampbell12: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cloud-provider-azure-e2e-ccm-vmss-capz | b48a0964321671b7cf9b83cd2e6ce76c638f655a | link | true | /test pull-cloud-provider-azure-e2e-ccm-vmss-capz |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
PR needs rebase.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.