cluster-api-provider-aws
[WIP] Cache added in AWSMachine and AWSCluster reconcilers
What type of PR is this? /kind feature
What this PR does / why we need it: Adds a cache in the AWSCluster and AWSMachine reconcilers to avoid serving multiple reconcile requests within a debouncing window.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Part of #2794
Special notes for your reviewer: If multiple requests come within the debouncing window, the extra requests are put in a queue and one request is served per debouncing window.
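For illustration only, a minimal Go sketch of that debouncing behaviour (this is not the PR's actual code; names such as `reconcileWindow`, `lastServed`, and `shouldDebounce` are hypothetical): a request for an object that was already served within the current window is parked by requeueing it for when the window expires.

```go
// Hypothetical sketch of the debouncing described above; names are
// illustrative and are not the actual code in this PR.
package debounce

import (
	"sync"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

const reconcileWindow = 30 * time.Second // length of one debouncing window

var (
	mu         sync.Mutex
	lastServed = map[string]time.Time{} // object key -> time its last request was served
)

// shouldDebounce reports whether a request for key arrived inside the current
// debouncing window and, if so, how long to park it before serving it again.
func shouldDebounce(key string) (bool, time.Duration) {
	mu.Lock()
	defer mu.Unlock()

	if last, ok := lastServed[key]; ok {
		if remaining := reconcileWindow - time.Since(last); remaining > 0 {
			return true, remaining
		}
	}
	lastServed[key] = time.Now()
	return false, 0
}

// Inside Reconcile, a parked request is simply requeued for the end of the
// current window instead of being processed immediately.
func debounce(req ctrl.Request) (ctrl.Result, bool) {
	if parked, remaining := shouldDebounce(req.String()); parked {
		return ctrl.Result{RequeueAfter: remaining}, true
	}
	return ctrl.Result{}, false
}
```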
Checklist:
- [ ] squashed commits
- [ ] includes documentation
- [ ] adds unit tests
- [ ] adds or updates e2e tests
Release note:
Added a cache in the AWSMachine and AWSCluster reconcilers to avoid processing multiple requests within the same debouncing window.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
To complete the pull request process, please assign cecilerobertmichon after the PR has been reviewed.
You can assign the PR to them by writing /assign @cecilerobertmichon in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@shivi28 - what happens when you requeue during reconciliation (which may be within the debounce period)?
/test pull-cluster-api-provider-aws-test
Hey @richardcase, if we get multiple requests within the debouncing window, the other requests are parked and served after the current debounce period is over.
/triage accepted /priority important-soon
@shivi28 I think you missed the release note; since this is a feature, it would be good to add one.
/test ?
@richardcase: The following commands are available to trigger required jobs:
/test pull-cluster-api-provider-aws-build
/test pull-cluster-api-provider-aws-test
/test pull-cluster-api-provider-aws-verify
The following commands are available to trigger optional jobs:
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-conformance
/test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
/test pull-cluster-api-provider-aws-e2e-eks
Use /test all to run the following jobs that were automatically triggered:
pull-cluster-api-provider-aws-build
pull-cluster-api-provider-aws-test
pull-cluster-api-provider-aws-verify
In response to this:
/test ?
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-cluster-api-provider-aws-e2e
In CAPZ, they have something called a coalescing reconciler, which is an LRU cache with a TTL: after a successful reconcile, the reconciler result is added to the cache with a TTL, so any reconcile requests within that TTL will not go through actual reconciliation.
Steps (thanks @shysank):
1. Reconcile request comes in.
2. Check whether the cache (a TTL cache) has a value for this request.
3. If yes, return.
4. If no, do the normal reconciliation.
5. On success, add/update the cache entry.
6. On error, just return normally.
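For reference, here is a minimal Go sketch of those steps as a wrapper around an existing reconciler, using a plain in-memory map with expiry times in place of a real TTL/LRU cache. It is not the CAPZ implementation; the names `coalescingReconciler` and `NewCoalescingReconciler` are illustrative.

```go
// A minimal sketch of the coalescing-reconciler steps above: wrap an existing
// reconciler and skip work while a previous successful reconcile of the same
// request is still "fresh".
package coalescing

import (
	"context"
	"sync"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type coalescingReconciler struct {
	inner reconcile.Reconciler
	ttl   time.Duration

	mu      sync.Mutex
	expires map[string]time.Time // request key -> time the cached result expires
}

// NewCoalescingReconciler wraps inner so that successful reconciles are not
// repeated for the same request until ttl has elapsed.
func NewCoalescingReconciler(inner reconcile.Reconciler, ttl time.Duration) reconcile.Reconciler {
	return &coalescingReconciler{inner: inner, ttl: ttl, expires: map[string]time.Time{}}
}

func (r *coalescingReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	key := req.String() // namespace/name of the object being reconciled

	// Steps 2-3: if a successful reconcile for this request is still within
	// its TTL, return without doing the actual reconciliation.
	r.mu.Lock()
	expiry, found := r.expires[key]
	r.mu.Unlock()
	if found && time.Now().Before(expiry) {
		return reconcile.Result{}, nil
	}

	// Step 4: do the normal reconciliation.
	res, err := r.inner.Reconcile(ctx, req)
	if err != nil {
		// Step 6: on error, just return normally (nothing is cached).
		return res, err
	}

	// Step 5: on success, add/update the cache entry.
	r.mu.Lock()
	r.expires[key] = time.Now().Add(r.ttl)
	r.mu.Unlock()

	return res, nil
}
```

A real implementation would also bound the cache size (LRU) and evict expired entries, but the control flow matches the steps above.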
I haven't reviewed the PR yet, but this is for reference. It would be good to have a comparison between the two approaches (if there is any difference).
Hey, I'm a bit confused here. If I'm reading correctly, the original issue description describes a problem that happens because all Machines are suddenly requeued in bulk, creating many different Requests and so overloading the AWS API (as I read it, this is a scalability issue you would also face if, for example, you tried to scale out very aggressively in one go, say from 1 to 500). How does introducing a rate limiter for individual Requests help to solve that problem?
cc @CecileRobertMichon since capz was mentioned as a reference.
In CAPZ we added the coalescing reconciler in an effort to not reconcile the same object too often when reconciliation is successful. Great to see the same improvement being made in CAPA. That being said, if the goal is to limit reconciles for many different Machines being queued at the same time, I don't think this accomplishes that goal.
cc @devigned who was the original author of the cache + coalescing reconciler in CAPZ ref: https://github.com/devigned/cluster-api-provider-azure/commit/b6b38b0e6f81b57e98f9eb933e3e6a9cbf2f5936
Side note: when a PR is heavily inspired by existing code in another repo, please consider giving credit to the original author @shivi28
@CecileRobertMichon summed it up well.
We introduced async reconciliation of resources (not waiting on the API to complete a long-running operation, like creating a VM), but rather checking back in a future reconciliation loop to see if the resource was done provisioning. With async reconciliation, in some cases reconciliations would take 200ms rather than 60+ seconds for a given resource. Since reconciliation loops were taking ~200ms, they could respond to events much faster and issue significantly more API calls / second to Azure. For example, if a controller listened to events from other resources, the activity from a large number of events could cause a high number of calls into the Azure API. The increase in Azure API calls / second caused throttling for us. That's why we introduced a minimum amount of time before a resource is allowed to reconcile after a successful reconciliation, a debouncer. Our key for the TTLRU cache was specific to a single resource.
@CecileRobertMichon @devigned thanks for the feedback.
We introduced async reconciliation of resources (not waiting on the API to complete a long-running operation, like creating a VM), but rather checking back after a bit in a future reconciliation loop to see if the resource was done provisioning.
@devigned We considered doing that, but it is not possible for CAPA because the AWS API does not have Futures support like the Azure API does.
Thanks everyone for your reviews, and apologies for the delay in responding due to async discussions 🙏
This PR puts a limit on reconciliation of the same object; as pointed out by @CecileRobertMichon, it won't help us limit the reconciles for many different Machines being queued at the same time.
@enxebre sorry for the confusion, but this PR doesn't solve the original issue; I raised it as an initial step toward solving the rate-limiting problem by introducing a debouncing window in CAPA. After a PR walkthrough with @sedefsavas and @pydctw, we concluded that we should open a separate discussion on solving the original issue end-to-end and keep this PR on hold until then.
@shivi28: PR needs rebase.
@devigned We considered doing that, but it is not possible for CAPA because the AWS API does not have Futures support like the Azure API does.
In a few recent changes we have requeued instead of calling a Wait* style API.
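To illustrate that pattern, a rough sketch (not the actual CAPA code) using the AWS SDK for Go v1: instead of blocking in a Wait*-style call such as `ec2.WaitUntilInstanceRunning`, the reconciler checks the instance state and returns a `RequeueAfter` so it checks back on a later pass.

```go
// Illustrative only: requeue instead of blocking on a Wait*-style AWS SDK call.
package example

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	ctrl "sigs.k8s.io/controller-runtime"
)

func reconcileInstanceState(ctx context.Context, client *ec2.EC2, instanceID string) (ctrl.Result, error) {
	out, err := client.DescribeInstancesWithContext(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		return ctrl.Result{}, err
	}
	if len(out.Reservations) == 0 || len(out.Reservations[0].Instances) == 0 {
		return ctrl.Result{}, nil
	}

	state := aws.StringValue(out.Reservations[0].Instances[0].State.Name)
	if state == ec2.InstanceStateNamePending {
		// Instead of client.WaitUntilInstanceRunning (which blocks the
		// reconcile loop for up to several minutes), return early and let
		// controller-runtime call us again on the next pass.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}

	return ctrl.Result{}, nil
}
```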
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@shivi28: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cluster-api-provider-aws-build-release-1-5 | 848bc8491aafe9b41cb414697ce2e8a6c91da2a6 | link | true | /test pull-cluster-api-provider-aws-build-release-1-5 |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
In response to this:
/close