cluster-api-provider-aws
[WIP] Cache added in AWSMachine and AWSCluster reconcilers
What type of PR is this? /kind feature
What this PR does / why we need it: Adds a cache in the AWSCluster and AWSMachine reconcilers to avoid serving multiple reconcile requests within a debouncing window.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Part of #2794
Special notes for your reviewer: If multiple requests come within the debouncing window, the extra requests are put in a queue and one request is served per debouncing window.
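For illustration only, a minimal Go sketch of that debouncing behaviour (this is not the PR's actual code; names such as `reconcileWindow`, `lastServed`, and `shouldDebounce` are hypothetical): a request for an object that was already served within the current window is parked by requeueing it for when the window expires.

```go
// Hypothetical sketch of the debouncing described above; names are
// illustrative and are not the actual code in this PR.
package debounce

import (
	"sync"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

const reconcileWindow = 30 * time.Second // length of one debouncing window

var (
	mu         sync.Mutex
	lastServed = map[string]time.Time{} // object key -> time its last request was served
)

// shouldDebounce reports whether a request for key arrived inside the current
// debouncing window and, if so, how long to park it before serving it again.
func shouldDebounce(key string) (bool, time.Duration) {
	mu.Lock()
	defer mu.Unlock()

	if last, ok := lastServed[key]; ok {
		if remaining := reconcileWindow - time.Since(last); remaining > 0 {
			return true, remaining
		}
	}
	lastServed[key] = time.Now()
	return false, 0
}

// Inside Reconcile, a parked request is simply requeued for the end of the
// current window instead of being processed immediately.
func debounce(req ctrl.Request) (ctrl.Result, bool) {
	if parked, remaining := shouldDebounce(req.String()); parked {
		return ctrl.Result{RequeueAfter: remaining}, true
	}
	return ctrl.Result{}, false
}
```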
Checklist:
- [ ] squashed commits
- [ ] includes documentation
- [ ] adds unit tests
- [ ] adds or updates e2e tests
Release note:
Added a cache in the AWSMachine and AWSCluster reconcilers to avoid processing multiple requests within the same debouncing window.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
To complete the pull request process, please assign cecilerobertmichon after the PR has been reviewed.
You can assign the PR to them by writing /assign @cecilerobertmichon in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@shivi28 - what happens when you requeue during reconciliation (which may be within the debounce period)?
/test pull-cluster-api-provider-aws-test
Hey @richardcase, if we get multiple requests within the debouncing window, the other requests are parked and served after the current debounce period is over.
/triage accepted /priority important-soon
@shivi28 I think you missed the release note; since this is a feature, it would be good to add one.
/test ?
@richardcase: The following commands are available to trigger required jobs:
/test pull-cluster-api-provider-aws-build
/test pull-cluster-api-provider-aws-test
/test pull-cluster-api-provider-aws-verify
The following commands are available to trigger optional jobs:
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-conformance
/test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
/test pull-cluster-api-provider-aws-e2e-eks
Use /test all to run the following jobs that were automatically triggered:
pull-cluster-api-provider-aws-build
pull-cluster-api-provider-aws-test
pull-cluster-api-provider-aws-verify
In response to this:
/test ?
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-cluster-api-provider-aws-e2e
In CAPZ, they have something called a coalescing reconciler, which is an LRU cache with a TTL: after a successful reconcile, the reconciler result is added to the cache with a TTL, so any reconcile requests within that TTL will not go through actual reconciliation.
Steps (thanks @shysank):
1. Reconcile request comes in.
2. Check whether the cache (a TTL cache) has a value for this request.
3. If yes, return.
4. If no, do the normal reconciliation.
5. On success, add/update the cache entry.
6. On error, just return normally.
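For reference, here is a minimal Go sketch of those steps as a wrapper around an existing reconciler, using a plain in-memory map with expiry times in place of a real TTL/LRU cache. It is not the CAPZ implementation; the names `coalescingReconciler` and `NewCoalescingReconciler` are illustrative.

```go
// A minimal sketch of the coalescing-reconciler steps above: wrap an existing
// reconciler and skip work while a previous successful reconcile of the same
// request is still "fresh".
package coalescing

import (
	"context"
	"sync"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type coalescingReconciler struct {
	inner reconcile.Reconciler
	ttl   time.Duration

	mu      sync.Mutex
	expires map[string]time.Time // request key -> time the cached result expires
}

// NewCoalescingReconciler wraps inner so that successful reconciles are not
// repeated for the same request until ttl has elapsed.
func NewCoalescingReconciler(inner reconcile.Reconciler, ttl time.Duration) reconcile.Reconciler {
	return &coalescingReconciler{inner: inner, ttl: ttl, expires: map[string]time.Time{}}
}

func (r *coalescingReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	key := req.String() // namespace/name of the object being reconciled

	// Steps 2-3: if a successful reconcile for this request is still within
	// its TTL, return without doing the actual reconciliation.
	r.mu.Lock()
	expiry, found := r.expires[key]
	r.mu.Unlock()
	if found && time.Now().Before(expiry) {
		return reconcile.Result{}, nil
	}

	// Step 4: do the normal reconciliation.
	res, err := r.inner.Reconcile(ctx, req)
	if err != nil {
		// Step 6: on error, just return normally (nothing is cached).
		return res, err
	}

	// Step 5: on success, add/update the cache entry.
	r.mu.Lock()
	r.expires[key] = time.Now().Add(r.ttl)
	r.mu.Unlock()

	return res, nil
}
```

A real implementation would also bound the cache size (LRU) and evict expired entries, but the control flow matches the steps above.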
I haven't reviewed the PR yet, but this is for reference. It would be good to have a comparison between the two approaches (if there is any difference).
Hey, I'm a bit confused here. If I'm reading correctly, the original issue description describes a problem that happens because all Machines are suddenly requeued in bulk, creating many different Requests and so overloading the AWS API (as I read it, this is a scalability issue you would also face if, for example, you tried to scale out very aggressively in one go, say from 1 to 500). How does introducing a rate limiter for individual Requests help to solve that problem?
cc @CecileRobertMichon since capz was mentioned as a reference.
In CAPZ we added the coalescing reconciler in an effort to not reconcile the same object too often when reconciliation is successful. Great to see the same improvement being made in CAPA. That being said, if the goal is to limit reconciles for many different Machines being queued at the same time, I don't think this accomplishes that goal.
cc @devigned who was the original author of the cache + coalescing reconciler in CAPZ ref: https://github.com/devigned/cluster-api-provider-azure/commit/b6b38b0e6f81b57e98f9eb933e3e6a9cbf2f5936
Side note: when a PR is heavily inspired by existing code in another repo, please consider giving credit to the original author @shivi28
@CecileRobertMichon summed it up well.
We introduced async reconciliation of resources (not waiting on the API to complete a long-running operation, like creating a VM), but rather checking back in a future reconciliation loop to see if the resource was done provisioning. With async reconciliation, in some cases reconciliations would take 200ms rather than 60+ seconds for a given resource. Since reconciliation loops were taking ~200ms, they could respond to events much faster and issue significantly more API calls / second to Azure. For example, if a controller listened to events from other resources, the activity from a large number of events could cause a high number of calls into the Azure API. The increase in Azure API calls / second caused throttling for us. That's why we introduced a minimum amount of time before a resource is allowed to reconcile after a successful reconciliation, a debouncer. Our key for the TTLRU cache was specific to a single resource.
@CecileRobertMichon @devigned thanks for the feedback.
We introduced async reconciliation of resources (not waiting on the API to complete a long-running operation, like creating a VM), but rather checking back after a bit in a future reconciliation loop to see if the resource was done provisioning.
@devigned We considered doing that, but it is not possible for CAPA because the AWS API does not have Futures support like the Azure API does.
Thanks everyone for your reviews, and apologies for the delay in responding due to async discussions 🙏
This PR puts a limit on reconciliation of the same object; as pointed out by @CecileRobertMichon, it won't help us limit the reconciles for many different Machines being queued at the same time.
@enxebre sorry for the confusion, but this PR doesn't solve the original issue; I raised it as an initial step toward solving the rate-limiting problem by introducing a debouncing window in CAPA. After a PR walkthrough with @sedefsavas and @pydctw, we concluded that we should open a separate discussion on solving the original issue end-to-end and keep this PR on hold until then.
@shivi28: PR needs rebase.
@devigned We considered doing that, but it is not possible for CAPA because the AWS API does not have Futures support like the Azure API does.
In a few recent changes we have requeued instead of calling a Wait* style API.
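To illustrate that pattern, a rough sketch (not the actual CAPA code) using the AWS SDK for Go v1: instead of blocking in a Wait*-style call such as `ec2.WaitUntilInstanceRunning`, the reconciler checks the instance state and returns a `RequeueAfter` so it checks back on a later pass.

```go
// Illustrative only: requeue instead of blocking on a Wait*-style AWS SDK call.
package example

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	ctrl "sigs.k8s.io/controller-runtime"
)

func reconcileInstanceState(ctx context.Context, client *ec2.EC2, instanceID string) (ctrl.Result, error) {
	out, err := client.DescribeInstancesWithContext(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		return ctrl.Result{}, err
	}
	if len(out.Reservations) == 0 || len(out.Reservations[0].Instances) == 0 {
		return ctrl.Result{}, nil
	}

	state := aws.StringValue(out.Reservations[0].Instances[0].State.Name)
	if state == ec2.InstanceStateNamePending {
		// Instead of client.WaitUntilInstanceRunning (which blocks the
		// reconcile loop for up to several minutes), return early and let
		// controller-runtime call us again on the next pass.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}

	return ctrl.Result{}, nil
}
```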
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@shivi28: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cluster-api-provider-aws-build-release-1-5 | 848bc8491aafe9b41cb414697ce2e8a6c91da2a6 | link | true | /test pull-cluster-api-provider-aws-build-release-1-5 |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
In response to this:
/close