
Switch to exponential backoff while creating/deleting machines

Open hardikdr opened this issue 4 years ago • 4 comments

What would you like to be added: When a machine creation or deletion request fails, MCM keeps retrying the create or delete call for the machine object without increasing the delay between attempts. This can put heavy load on the control cluster's API server and exhaust the cloud provider's API rate limits. We should back off exponentially when requests fail.

Why is this needed:

hardikdr, Jul 04 '20 17:07

/assign @hardikdr @prashanth26
/priority blocker

prashanth26, Sep 29 '20 08:09

/priority normal

We implemented a constant backoff in #525. We should consider a more sophisticated exponential backoff mechanism; a proposal would be nice. I mainly see two options:

  1. Backoff at the queue. An attempt for the machine-set queue: https://github.com/gardener/machine-controller-manager/pull/510
  2. Backoff inside the reconcile function.
    • Maybe something similar to https://github.com/gardener/autoscaler/tree/machine-controller-manager-provider/cluster-autoscaler/utils/backoff .

cc @zuzzas

hardikdr, Oct 08 '20 11:10

Thanks to https://github.com/gardener/machine-controller-manager/pull/525 we can now attach a RateLimitingInterface to the queue, and throttle Machines in CrashLoopBackoff.

  1. I'd take the backoff_manager concept from here.
  2. Create a throttling-by-CrashLoopBackoff function here.
  3. And attach the resulting RateLimitingInterface to the queue here.

Then it's a matter of replacing Add calls with AddRateLimited calls so that the new RateLimiter is actually triggered.

zuzzas, Oct 08 '20 12:10

/title Switch to exponential backoff while creating/deleting machines

prashanth26, Jul 21 '21 05:07