
Understanding Throttling in external-dns

dav9 opened this issue 3 years ago

In a system with a high number of Kubernetes clusters, each running external-dns configured to maintain all records for a single AWS Route53 zone, throttling errors are inevitable with improper configuration. AWS Route53 is used here as the example provider; it has a limit of 5 requests per second per account.

Below is a simplified representation of the requests that will be made in the single-zone case. Since requests are throttled at the account level, not the zone level, having more zones and fewer records per zone increases the number of requests. A follow-up, more rigorous analysis would need to be more accurate and not focus only on AWS Route53.

What happens on the main synchronization loop?

Let's define the quantities used below:

  • Q is the quota, measured in requests per second.
  • Z is the number of zones.
  • S is the number of Services or Ingresses that a cluster has.
  • R is the number of DNS records; each external-dns instance will create R = 3S records: the record itself, the TXT owner record, and an ACME challenge TXT record.
  • I is the number of clusters, and also the number of external-dns instances (one per cluster).
  • D is the number of records that should be deleted, but whose external-dns instances no longer exist.
  • P is the page size for AWS Route53 list requests. The AWS Route53 API docs say 300; the external-dns comment says 100. Note that the Route53 API does not use a cursor, so it is easy to miss records if they change (which they are expected to do).
  • U is the number of batch update requests based on the changes, U = C / PU, where C is the number of changes and PU is the update page size.

RI + D = 3SI + D is the total number of records that will be created by all clusters and stored in the single zone.

When batch change requests are made there is a sleep interval between them, but all other requests are sent in burst mode. This means that if AWS Route53 responds faster than roughly 1 s / (5+1) req ≈ 167 ms per request, external-dns will be throttled. When a throttling event happens, an error is returned to the control loop, and the next run happens at the next scheduled run time.
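
The burst condition above can be sketched as follows; the helper name `will_throttle` and the values are illustrative assumptions, not part of external-dns:

```python
# Sketch of the burst-mode throttling condition; values are assumptions.
QUOTA = 5                      # Route53 requests per second per account (Q)
MIN_SPACING_S = 1.0 / QUOTA    # ~200 ms between requests to stay under Q

def will_throttle(response_latency_s: float) -> bool:
    """If requests are sent back-to-back (burst mode), the effective rate
    is one request per response latency; a latency shorter than
    MIN_SPACING_S means the quota is exceeded."""
    return response_latency_s < MIN_SPACING_S
```

For example, 50 ms responses imply around 20 requests per second, well above the quota, while 300 ms responses stay under it.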

An instance reads the zones, reads the records per zone, and updates records per zone. It will make Z + (3SI + D) / P requests to fetch the current state of Route53, where the Z term accounts for listing the zones. It will then make U update requests, for a total of Z + (3SI + D) / P + U.

Given that there are I instances of external-dns, the total number of requests made will be (Z + (3SI + D) / P + U)I. These requests are not synchronized across the different instances; they vary with each instance's start time. Once all instances have started, each will attempt a roughly full reconciliation loop within the sync interval, so we can estimate requests per second by dividing by T, the sync interval measured in seconds.

total requests per second = (Z + (3SI + D) / P + U)I / T.

And these requests need to be less than the quota Q.

(Z + (3SI + D) / P + U)I / T < Q or T > (Z + (3SI + D) / P + U)I / Q
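
As a rough sanity check, the bound can be computed for concrete numbers; the function name `min_sync_interval` and the example values are illustrative assumptions:

```python
def min_sync_interval(Z, S, I, D, P, U, Q):
    """Lower bound on the sync interval T (seconds) such that
    (Z + (3*S*I + D)/P + U) * I / T stays below the quota Q."""
    requests_per_loop = Z + (3 * S * I + D) / P + U
    return requests_per_loop * I / Q

# Assumed scenario: 1 zone, 100 Services per cluster, 10 clusters,
# no stale records, page size 100, 3 batch updates, quota 5 req/s.
min_sync_interval(Z=1, S=100, I=10, D=0, P=100, U=3, Q=5)  # -> 68.0 seconds
```

So in this assumed scenario, syncing more often than about every 68 seconds would exceed the quota.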

Simplified based on the significance of the terms: T > 3SI^2/Q.

In such a case the sync interval is quadratically sensitive to the number of instances; the other terms can be discarded.

The attached diagram, main-loop-explained.png, explains in more detail what happens when numbers are plugged in.


What happens on the event synchronization?

Recently, event-based synchronization was added to external-dns. This means the above happens not only at interval T: changes in a cluster can also trigger a synchronization loop.

If we take the smallest number in the attached main-loop-explained diagram, 11 requests per event, then E events will generate 11E new requests to Route53.

An interesting effect is that on startup each Service or Ingress will trigger an individual event: "Add RunOnce as the handler function that will be called when ingress/service sources have changed. Note that k8s Informers will perform an initial list operation, which results in the handler function initially being called for every Service/Ingress that exists."

These 'run once' calls are handled once per second, so any events processed within the same second count as one.

As a consequence, Route53 can be throttled for the whole account just by restarting an external-dns instance that watches many resources, or by introducing a high volume of new sources (Services or Ingresses) in a short period of time.
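
To make the restart burst concrete, here is a minimal sketch with assumed numbers; the function name and the per-event cost of 11 requests (taken from the diagram above) are illustrative:

```python
# Sketch of the restart burst described above; values are assumptions.
REQS_PER_EVENT_SYNC = 11   # smallest per-event request count from the diagram
QUOTA = 5                  # Route53 requests per second per account

def restart_request_rate(sync_loops_per_second: float) -> float:
    """Each coalesced batch of events (at most one per second, per the
    text) triggers a sync of roughly 11 Route53 requests."""
    return REQS_PER_EVENT_SYNC * sync_loops_per_second

# Even a single sync loop per second already exceeds the account quota:
restart_request_rate(1.0) > QUOTA  # True: 11 req/s vs a quota of 5 req/s
```

So a restart that keeps generating coalesced event batches every second is enough to exceed the quota on its own.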

Attached is scheduling.png, which illustrates this issue and a possible solution.


Questions and Suggestions for Synchronization on Event

  1. Could the event trigger only an upsert or delete of the affected resource, i.e. act as an edge-triggered event rather than a level-triggered one (a full reconciliation loop)?
  2. Currently the event handler does not receive a handle to the actual object that triggered the change, so this is probably not immediately possible.
  3. The events could also be queued for an interval and only then processed.
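
Suggestion 3 could be sketched as a simple event batcher. Everything here (the `EventBatcher` class, its API) is hypothetical, not the actual external-dns implementation:

```python
import threading
import time

class EventBatcher:
    """Hypothetical sketch: queue events for an interval and trigger a
    single reconciliation per window, instead of one sync per event."""

    def __init__(self, interval_s: float, sync_fn):
        self.interval_s = interval_s
        self.sync_fn = sync_fn      # called at most once per window
        self.pending = 0
        self.lock = threading.Lock()

    def on_event(self) -> None:
        """Record an event; cheap, no provider requests made here."""
        with self.lock:
            self.pending += 1

    def run_window(self) -> int:
        """Wait one batching window, then sync once if anything is
        pending. Returns the number of syncs performed (0 or 1)."""
        time.sleep(self.interval_s)
        with self.lock:
            n, self.pending = self.pending, 0
        if n > 0:
            self.sync_fn()
            return 1
        return 0
```

With an interval of, say, 30 s, a restart that fires hundreds of informer events would still trigger only one reconciliation per window.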

Others

We haven't looked at how the cache interval settings affect the above, among other factors.

General Questions

  1. Is anyone using external-dns in a similar scenario, or has anyone considered it and decided against it?
  2. Has anyone considered coming up with ways to reason about how to configure proper cache times and sync intervals based on these scenarios?

dav9 avatar Feb 14 '22 13:02 dav9

@dav9 if you want to use external-dns under high load with AWS, consider using it with the CloudMap provider.

voro015 avatar May 06 '22 00:05 voro015

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 04 '22 01:08 k8s-triage-robot


/lifecycle rotten

k8s-triage-robot avatar Sep 03 '22 01:09 k8s-triage-robot


/close not-planned

k8s-triage-robot avatar Oct 03 '22 02:10 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:


/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 03 '22 02:10 k8s-ci-robot

Hi @dav9, first of all, thank you for putting together such a detailed explanation of the external-dns process workflow. It's very helpful for diagnosing where the pain point is. We are running exactly the same setup, and we are just getting started on addressing this throttling issue. Have you tried the suggested CloudMap provider, and if so, can you share your experience with us? Thank you!

bbhenry avatar Feb 05 '23 03:02 bbhenry

Hi @bbhenry, we haven't tried it yet, since with the cache settings the algorithm can be made to perform reasonably well.

dav9 avatar Feb 07 '23 07:02 dav9

@dav9, thanks for responding. Which cache settings are you referring to, and do you mind sharing your current settings and how they affect the results?

bbhenry avatar Feb 07 '23 08:02 bbhenry

Hi @dav9,

since with the cache settings

are you talking about --aws-zones-cache-duration=<> setting? Thanks!

s1-evgeny-shmarnev avatar Feb 09 '23 09:02 s1-evgeny-shmarnev

@s1-evgeny-shmarnev Yes, and the --txt-cache-interval. In addition, if you don't have many changes (i.e. many events), you can use triggerLoopOnEvent and increase the sync interval.
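
For reference, the flags mentioned in this thread could be combined roughly like this; the values are illustrative, not recommendations, and should be tuned per the analysis above:

```shell
external-dns \
  --provider=aws \
  --aws-zones-cache-duration=3h \
  --txt-cache-interval=1h \
  --events \
  --interval=30m
```

(`triggerLoopOnEvent` in the Helm chart maps to the `--events` flag on the binary.)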

dav9 avatar Feb 09 '23 14:02 dav9