Why does external-dns poll? Polling causes too many API requests

Open azuretek opened this issue 6 years ago • 48 comments

Is there a reason external-dns is polling? Why not watch the event stream and trigger updates that way? There's no reason to poll on an interval if you can just watch for changes. It would drastically reduce the number of API requests and also be a lot quicker to reflect changes as services and ingresses are deployed.

azuretek avatar Mar 07 '18 03:03 azuretek

At a certain stage it might make sense to integrate "watch" capabilities, but polling is probably required anyway. For example, if External-DNS has not been running for a while, the list of services and ingresses created during that period should still be handled. I am not entirely sure how well Kubernetes handles watching, but a year or so ago I found the API to be buggy.

The problem with "watching" is that we cannot simply make an API call to DNS provider on every single event, because those calls usually cost money and are normally rate limited. So with "watching" we would have to do some aggregation and batching.

We could allow configuring the polling interval to reduce the number of API calls; however, I don't believe "watching" is a better solution to the "problem", especially in big clusters with lots of ingresses and services.

ideahitme avatar Mar 07 '18 10:03 ideahitme

I'm not seeing in the code where the polling is necessary; you can watch the event stream, append changes as they come in, and call submitChanges on the interval that's specified. You're already "batching" the way you described, it's just happening on a set interval.

The main improvement is that you eliminate API calls altogether until a change actually needs to be made.

If you're concerned about a fresh pod not being aware of changes that happened since starting, you can do one initial poll to get the current state and then update as necessary.
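
For illustration, a minimal sketch of that approach using client-go (illustrative only, not external-dns' actual code; submitChanges here is just a placeholder for the existing plan/apply step, and the watch is simplified to a single Service watch with no re-connect handling):

```go
package main

import (
	"context"
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	var (
		mu    sync.Mutex
		dirty bool // set when a relevant event arrives
	)

	// Initial sync so a fresh pod picks up everything created while it wasn't running.
	submitChanges(ctx)

	// Watch Services and only mark state as dirty; no provider API calls here.
	go func() {
		w, err := client.CoreV1().Services("").Watch(ctx, metav1.ListOptions{})
		if err != nil {
			panic(err)
		}
		for ev := range w.ResultChan() {
			if ev.Type == watch.Added || ev.Type == watch.Modified || ev.Type == watch.Deleted {
				mu.Lock()
				dirty = true
				mu.Unlock()
			}
		}
	}()

	// Flush at most once per interval, and only if something actually changed.
	ticker := time.NewTicker(time.Minute)
	for range ticker.C {
		mu.Lock()
		needSync := dirty
		dirty = false
		mu.Unlock()
		if needSync {
			submitChanges(ctx)
		}
	}
}

// submitChanges stands in for the existing plan/apply step that talks to the provider.
func submitChanges(ctx context.Context) {}
```

With this shape, the provider is only contacted when at least one Service changed since the last tick (plus the one initial sync on startup).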

I can contribute the code changes necessary to make this happen if that's a concern.

Just to clarify my issue and why I think this is a major problem: in our environment we use AWS and we have several clusters where external-dns is configured. We have lots of domains, so every time external-dns polls we have at least zones*clusters queries to the AWS API (5 clusters with 20 zones = 100 API calls every minute), even when nothing has changed. This is causing us to hit limits with the AWS API, and the only resolution is to either reduce the number of domains managed by external-dns (requiring an external service to create CNAMEs for us) or to poll less frequently, which directly impacts the speed we can deploy.

azuretek avatar Mar 07 '18 23:03 azuretek

I don't believe it is as simple as you described. With the concepts of ownership and multi-target records, you have to maintain information like who owns the record, whether I can modify the record, etc., either in memory (cache) or via a "get" call to the DNS provider. You want to avoid the latter, but in the case of in-memory storage, you might as well do the diff with the previous state to see if an update is required. I would make this optional and not recommended for use anyway. However, I would love to see a proposal on how to use "watch" first, with a proper description of how external-dns will operate and preserve all the features it currently has.

ideahitme avatar Mar 08 '18 10:03 ideahitme

Are we even talking about the same thing? Are we talking about polling the Kubernetes API or the AWS API? @azuretek mentions hitting the rate limits of AWS. Maybe we should identify the actual problem first before discussing potential solutions or improvements? Is the problem "External DNS hits AWS API rate limits"?

hjacobs avatar Mar 08 '18 10:03 hjacobs

@hjacobs I think he means to use Kubernetes API events to watch for changes and then make the AWS API call, and otherwise stay idle.

Currently the problem is that we fetch the list of records from AWS even if no changes are required, and this is the API call we want to prevent. However, External DNS is smart enough not to "post" changes to the AWS API if no changes were detected.

External DNS hitting AWS API rate limiting is a problem, but I think it should be addressed in other ways, e.g. by caching the result. https://github.com/kubernetes-incubator/external-dns/issues/178

ideahitme avatar Mar 08 '18 10:03 ideahitme

How about having the controller trigger off informers watching Service/Ingress with the informer resync periods set to --interval? Then couple that with fronting the registry with a TTL cache (#178) so fetching records from the provider would occur once per --interval as it does currently.

The resync period/TTL cache would ensure that we maintained the current functionality (i.e. always ensuring state is reconciled between the provider and the cluster at least once per --interval) but would greatly improve the latency of changes in cluster being reflected in the provider.

API rate limits could be handled by exposing a --cache-ttl flag or similar.
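
For illustration, a rough sketch of that shape using shared informers with the resync period set to --interval (simplified and assumed, not external-dns' actual controller code; reconcile is a placeholder for the plan/apply loop):

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The resync period mirrors --interval: even with no events, handlers fire
	// periodically, so provider state is still reconciled at least that often.
	interval := time.Minute
	factory := informers.NewSharedInformerFactory(client, interval)

	trigger := make(chan struct{}, 1)
	notify := func(interface{}) {
		select {
		case trigger <- struct{}{}:
		default: // a sync is already pending; coalesce bursts of events
		}
	}

	factory.Core().V1().Services().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    notify,
		UpdateFunc: func(_, obj interface{}) { notify(obj) },
		DeleteFunc: notify,
	})
	factory.Networking().V1().Ingresses().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    notify,
		UpdateFunc: func(_, obj interface{}) { notify(obj) },
		DeleteFunc: notify,
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	for range trigger {
		reconcile() // plan against a TTL-cached view of provider records (see #178)
	}
}

// reconcile is a placeholder for external-dns' plan/apply loop.
func reconcile() {}
```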

Related: #14

prydie avatar May 09 '18 13:05 prydie

I've run into this when running in an AWS account with a large number of Route53 zones. For whatever reason, it polls zones even if there are no ingress/service/etc manifests referencing those zones. Is there any way (besides filtering on the domain name param) to optimise things so that it doesn't look at zones that aren't relevant to anything configured inside Kubernetes?

(In my case the account had 250+ zones... and with no filter, despite the cluster coming up with maybe a half-dozen records on just a single zone, all 249 other zones are getting scanned, confirmed by looking at CloudTrail logs, resulting in the API throttling so badly it sometimes took 10-20 minutes before external-dns could get any records provisioned.)

For the moment I've worked around it by specifying a whitelist of domains that external-dns can manage, to keep the amount of scanning to a minimum.

jhohertz avatar May 11 '18 15:05 jhohertz

Some things to add to this thread:

Watching on k8s events and batching seems fine but those aren't your only events, yeah? What happens if a record gets modified outside of external-dns' scope? A regular poll as @prydie suggests would still be wise.

@jhohertz to your point I thought that was unintuitive too but external-dns has to delete records too. That said, whitelisting domains is the way to go and that's what we do. We include all our public domains, and then only the private domains for the VPC we're running external-dns in, for each VPC.

Just ranting here, but honestly the problem here is with Amazon's APIs, which I understand we can't easily change... ideally they would give you the ability to post to an SNS topic or something like that when Route53 calls are made so we could watch on AWS events the same as we can on K8s events.

2rs2ts avatar Jun 08 '18 17:06 2rs2ts

We're seeing similar things with the Cloudflare provider.

Our account has approximately 10,000 zones, which means (with the maximum pagination allowed) 200 API calls just to return the zones. --domain-filter dictates that we're only actually interested in two of those zones, and in those zones there are only about 75-100 pages of records.

Cloudflare limits requests to 1,200 per 5 minutes, which with external-dns' default interval of 1m gives room for about 240 requests a minute; based on the above, that means we're hitting the limit (the issue is exacerbated if you reuse client credentials on more than one cluster running external-dns). Increasing the interval is certainly a workaround, but of course it does mean provisioning of services is impacted.

Would it be feasible to restructure things so that --domain-filter is used at the time zones/records are queried from the provider, so only those zones are looked at, rather than just being used to filter records after they have been retrieved from the provider? Or are there other considerations needed?
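
For illustration, the suggested restructuring would look roughly like this (the Zone type and listRecordsInZone helper are hypothetical stand-ins, not the actual Cloudflare or external-dns API):

```go
package sketch

import "strings"

// Zone and listRecordsInZone are hypothetical stand-ins for a provider client;
// they are not the actual Cloudflare or external-dns API.
type Zone struct {
	ID   string
	Name string
}

// zoneMatchesFilter reports whether a zone name falls under one of the
// configured --domain-filter suffixes.
func zoneMatchesFilter(zone string, domainFilters []string) bool {
	for _, f := range domainFilters {
		f = strings.TrimPrefix(f, ".")
		if zone == f || strings.HasSuffix(zone, "."+f) {
			return true
		}
	}
	return false
}

// recordsForFilteredZones pages through records only for zones that match the
// filter, instead of listing every record and filtering afterwards.
func recordsForFilteredZones(zones []Zone, domainFilters []string) []string {
	var records []string
	for _, z := range zones {
		if !zoneMatchesFilter(z.Name, domainFilters) {
			continue // no record-listing API calls for irrelevant zones
		}
		records = append(records, listRecordsInZone(z.ID)...)
	}
	return records
}

// listRecordsInZone is a placeholder for the paginated provider call.
func listRecordsInZone(zoneID string) []string { return nil }
```

The zone list itself would still have to be fetched, but the per-zone record listing (where most of the pages come from) would be skipped for zones outside the filter.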

Evesy avatar Feb 27 '19 11:02 Evesy

Do you confirm that this is happening with the latest version released (v0.5.11)?

Raffo avatar Feb 27 '19 11:02 Raffo

Correct, 0.5.11

Evesy avatar Feb 27 '19 12:02 Evesy

@Evesy it won't solve your problem completely, but we've been using the new --events flag introduced in this pull-request as a way to significantly reduce the number of regular poll calls to our provider, while actually improving our provisioning time by combining --events with a long --interval. In our scenario, out-of-band DNS changes are unlikely, so it has been working well for us.

jlamillan avatar Feb 27 '19 17:02 jlamillan

There are several key problems, from looking things over and testing on larger AWS deployments:

  • Large hosted zones and low intervals often hit API transaction limits within AWS. Having event-driven updates and using a longer --interval= is ideal to reduce that impact - glad to see it's being worked on in part.
  • Large hosted zones also mean that each of the operations above can still hit a rate limit. Ideally, could an optional sleep() between each API call (or groups of them) during a 'scrape' of hosted-zone contents be used to make that operation gentler on AWS? For example, add a setting like --aws-operation-throttle_ops=25/s for larger deployments to cap API operations per second (a sketch of such a throttle follows this list).
  • Applied filters should constrain which API calls/queries are performed - so if I only want .domain.com, don't walk over other hosted zones serving other domains.
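
A rough sketch of that kind of client-side throttle using golang.org/x/time/rate (illustrative only; the flag above is just a proposal, and the names here are made up rather than taken from external-dns):

```go
package sketch

import (
	"context"

	"golang.org/x/time/rate"
)

// throttledLister wraps provider "list" calls with a client-side rate limiter
// so a full scrape of many hosted zones stays under, e.g., 25 operations/second.
type throttledLister struct {
	limiter *rate.Limiter
}

func newThrottledLister(opsPerSecond float64) *throttledLister {
	return &throttledLister{limiter: rate.NewLimiter(rate.Limit(opsPerSecond), 1)}
}

func (t *throttledLister) listZonePage(ctx context.Context, zoneID, pageToken string) error {
	// Wait blocks until the limiter allows another call, spacing out API requests.
	if err := t.limiter.Wait(ctx); err != nil {
		return err
	}
	return callProviderListAPI(ctx, zoneID, pageToken) // placeholder for the real AWS call
}

// callProviderListAPI is a placeholder for one paginated Route53 list request.
func callProviderListAPI(ctx context.Context, zoneID, pageToken string) error { return nil }
```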

rtkgjacobs avatar Apr 09 '19 12:04 rtkgjacobs

In our environment, we too are hitting rate limits on AWS. I have already increased our AWS retries to 10, although now I am considering 13 with a much longer interval. We have added --events support to combat the longer interval, but that too can be rate limited, which puts us back into the same situation. There are a few different features that I am thinking about:

  1. A separate retry interval on incomplete loops. With a larger interval we cannot wait hours for a retry; there should be a separate back-off for this type of situation (a sketch of one follows this list).

  2. Caching through the plan/apply process would reduce the total call count by 2, and in the best case by 3. This is where I see a quick win for something simple to implement. I would have liked to use the cache support, but that has issues in the face of failures, so I am going to avoid it. I realize this creates two "caching" solutions, but I view one as safer than the other.

  3. Handling multiple k8s clusters. This would also help greatly, but it is the most amount of change and even I don't want to go down this path yet.
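
A rough sketch of the separate back-off from point 1 (illustrative only; the durations and function names are assumptions, not existing external-dns behaviour):

```go
package sketch

import (
	"context"
	"time"
)

// syncWithRetry runs one reconciliation loop and, if it fails part-way
// (e.g. due to throttling), retries with an exponential back-off instead of
// waiting for the next full --interval.
func syncWithRetry(ctx context.Context, sync func(context.Context) error) error {
	backoff := 30 * time.Second // much shorter than a multi-hour interval
	const maxBackoff = 10 * time.Minute

	for {
		if err := sync(ctx); err == nil {
			return nil
		}
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```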

fraenkel avatar Apr 24 '19 20:04 fraenkel

In our case we settled for one AWS account per cluster. Putting even just two k8s clusters on the same AWS account easily triggers the default rate limit. Thankfully we don't have that many so it's manageable this way. It also provides us with greater isolation and accounting across clusters so it's not like we did this solely for external-dns, but just saying...

tsuna avatar Apr 25 '19 01:04 tsuna

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Jul 24 '19 02:07 fejta-bot

/remove-lifecycle stale

tbarrella avatar Jul 26 '19 06:07 tbarrella

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Oct 24 '19 07:10 fejta-bot

/remove-lifecycle stale

george-angel avatar Oct 25 '19 07:10 george-angel

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Jan 23 '20 07:01 fejta-bot

/remove-lifecycle stale

george-angel avatar Jan 23 '20 08:01 george-angel

FYI, support for --events flag has been merged to master, which triggers a sync loop when an Ingress/Service is added, updated, or deleted.

jlamillan avatar Feb 06 '20 02:02 jlamillan

@jlamillan will this be made into 0.5.19?

stevefan1999-personal avatar Feb 09 '20 08:02 stevefan1999-personal

maybe this can be closed? Current release is v0.7.1

Though I'd like to request that flags like these are exposed somewhere. They don't appear to be documented anywhere.

ghostsquad avatar Apr 07 '20 19:04 ghostsquad

I think so. The --events flag is available starting in v0.6.0. The setting is also available (as triggerLoopOnEvent) in version 2.18.0+ of the Helm chart for external-dns.

jlamillan avatar Apr 07 '20 19:04 jlamillan

Hey everyone. 0.7.2 should fix this issue, as the polling interval is preserved even if --events is used (i.e. synchronization happens as soon as an event happens, but no more than once per interval). Could you confirm it's fixed?
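
An illustrative sketch of that behaviour (not the actual 0.7.2 code; it assumes incoming events are coalesced upstream, e.g. via a buffered channel of size one):

```go
package sketch

import "time"

// rateLimitedSync runs a sync whenever an event arrives, but never more often
// than once per interval.
func rateLimitedSync(events <-chan struct{}, interval time.Duration, sync func()) {
	var lastSync time.Time
	for range events {
		if wait := interval - time.Since(lastSync); wait > 0 {
			time.Sleep(wait) // enforce the minimum spacing between provider syncs
		}
		sync()
		lastSync = time.Now()
	}
}
```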

sheerun avatar Jun 04 '20 11:06 sheerun

I changed --interval to 3m and set the --events flag, yet I am getting

time="2020-06-25T11:14:21Z" level=error msg="Throttling: Rate exceeded\n\tstatus code: 400,

What's the recommended --interval value to mitigate this?

ipochi avatar Jun 25 '20 11:06 ipochi

@ipochi What version of external-dns are you using?

sheerun avatar Jun 27 '20 17:06 sheerun

@sheerun 0.7.2-debian-10-r20

ipochi avatar Jun 27 '20 17:06 ipochi

Could you try to write up steps to reproduce with the Docker image?

sheerun avatar Jun 28 '20 07:06 sheerun