external-dns

Cloudflare 5XX responses crash pod

Open kevinfrommelt opened this issue 1 year ago • 8 comments

What happened: Cloudflare temporarily returned 5XXs for zone lookup calls, which caused the pods to crash.

What you expected to happen: Log an error without crashing and try again later.

How to reproduce it (as minimally and precisely as possible): Likely can't without mocking the API responses from Cloudflare.

Anything else we need to know?: Here are some relevant logs: [image: Screenshot 2024-11-14 at 9 32 25 AM]

Environment:

  • External-DNS version (use external-dns --version): 0.14.2
  • DNS provider: cloudflare
  • Others:

kevinfrommelt avatar Nov 14 '24 16:11 kevinfrommelt

It is also crashing on rate limits for zone lookups.

zone fd823e13e494d5430aea9dfd4311a6e1 lookup failed, exceeded available rate limit retries
Failed to do run once: exceeded available rate limit retries

kevinfrommelt avatar Nov 14 '24 18:11 kevinfrommelt

Experiencing these crashes as well.

glaberge avatar Dec 27 '24 13:12 glaberge

/help

Although not directly related, here is an example solution for another provider: https://github.com/kubernetes-sigs/external-dns/pull/4573

Ideally this will require a global rate limiter.

ivankatliarchuk avatar Feb 01 '25 20:02 ivankatliarchuk

@ivankatliarchuk: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Feb 01 '25 20:02 k8s-ci-robot

/assign

ivankatliarchuk avatar Feb 01 '25 20:02 ivankatliarchuk

Relates https://github.com/kubernetes-sigs/external-dns/issues/5225

ivankatliarchuk avatar Apr 05 '25 12:04 ivankatliarchuk

Hi,

I am also experiencing this issue with the Cloudflare provider when it is rate limited. The fatal error message is "Failed to do run once: exceeded available rate limit retries", which crashes the pod.

I have read #5225 about soft errors, and was surprised that this is not handled as a soft error, since there is a line of code that explicitly checks for it:

https://github.com/kubernetes-sigs/external-dns/blob/017f7687ca393138b565e51757a6fa8010902066/provider/cloudflare/cloudflare.go#L278

This function, internal to https://github.com/cloudflare/cloudflare-go, checks that the error type is ErrorTypeRateLimit, declared as ErrorTypeRateLimit ErrorType = "rate_limit".

However, the error I get is generated by this line

https://github.com/cloudflare/cloudflare-go/blob/57714bfbdeea095ec27f4b9bafe65bda4f178d96/cloudflare.go#L266

and is returned before these static error types are used:

https://github.com/cloudflare/cloudflare-go/blob/57714bfbdeea095ec27f4b9bafe65bda4f178d96/cloudflare.go#L286

I think the problem is upstream in https://github.com/cloudflare/cloudflare-go: this kind of error should be reported as a rate limit error, not a generic one.

I will open an issue on their side, but in the meantime I don't know whether we can have a solution in this repo, except maybe accepting more errors as "soft errors" on our side.

dixneuf19 avatar Apr 25 '25 14:04 dixneuf19

I have submitted a PR to fix this issue for rate limiting errors: https://github.com/kubernetes-sigs/external-dns/pull/5524

Hackatosh avatar Jun 13 '25 08:06 Hackatosh