
If one source fails, continue to the next source instead of bailing

Open olemarkus opened this pull request 2 years ago • 9 comments

Description

Related to #3008: when external-dns is configured with multiple sources and one of them fails, all the other sources are skipped as well. This PR logs the error and continues to the next source rather than aborting the run completely.

olemarkus avatar Sep 07 '22 13:09 olemarkus
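To illustrate the behavior this PR describes, here is a minimal Go sketch of gathering endpoints from several sources while skipping a failing one. The Endpoint and Source types below are simplified stand-ins for external-dns's endpoint.Endpoint and source.Source, not the project's actual code:

```go
package main

import (
	"context"
	"errors"
	"log"
)

// Endpoint is a simplified stand-in for external-dns's endpoint.Endpoint.
type Endpoint struct {
	DNSName string
}

// Source is a simplified stand-in for external-dns's source.Source interface.
type Source interface {
	Endpoints(ctx context.Context) ([]*Endpoint, error)
}

// endpointsLogAndContinue collects endpoints from every source; a failing
// source is logged and skipped instead of aborting the whole run.
func endpointsLogAndContinue(ctx context.Context, sources []Source) []*Endpoint {
	var all []*Endpoint
	for _, s := range sources {
		eps, err := s.Endpoints(ctx)
		if err != nil {
			log.Printf("skipping source after error: %v", err)
			continue
		}
		all = append(all, eps...)
	}
	return all
}

// staticSource is a trivial Source used only for this illustration.
type staticSource struct {
	eps []*Endpoint
	err error
}

func (s staticSource) Endpoints(context.Context) ([]*Endpoint, error) { return s.eps, s.err }

func main() {
	sources := []Source{
		staticSource{eps: []*Endpoint{{DNSName: "a.example.org"}}},
		staticSource{err: errors.New("ambassador host source failed")},
		staticSource{eps: []*Endpoint{{DNSName: "b.example.org"}}},
	}
	// Both a.example.org and b.example.org are still returned even though
	// the second source failed.
	for _, ep := range endpointsLogAndContinue(context.Background(), sources) {
		log.Println("got endpoint:", ep.DNSName)
	}
}
```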

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: olemarkus. Once this PR has been reviewed and has the lgtm label, please assign raffo for approval by writing /assign @raffo in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Sep 07 '22 13:09 k8s-ci-robot

@olemarkus in general I don't like the idea of ignoring an error in one source. Consider the controller update scenario where you run current version A and update to version B. In A, all sources M, N, and K work, but in version B source M does not work for some reason. I think the safe thing is to fail fast in this case.

szuecs avatar Sep 07 '22 13:09 szuecs

Is that to increase the visibility of the failure?

In our case, where the ambassador host source failed, it broke all the other sources as well. This means it's very easy for one cluster tenant to break external-dns for everyone else. Of course, if the sources behave more nicely, such as ambassador hosts after #3008, then it makes sense to fail faster. Maybe.

olemarkus avatar Sep 07 '22 15:09 olemarkus

If you, for example, deploy a new version of external-dns, it should either work or break, but it should not delete records if one source cannot be fetched, which would likely break half of the cluster's ingress. So my point is that it's more severe to break partially without notice than to fail completely. Operators can alert on whether the control loop pod is in the right state and will notice if it's crash-looping. Then the logs will show what caused the break, and they can get it fixed or manually stop using the source if that is applicable for their environment.

szuecs avatar Sep 07 '22 18:09 szuecs

I could agree if it did crash-loop, but it doesn't. It just logs and restarts the iteration. It ends up on this line: https://github.com/kubernetes-sigs/external-dns/blob/master/controller/controller.go#L295

It may very well be that one should os.Exit(1) if the controller returns an error all the way up to there.

olemarkus avatar Sep 07 '22 19:09 olemarkus
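A minimal sketch of the alternative olemarkus points at above: treating a reconcile error as fatal so the pod crash-loops instead of quietly retrying. The loop and the runOnce name are illustrative stand-ins, not the real controller code linked above:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// runOnce stands in for a single controller reconcile iteration; here it
// always fails so the control flow is easy to see.
func runOnce(ctx context.Context) error {
	return errors.New("source failed to list endpoints")
}

func main() {
	ctx := context.Background()
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		<-ticker.C
		if err := runOnce(ctx); err != nil {
			// Before: only log the error and start the next iteration, so
			// the pod stays Running and the failure is easy to miss.
			//   log.Println(err)
			// After the change discussed here: exit non-zero so the pod
			// crash-loops and operators notice from its state.
			log.Fatal(err)
		}
	}
}
```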

@olemarkus great catch, that one should be a fatal! @Raffo @njuettner can you check if you agree here (last comment and the linked source, error -> fatal)?

szuecs avatar Sep 07 '22 19:09 szuecs

How do we proceed with this?

olemarkus avatar Sep 14 '22 11:09 olemarkus

@olemarkus do you mean the change to fatal?

szuecs avatar Sep 14 '22 22:09 szuecs

Either this PR or change to fatal, yes.

olemarkus avatar Sep 15 '22 04:09 olemarkus

Fatal would be preferred

szuecs avatar Nov 10 '22 13:11 szuecs

Done.

olemarkus avatar Nov 20 '22 14:11 olemarkus

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 18 '23 14:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 20 '23 15:03 k8s-triage-robot

/remove-lifecycle rotten

olemarkus avatar Mar 23 '23 06:03 olemarkus

/lgtm

johngmyers avatar May 07 '23 21:05 johngmyers

/approve

szuecs avatar May 08 '23 11:05 szuecs

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus, szuecs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar May 08 '23 11:05 k8s-ci-robot

This seems to be partially responsible for #3754. I think this change should be reverted. What's the point of a reconciling controller if it exits and crashes on every error it encounters?

alfredkrohmer avatar Aug 29 '23 08:08 alfredkrohmer

Agreed, this completely breaks the controller. So if any error occurs, the controller just blows up, after never doing that since its inception? Terrible idea.

jbilliau-rcd avatar Aug 29 '23 13:08 jbilliau-rcd

As mentioned in the first comment, my original PR was to log and carry on, but this approach was preferred instead. In the long run I do think this approach is better, but it may be a bit painful in the beginning, as some sources/providers may propagate errors as part of an otherwise acceptable path.

olemarkus avatar Sep 09 '23 07:09 olemarkus

Sorry, I'm confused... how is this approach better? As far as I can tell, the controller is completely broken now and non-functional, as ANY issue with creating DNS records causes the entire thing to blow up. We have a ton of clusters that have multiple teams and multiple apps on them. As it is today, if one batch of records is erroring, it doesn't affect anything else: other teams can still deploy Ingress objects and have their records created. Now, if ONE batch of records is incorrect, the entire controller ceases to function.

To make an analogy, imagine if AWS functioned so that if someone created an S3 bucket with a bad bucket policy, no one else could create an S3 bucket because the S3 console would crash and become non-functional. How would that ever be a good idea?

jbilliau-rcd avatar Sep 11 '23 13:09 jbilliau-rcd

I agree: fail fast is great for a single request, but in the case of a live-forever controller it would make more sense to log and continue. There's no guarantee that all good resources will be serviced before the bad resource forces an exit -- the controller is orphaning good resources.

I will be downgrading my external-dns and not participating in this flavor of execution. Please consider making the fail-fast behavior opt-out via a command-line argument.

xrl avatar Nov 15 '23 21:11 xrl
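For the opt-out xrl suggests, here is a hedged sketch of how a switch between the two behaviors could look; the --fail-fast flag name is hypothetical and is not an existing external-dns option:

```go
package main

import (
	"errors"
	"flag"
	"log"
)

// reconcile stands in for a single controller iteration; it fails here so
// both code paths are easy to see.
func reconcile() error {
	return errors.New("one source failed")
}

func main() {
	// Hypothetical flag: not an option that external-dns actually exposes.
	failFast := flag.Bool("fail-fast", true,
		"exit on the first reconcile error instead of logging and continuing")
	flag.Parse()

	if err := reconcile(); err != nil {
		if *failFast {
			log.Fatal(err) // crash-loop so operators notice via pod state
		}
		log.Println("reconcile failed, continuing:", err) // log-and-continue
	}
}
```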

I just wanted to add my two cents on this. We are using AWS EKS and Route53. I deployed a set of Kubernetes Ingress objects with annotations to create records, and external-dns created them without issue. However, on the next run of updates, the external-dns pod kept crashing with a fatal error complaining that those same records already exist. I wasted hours troubleshooting this problem before finding this PR.

The external-dns controller should not even be complaining about records already existing, let alone completely crashing, especially when it created those records in the first place. This is a bad design choice, and this PR should either be reverted entirely or reworked so that the controller does not explode when it finds records that it created.

I have pinned my external-dns chart version to v1.12.2 until this bug is fixed.

ctwilleager-alio avatar Mar 12 '24 19:03 ctwilleager-alio