external-dns
If one source fails, continue to the next source instead of bailing
Description
Related to #3008: when external-dns is configured with multiple sources and one of them fails, all other sources are skipped as well. This PR logs the error and continues to the next source rather than aborting the whole run.
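For illustration, here is a minimal sketch of the behavior this PR proposes, written against a simplified aggregation of sources. The `Endpoint` type, `Source` interface, and source implementations below are placeholders for this sketch, not external-dns's actual definitions.

```go
package main

import (
	"context"
	"errors"
	"log"
)

// Endpoint and Source are simplified stand-ins for external-dns's types.
type Endpoint struct {
	DNSName string
	Target  string
}

type Source interface {
	Endpoints(ctx context.Context) ([]Endpoint, error)
}

// multiSource aggregates endpoints from several sources. Instead of
// returning on the first error (which drops every other source's records
// for that iteration), it logs the failure and moves on.
type multiSource struct {
	sources []Source
}

func (m *multiSource) Endpoints(ctx context.Context) ([]Endpoint, error) {
	var result []Endpoint
	for _, s := range m.sources {
		eps, err := s.Endpoints(ctx)
		if err != nil {
			// Log and continue to the next source instead of bailing.
			log.Printf("skipping source due to error: %v", err)
			continue
		}
		result = append(result, eps...)
	}
	return result, nil
}

// brokenSource simulates e.g. a failing ambassador-host source.
type brokenSource struct{}

func (brokenSource) Endpoints(context.Context) ([]Endpoint, error) {
	return nil, errors.New("source unavailable")
}

// staticSource returns a fixed set of endpoints.
type staticSource struct{ eps []Endpoint }

func (s staticSource) Endpoints(context.Context) ([]Endpoint, error) {
	return s.eps, nil
}

func main() {
	ms := &multiSource{sources: []Source{
		brokenSource{},
		staticSource{eps: []Endpoint{{DNSName: "app.example.org", Target: "1.2.3.4"}}},
	}}
	eps, _ := ms.Endpoints(context.Background())
	log.Printf("collected %d endpoints despite one failing source", len(eps))
}
```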
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: olemarkus
Once this PR has been reviewed and has the lgtm label, please assign raffo for approval by writing /assign @raffo in a comment. For more information see: The Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.
@olemarkus in general I don't like the idea of ignoring an error in one source. Consider the controller-update scenario where you run current version A and update to version B. In A all sources M, N, and K work, but in version B source M does not work for some reason. I think the safe thing is to fail fast in this case.
Is that to increase the visibility of the failure?
In our case, where the ambassador host source failed, it broke all the other sources as well. This means it's very easy for one cluster tenant to break external-dns for everyone else. Of course, if the sources behave more nicely, such as ambassador hosts after #3008, then it makes sense to fail faster. Maybe.
If you, for example, deploy a new version of external-dns, it should either work or break, but it should not delete records because one source cannot be fetched, which would likely break half of the cluster's ingress. So my point is that it's more severe to break partially without notice than to fail completely. Operators can alert on whether the control-loop pod is in the right state and will notice if it is crash looping. The logs will then show what caused the break, and they can get it fixed or manually stop using that source if that is applicable for their environment.
I could agree if it did crashloop. But it doesn't. It just logs and restarts the iteration. It ends up on this line: https://github.com/kubernetes-sigs/external-dns/blob/master/controller/controller.go#L295
It may very well be that one should call os.Exit(1) if the controller returns an error all the way in there.
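For context, a minimal sketch of the two behaviors being discussed, assuming a simplified run loop rather than the actual controller/controller.go code: today the loop only logs the error from RunOnce and starts the next iteration; making it fatal would terminate the process so the pod crash loops and the failure becomes visible.

```go
package main

import (
	"context"
	"time"

	log "github.com/sirupsen/logrus"
)

// Controller is a simplified stand-in for external-dns's controller.
type Controller struct {
	Interval time.Duration
}

// RunOnce stands in for one full reconciliation across sources and provider.
func (c *Controller) RunOnce(ctx context.Context) error {
	// ... fetch endpoints, compute plan, apply changes ...
	return nil
}

func (c *Controller) Run(ctx context.Context) {
	ticker := time.NewTicker(c.Interval)
	defer ticker.Stop()
	for {
		if err := c.RunOnce(ctx); err != nil {
			// Current behavior at the linked line: log and retry next tick.
			log.Error(err)
			// Proposed alternative: treat the error as fatal so the process
			// exits and the pod crash loops instead of failing silently.
			// log.Fatal(err)
		}
		select {
		case <-ticker.C:
		case <-ctx.Done():
			log.Info("Terminating main controller loop")
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	(&Controller{Interval: time.Second}).Run(ctx)
}
```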
@olemarkus great catch, that one should be a fatal! @Raffo @njuettner can you check if you agree here (last comment and linked source, error -> fatal)?
How do we proceed with this?
@olemarkus do you mean the change to fatal?
Either this PR or change to fatal, yes.
Fatal would be preferred
Done.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: olemarkus, szuecs
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [szuecs]
Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.
This seems to be partially responsible for #3754. I think this change should be reverted. What's the point of a reconciling controller if it exits and crashes on every error it encounters?
Agreed, this completely breaks the controller. So if any errors occur, the controller just blows up, after never doing that since its inception? Terrible idea.
As mentioned in the first comment, my original PR was to log and carry on, but this approach was preferred instead. In the long run I do think this approach is better, but it may be a bit painful in the beginning, as some sources/providers may propagate errors as an acceptable path.
Sorry, I'm confused... how is this approach better? As far as I can tell, the controller is completely broken now and non-functional, as ANY issue with creating DNS records causes the entire thing to blow up. We have a ton of clusters that have multiple teams and multiple apps on them. As it is today, if one batch of records is erroring, it doesn't affect anything else: other teams can still deploy Ingress objects and have their records created. Now, if ONE batch of records is incorrect, the entire controller ceases to function.
To make an analogy: imagine if AWS functioned so that, if someone created an S3 bucket with a bad bucket policy, no one else could then create an S3 bucket because the S3 console would crash and become non-functional. How would that ever be a good idea?
I agree, fail fast is great on a single request but in the case of a live-forever controller it would make more sense to log and continue. There's no guarantee that all good resources will be serviced before the bad resource forces an exit -- the controller is orphaning good resources.
I will be downgrading my external-dns and not participating in this flavor of execution. Please consider making the fail-fast opt-out through a command line argument.
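To make that ask concrete, here is a hypothetical sketch of such an opt-out. The --continue-on-source-error flag name and the handleRunOnceError helper are invented for illustration and are not actual external-dns options.

```go
package main

import (
	"errors"
	"flag"

	log "github.com/sirupsen/logrus"
)

// Hypothetical flag: when set, reconciliation errors are logged and the
// controller keeps running; when unset, the current fatal behavior applies.
var continueOnSourceError = flag.Bool("continue-on-source-error", false,
	"log reconciliation errors and keep running instead of exiting (hypothetical)")

// handleRunOnceError would be called with the result of each reconciliation.
func handleRunOnceError(err error) {
	if err == nil {
		return
	}
	if *continueOnSourceError {
		log.Errorf("reconciliation failed, retrying next interval: %v", err)
		return
	}
	log.Fatal(err)
}

func main() {
	flag.Parse()
	// With --continue-on-source-error this only logs; without it,
	// the process exits with a non-zero status.
	handleRunOnceError(errors.New("example source failure"))
}
```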
I just wanted to add my two cents on this. We are using AWS EKS and Route53. I deployed a set of Kubernetes Ingress objects with annotations to create records and external-dns created them without issue. However, on the next run of updates, the external-dns pod kept crashing with a fatal error complaining that those same records already exist. I wasted hours troubleshooting this problem before finding this PR.
The external-dns controller should not even be complaining about records already existing, let alone completely crashing, especially when it created those records in the first place. This is a bad design choice and this PR should either be reverted entirely or reworked so that the controller does not explode when it finds records that it created.
I have pinned my external-dns chart version to v1.12.2 until this bug is fixed.