anycast_healthchecker

Feature request: withdraw advertisements on shutdown

Open and0x000 opened this issue 3 years ago • 4 comments

Is there a way to tell anycast-healthchecker to withdraw all announcements on a clean shutdown? Similar to purge_ip_prefixes but on exit?

My scenario is that I want to be able to perform maintenance without interrupting any service too much. Routers may take some time to converge after an announcement changes. So if I shut down a healthchecked service, it takes a few seconds before the healthchecker notices that the service is unavailable, and then some more time until traffic no longer hits the system.

For a smooth transition my approach is to first withdraw all the routes on a system before shutting down any service.

Doing so by shutting down anycast-healthchecker looks cleanest to me. Everything else I can think of would mean messing with the healthchecker, which would probably try to fix the configuration.

and0x000 • Oct 19 '21 13:10

For your use-case the easiest and fastest way is to stop the bird daemon. It will yield what you want. Bird daemon is stopped during the shutdown process, so you don't need to do much with anycast-healthchecker.

unixsurfer • Oct 19 '21 14:10

My NOC people get a little twitchy when BGP sessions are down, so I'd avoid taking them down for most of my use scenarios.

Bird daemon is stopped during the shutdown process, so you don't need to do much with anycast-healthchecker.

Yes, but that may cause service interruption as described above. Routes may not have converged into the routers' ASICs, and traffic may still hit the machine while no service is up to respond.

From my point of view, an additional parameter on the checks would do the trick. Probably on_exit, similar to on_disabled.

on_exit => "withdraw" -> disable ip_prefix on exit. This requires iterating over all checks in the shutdown method. If you don't see any problem with this, I'll try to put it into code (albeit Python not really being my native language) and start a PR.
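A minimal sketch of the idea, not the project's actual code: the `Check` class, the `shutdown` function, and the accepted values "none" and "withdraw" are all hypothetical names chosen for illustration. It only computes which prefixes would stay advertised; a real implementation would presumably also have to update Bird's configuration and reload Bird, which is left out here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Assumed option values; "none" means "leave the prefix alone on exit".
VALID_ON_EXIT = {"none", "withdraw"}


@dataclass
class Check:
    """Hypothetical stand-in for the settings of one service check."""
    name: str
    ip_prefix: str
    on_exit: str = "none"

    def __post_init__(self) -> None:
        # Reject unknown values early instead of silently ignoring them.
        if self.on_exit not in VALID_ON_EXIT:
            raise ValueError(f"{self.name}: invalid on_exit {self.on_exit!r}")


def prefixes_to_withdraw(checks: Iterable[Check]) -> List[str]:
    """Collect the prefixes of all checks that ask for withdrawal on exit."""
    return [c.ip_prefix for c in checks if c.on_exit == "withdraw"]


def shutdown(checks: Iterable[Check], advertised: Set[str]) -> Set[str]:
    """Return the prefixes that remain advertised after exit handling."""
    remaining = set(advertised)  # copy, so the caller's set is untouched
    for prefix in prefixes_to_withdraw(checks):
        remaining.discard(prefix)  # no error if it was not advertised
    return remaining


if __name__ == "__main__":
    checks = [
        Check("dns", "10.0.0.1/32", on_exit="withdraw"),
        Check("web", "10.0.0.2/32"),  # default: stays advertised
    ]
    print(shutdown(checks, {"10.0.0.1/32", "10.0.0.2/32"}))
```

With these two example checks, only the prefix of the check marked "withdraw" is dropped; the other one stays advertised, so existing setups would behave as before unless they opt in.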

and0x000 • Oct 20 '21 09:10

My NOC people get a little twitchy when BGP sessions are down, so I'd avoid taking them down for most of my use scenarios.

I never had a problem with this approach. If the NOC has issues when a BGP session is terminated, then something is wrong: terminating a BGP session is a normal operational task and it shouldn't cause trouble, only an alert.

Bird daemon is stopped during the shutdown process, so you don't need to do much with anycast-healthchecker.

Yes, but that may cause service interruption as described above. Routes may not have converged into the routers' ASICs, and traffic may still hit the machine while no service is up to respond.

You can avoid this scenario with correct systemd ordering for the Bird systemd service. I have Bird configured to start last on boot and stop first on shutdown to avoid the scenario you describe.

From my point of view, an additional parameter on the checks would do the trick. Probably on_exit, similar to on_disabled.

on_exit => "withdraw" -> disable ip_prefix on exit. This requires iterating over all checks in the shutdown method. If you don't see any problem with this, I'll try to put it into code (albeit Python not really being my native language) and start a PR.

Having an on_exit parameter per service check makes sense. It should have a default value of none, which does nothing.

I will try to cook something this weekend, let's see if I manage to find time for it.

unixsurfer • Oct 20 '21 20:10

I never had a problem with this approach. If the NOC has issues when a BGP session is terminated, then something is wrong: terminating a BGP session is a normal operational task and it shouldn't cause trouble, only an alert.

You are right, but the alert is something I'd like to avoid for most use cases.

Having an on_exit parameter per service check makes sense. It should have a default value of none, which does nothing.

I will try to cook something this weekend, let's see if I manage to find time for it.

I cobbled together a pull request, but my Python is far from good. It's mostly copy/paste from your existing code with some Stack Overflow sprinkled over it. It works, but it's probably not very clean Python. Feel free to adjust my code where there are more elegant ways.

and0x000 • Oct 21 '21 14:10