Improve multi-WAN failover resiliency: multiple IP monitoring per gateway before taking down, and auto DHCP renewal when gateway comes back up (when using virtual Linux Bridges from Proxmox as interfaces)

Open kwand opened this issue 2 years ago • 1 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [X] I have read the contributing guide lines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
  • [X] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

Is your feature request related to a problem? Please describe.

As you may have heard, there was a recent nationwide outage in Canada with one of our telecoms (Rogers), with very weird behaviours during the service restoration process. Thankfully, I have a second, slower connection to a different ISP (Bell); I was finally convinced to set up multi-WAN failover yesterday and link that other connection to my OPNsense box.

However, I noticed some potential flaws in the process (which seem to be shared by many others in the pfSense/OPNsense community in real-world failures), as well as problems when switching back from the secondary gateway (Bell) when the primary (Rogers) finally came back up this morning:

  1. Single monitor IP per gateway. This seems a bit problematic as:
    • If only the monitor IP were to go down (for example, I'm using Google's DNS servers), the gateway would be marked as down even though everything else is working. I haven't heard of Google's DNS servers going down recently, but it's certainly not impossible, and Cloudflare's DNS did experience an outage just last month (which forced me to add Google's DoT addresses in Unbound to restore internet access). If Google's DNS were to go down, my gateways would be marked down, which would make the fact that I had both Google's and Cloudflare's DNS servers in Unbound (to act as backups should either go down) rather pointless.
    • It's been suggested to use the first hop as the monitor IP instead of DNS servers, but I've been told that it could change at any time, which makes this unreliable. Using the ISP's DNS servers is not much of a solution either, as I'm using Unbound (and not using the ISP's DNS anywhere): a DNS failure at my ISP should technically no longer affect me, yet with this solution it would bring my gateway down.
    • Furthermore, monitoring anything within my ISP is problematic as for several hours during the outage, I was able to connect to my ISP but nothing outside of their network. I would have wanted opnsense to switch to my backup gateway in this case.
  2. DHCP renewal failure when gateway restored. This probably has more to do with the fact that I'm running OPNsense as a VM in Proxmox, as I experience the same problem whenever I unplug the cables from the ports; the interfaces are always marked as up since they are virtual Linux Bridges in Proxmox. However, I would still like there to be an option to do a DHCP renewal if a gateway goes down and comes back up.
    • The particular failure I experienced when my primary ISP's gateway came back up is that its modem had previously assigned the interface a local IP (10.0.0.47), as the modem had failed to receive an IPv4 WAN address from my ISP. I'm told that adding the modem's IP to the DHCP reject setting on my primary WAN interface would fix this, but since I'm pretty sure I experienced the same problem from unplugging cables (as the interfaces are always marked up), I'm not sure it would resolve the issue.
    • Currently, only a full reboot (or reloading the services) fixes this.

Describe the solution you like

  1. Use multiple monitor IPs per gateway to decide whether it is up or down. There have been multiple issues opened in both pfSense and opnsense repos/forums (see #4163, https://redmine.pfsense.org/issues/1189, https://forum.opnsense.org/index.php?topic=27355.0, https://forum.netgate.com/topic/84721/hack-for-multiple-ips-for-gateway-monitoring/2, etc.) over the years as well as multiple solutions being proposed (and by others more knowledgeable than me, so I will leave the details to them.) Though, I particularly like this solution proposed in the pfSense forums years ago:

B brainloss Sep 15, 2015, 5:31 AM

I would like to see a "proper" solution. Single IP monitoring is causing us no end of issues. Gateways being marked as down, but really the monitor IP has dissapeared, or ICMP is blocked but real world taffic tcp/udp is flowing perfectly.

My concept would include many IP's and have some weighted rules. Something like www.policyd-weight.org comes to mind. This would allow a list of say 20 IP's to monitor and allow for x number to be down and some marked as higher "number value" than others, then only mark the gateway as down if the sum of these values is below y. Could even use the same IPs for many gateways and if one ip down on one gateway the IP can be checked against another gateway.

I have no development skills, but would be willing to test and give feedback.

–Paul

  2. Allow for an option in the Gateway settings to reload the DHCP services when a gateway goes down and comes back up.
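
To make the weighted idea from the quoted forum post a bit more concrete, here is a minimal sketch of the decision rule only: each monitor IP carries a weight, and the gateway counts as up while the summed weight of reachable targets is at or above a threshold. The names `MonitorTarget` and `gateway_is_up`, the weights and the threshold are all hypothetical; this is not how OPNsense, pfSense or mwan3 actually implement monitoring, and the probing itself (ICMP or otherwise) is left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class MonitorTarget:
    address: str  # IP probed through the gateway (illustrative values below)
    weight: int   # how much this target counts towards the score when reachable


def gateway_is_up(targets: Iterable[MonitorTarget],
                  reachable: Dict[str, bool],
                  threshold: int) -> bool:
    """Return True if the summed weight of reachable targets meets the threshold.

    `reachable` maps an address to the result of its latest probe; a target
    with no recorded result is treated as unreachable.
    """
    score = sum(t.weight for t in targets if reachable.get(t.address, False))
    return score >= threshold


# Hypothetical configuration: three public resolvers with assumed weights.
targets = [
    MonitorTarget("8.8.8.8", 2),
    MonitorTarget("1.1.1.1", 2),
    MonitorTarget("9.9.9.9", 1),
]

# A single monitor IP failing does not take the gateway down (score 3 >= 3).
print(gateway_is_up(targets, {"8.8.8.8": False, "1.1.1.1": True, "9.9.9.9": True}, threshold=3))

# No target answering marks the gateway as down (score 0 < 3).
print(gateway_is_up(targets, {}, threshold=3))
```

With equal weights and a threshold of 1 this reduces to "down only if every monitor IP fails", which would already cover the single-monitor-outage case described in the problem section.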

Describe alternatives you considered

I'm sure that I could put together 'hacks' using scripts and cron jobs to achieve what I want, but I would much prefer a GUI solution as this doesn't seem too complicated to implement. (Personally, I don't have much experience with networking CLI tools and FreeBSD. I'm more comfortable with OpenWRT (especially their uci command) and Linux networking - there's actually also a very easy way to achieve 1) with mwan3 in OpenWRT)
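
For reference, the logic such a script-level workaround for 2) would need is small. The sketch below only detects a down-to-up transition of a gateway and calls a hook once per recovery; `GatewayWatcher` and `on_recover` are hypothetical names, and the hook is deliberately a placeholder because the actual command to renew a DHCP lease on a given system is not specified here.

```python
from typing import Callable, Optional


class GatewayWatcher:
    """Calls `on_recover` once each time the gateway goes from down to up."""

    def __init__(self, on_recover: Callable[[], None]) -> None:
        self._on_recover = on_recover
        self._was_up: Optional[bool] = None  # None until the first observation

    def update(self, is_up: bool) -> None:
        # Fire only on a down -> up edge, not while the gateway stays up
        # and not on the very first observation.
        if self._was_up is False and is_up:
            self._on_recover()
        self._was_up = is_up


# Placeholder hook: a real script would trigger a DHCP renewal here.
renewals = []
watcher = GatewayWatcher(on_recover=lambda: renewals.append("renew"))

# Simulated monitor results: up, down, down, up, up.
for state in [True, False, False, True, True]:
    watcher.update(state)

print(renewals)  # ['renew']
```

Note that this assumes the gateway state comes from monitoring (for example a ping result) rather than from the interface link state, since with Proxmox Linux Bridges the link is always reported as up.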

Running opnsense on bare metal to fix 2) (so interface link states are properly reported) is not an option as I also run transparent OpenWRT VMs for CAKE traffic shaping (due to the fact that it isn't available in FreeBSD yet)

Additional context

See:

  • https://github.com/opnsense/core/issues/4163 (Closed from timeout)
  • https://redmine.pfsense.org/issues/1189
  • https://www.reddit.com/r/PFSENSE/comments/s6vp9s/multiple_monitor_ips_per_gateway/
  • https://forum.opnsense.org/index.php?topic=27355.0
  • https://forum.netgate.com/topic/84721/hack-for-multiple-ips-for-gateway-monitoring
  • OpenWRT's solution: https://openwrt.org/docs/guide-user/network/wan/multiwan/mwan3

kwand • Jul 11 '22 00:07

Hi, I'm currently investigating whether my problem correlates with yours: one of my WAN connections is a Zyxel 5G modem (NR7101), which failed twice in the middle of the night over the last few days and didn't recover by itself. I tried a few things (unplugging and restarting the modem...); the last resort was to reboot the firewall. Perhaps it has to do with DHCP renewal; I must investigate further.

ThomasTr • Jul 28 '22 08:07

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.

OPNsense-bot • Jan 07 '23 00:01