[ BE 23.10.2, 23.10.3, 24.4.2 ] Automatic fail-over to a Fallback Gateway fails [WORKAROUND]
Original title was: [ BE 23.10.2 ] "Dual Stack" IPv4 + (dynamic) IPv6: Automatic fail-over to a Fallback Gateway fails
Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
- [x] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
- [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue
Describe the bug
OPNsense Business Edition 23.10.2, "Dual Stack" IPv4 + (dynamic) IPv6: automatic fail-over to another IPv4 Fallback Gateway fails; a reboot is needed.
To Reproduce
Primary Network:
[ FttB ] M-Net Premium IP MGA "100 Mbit/s"
[ VDSL ] PPPoE via VLAN tag 40
"Dual Stack":
IPv4: fixed
IPv6: dynamic
providing two gateways:
MNET_PPPOE (IPv4)
MNET_DHCP6 (IPv6)
Secondary Network:
[ W-Lan Router ]
[ VDSL ] "1 und 1" (via Deutsche Telekom)
"DS-Lite"
providing one "Fallback" gateway:
FB_GWv4 (IPv4)
The configuration was created following
- "Practical_OPNsense__4th_2023"
- https://docs.opnsense.org/
o) Boot - fully connected:
Lobby:Dashboard:
MNET ^ (up)
Fallback ^ (up)
System:Gateways:Single
MNET_PPPOE (IPv4) (active) 101 (upstream) Online
MNET_DHCP6 (IPv6) (active) 102 (upstream) Offline <--- !!!
FB_GWv4 (IPv4) (active) 111 (upstream) Online
via ssh: Routing:
Default >>> MNET_PPPOE (IPv4)
Default >>> MNET_DHCP6 (IPv6)
Notabene:
default fe80::aaaa:bbbb:cc UG pppoe1
via ssh: dpinger:
/usr/local/bin/dpinger -f -S -r 0 -i Fb_GWv4 ...
/usr/local/bin/dpinger -f -S -r 0 -i MNET_PPPOE ...
but no dpinger directed at <--- !!!
Monitor-IP for MNET_DHCP6 (IPv6) <--- !!!
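The check above (which gateways actually have a dpinger monitor) can be scripted. The following is only a minimal sketch: it assumes that the gateway name is passed to dpinger via `-i`, as the two command lines above show, and the helper names `monitored_gateways` and `unmonitored` are made up for illustration. The sample text is the shortened `ps` output from this report.

```python
import re

# Shortened `ps auxwww | grep dpinger` output, as quoted in this report.
PS_OUTPUT = """\
/usr/local/bin/dpinger -f -S -r 0 -i Fb_GWv4 ...
/usr/local/bin/dpinger -f -S -r 0 -i MNET_PPPOE ...
"""

# Gateways configured in this setup (names as they appear in the dpinger lines).
CONFIGURED = ["MNET_PPPOE", "MNET_DHCP6", "Fb_GWv4"]


def monitored_gateways(ps_output: str) -> set[str]:
    """Return the gateway names handed to dpinger through its -i option."""
    names = set()
    for line in ps_output.splitlines():
        if "dpinger" not in line:
            continue
        match = re.search(r"\s-i\s+(\S+)", line)
        if match:
            names.add(match.group(1))
    return names


def unmonitored(configured: list[str], ps_output: str) -> list[str]:
    """List configured gateways without a running dpinger instance."""
    running = monitored_gateways(ps_output)
    return [name for name in configured if name not in running]


print(unmonitored(CONFIGURED, PS_OUTPUT))  # ['MNET_DHCP6']
```

On a live system the text would come from the real `ps` output instead of the embedded sample.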
o) Disconnect DSL-MoDem <---> OPNsense-FW
System:Gateways:Single
MNET_DHCP6 (IPv6) (active) 102 (upstream) Offline <--- !!!
FB_GWv4 (IPv4) (active) 111 (upstream) Online
MNET_PPPOE (IPv4) defunct (upstream) Offline
via ssh: Routing:
Default >>> MNET_DHCP6 (IPv6) <--- !!!
==> DNS lookup (IPv4) fails
All of the "usual suspects" solutions suggested in several earlier forum threads fail; MNET_DHCP6 does not change to "defunct"
o) Reboot helps - but strange:
Lobby:Dashboard:
MNET ^ (up) <--- !!!
via ssh: Routing:
No Default >>> IPv6
Default >>> Fallback (IPv4)
o) Re-connect:
Lobby:Dashboard:
MNET ^ (up)
System:Gateways:Single
MNET_PPPOE (IPv4) (active) 101 (upstream) Online
MNET_DHCP6 (IPv6) (active) 102 (upstream) Offline <--- !!!
FB_GWv4 (IPv4) (active) 111 (upstream) Online
Restore <- Fallback successful:
Default >>> MNET_PPPOE (IPv4)
Default >>> MNET_DHCP6 (IPv6)
o) Reboot:
Same as beginning - as to be expected:
|: . . . :| ad lib
Expected behavior
In order to secure the (remote) connection, automatic fail-over should not need a Reboot to function properly.
Describe alternatives you considered
Screenshots
Relevant log files
Additional context
Environment
OPNsense Business Edition 23.10.2-amd64
Might be relevant:
dpinger fails to restart after a PPPoE link recycle https://forum.opnsense.org/index.php?topic=37948.msg186016#msg186016
dpinger restart required to recover WAN https://forum.opnsense.org/index.php?action=post;quote=191750;topic=39149.0
dpinger : -> issues: none https://github.com/dennypage/dpinger
system: adjust to dpinger reality in latency/loss handing #6231
The IPv6 gateway is reported as offline if the WAN interface has a ULA address. #6939
Hint: In order to make the above work at least via reboot, I had to configure fixed (*) addresses in System: Settings: General: "DNS Servers" and, under "DNS server options", disable "Allow DNS server list to be overridden by DHCP/PPP on WAN".
(*) dns0.eu, primary and secondary, IPv4 and IPv6: 193.110.81.0, 185.253.5.0, 2a0f:fc80::, 2a0f:fc81::
What's the actual issue? I don't see any logs. It's probably not finding an address to monitor from?
Core Issue: Problem with (dynamic) IPv6 gateway handling, esp. default route handling
As described, changing from completely working to a broken primary connection:
Name: MNET_DHCP6 (active) <----- ! it should not be any more
Interface: MNet
Protocol: IPv6
Priority: 192 (upstream) <----- ! This should be "defunct" !
Gateway: -
Monitor IP: 2a02:2e0:3fe:1001:302:: <----- not available any more
RTT: 0.0 ms
RTTd: 0.0 ms
Loss: 100 %
Status: Offline
Description: Interface MNET_DHCP6 Gateway
This surviving gateway keeps its default IPv6 route: default fe80::46ec:ceff:fe UG pppoe1
Name: MNET_PPPOE : Priority: defunct (upstream)
The broken IPv4 default route has been correctly removed.
Name: Fb_GWv4 (active) Priority: 11 (upstream) Status: Online
The corresponding IPv4 default route does not get created! <----- !
Thus, for all IPv4 based local networks, there is "no way out".
Only a REBOOT leaves the broken IPv4 default route removed and creates the correct IPv4 default route in its place, yielding a working Fallback (IPv4) connection.
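The "no way out" state can be spotted mechanically by looking for `default` entries per address family in the routing table. The sketch below assumes the usual FreeBSD `netstat -rn` layout with `Internet:` and `Internet6:` sections. The embedded table is illustrative: only the IPv6 default line is taken from this report (with its shortened address), the loopback line is a placeholder, and the function names are hypothetical.

```python
# Illustrative `netstat -rn` excerpt for the broken state described above.
NETSTAT_OUTPUT = """\
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
127.0.0.1          link#4             UH          lo0

Internet6:
Destination        Gateway            Flags     Netif Expire
default            fe80::46ec:ceff:fe UG       pppoe1
"""


def default_routes(netstat_output: str) -> dict[str, list[str]]:
    """Map each address family to the interfaces carrying a default route."""
    routes: dict[str, list[str]] = {"Internet": [], "Internet6": []}
    family = None
    for line in netstat_output.splitlines():
        stripped = line.strip()
        if stripped in ("Internet:", "Internet6:"):
            family = stripped.rstrip(":")
            continue
        fields = stripped.split()
        # Columns: Destination, Gateway, Flags, Netif
        if family and len(fields) >= 4 and fields[0] == "default":
            routes[family].append(fields[3])
    return routes


def families_without_default(netstat_output: str) -> list[str]:
    """Return the address families that currently have no default route."""
    routes = default_routes(netstat_output)
    return [family for family, netifs in routes.items() if not netifs]


print(default_routes(NETSTAT_OUTPUT))
print(families_without_default(NETSTAT_OUTPUT))
```

For the state described above this reports a surviving IPv6 default via pppoe1 and no IPv4 default at all.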
logs: Please, be so kind to specify your requests.
Even after the REBOOT:
Name: MNET_DHCP6 (active) <----- ! it should not be !
Priority: 192 (upstream) <----- ! This should be "defunct" !
System: Log Files: General would probably be a good start.
grep MNET_PPPOE /var/log/gateways/latest.log | grep -v "sendto error:"
grep MNET_DHCP6 /var/log/gateways/latest.log | grep -v "sendto error:"
2024-03-18T12:52:28+01:00
Both monitors had correctly detected the outage: first "Alarm: none -> loss", then "Alarm: loss -> down".
Let's switch to a more productive and streamlined effort and start with the log please.
! Mid-Air collision !
Not wanting to copy'n'paste that here - how am I supposed to extract all that info displayed from your web page into a file, presumably in order to attach that with [ @ ] ?
Relief: Just found the tiny little "download selection" button at the very end ;-)
Just need these lines to begin with:
# opnsense-log | grep skipping
Cheers, Franco
opnsense-log | grep skipping
<12>1 2024-03-18T12:11:56+01:00 scrat.maknit-sendling.de opnsense-business 26800 - [meta sequenceId="78"] /interfaces.php: The required MNET_DHCP6 IPv6 interface address could not be found, skipping.
<12>1 2024-03-18T13:15:57+01:00 scrat.maknit-sendling.de opnsense-business 311 - [meta sequenceId="326"] /usr/local/etc/rc.bootup: The required MNET_DHCP6 IPv6 interface address could not be found, skipping.
<12>1 2024-03-18T13:15:57+01:00 scrat.maknit-sendling.de opnsense-business 311 - [meta sequenceId="329"] /usr/local/etc/rc.bootup: The required MNET_PPPOE IPv4 interface address could not be found, skipping.
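To see at a glance which gateway monitors were skipped and when, the three lines above can be parsed as follows. This is a minimal sketch: the regular expression is fitted to exactly these log lines and may need adjusting for other versions, and `skipped_monitors` is a hypothetical helper name.

```python
import re

# The three `opnsense-log | grep skipping` lines quoted above.
LOG = '''\
<12>1 2024-03-18T12:11:56+01:00 scrat.maknit-sendling.de opnsense-business 26800 - [meta sequenceId="78"] /interfaces.php: The required MNET_DHCP6 IPv6 interface address could not be found, skipping.
<12>1 2024-03-18T13:15:57+01:00 scrat.maknit-sendling.de opnsense-business 311 - [meta sequenceId="326"] /usr/local/etc/rc.bootup: The required MNET_DHCP6 IPv6 interface address could not be found, skipping.
<12>1 2024-03-18T13:15:57+01:00 scrat.maknit-sendling.de opnsense-business 311 - [meta sequenceId="329"] /usr/local/etc/rc.bootup: The required MNET_PPPOE IPv4 interface address could not be found, skipping.
'''

# Groups: timestamp, reporting script, gateway name, address family.
PATTERN = re.compile(
    r"^<\d+>1 (\S+) .*?\] (\S+): The required (\S+) (IPv[46]) "
    r"interface address could not be found, skipping\.$"
)


def skipped_monitors(log: str) -> list[tuple[str, str, str, str]]:
    """Return (timestamp, script, gateway, family) for every skipped monitor."""
    found = []
    for line in log.splitlines():
        match = PATTERN.match(line)
        if match:
            found.append(match.groups())
    return found


for timestamp, script, gateway, family in skipped_monitors(LOG):
    print(f"{timestamp}  {gateway} ({family}) skipped by {script}")
```

For the lines above this shows MNET_DHCP6 skipped twice (once from /interfaces.php, once during bootup) and MNET_PPPOE skipped once during bootup.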
Ok that's the issue... now when you have a running dpinger for MNET_DHCP6 what does it use?
# ps auxwww | grep MNET_DHCP6
Also, rerunning the monitor init probably fixes it temporarily?
# pluginctl monitor
It should bring back the missing monitor. With this out of the way we could start looking for a reason.
Cleared the log and re-booted with DSL-MoDem attached, for a clean start, IPv6 working:
. # ping -6 heise.de <--- Monitor-IP
PING6(56=40+8+8 bytes) 2001:a61:2a0a:e906:20e:cff:febc:7262 --> 2a02:2e0:3fe:1001:302::
16 bytes from 2a02:2e0:3fe:1001:302::, icmp_seq=1 hlim=58 time=10.823 ms
0:00.00 /usr/local/bin/dpinger -f -S -r 0 -i MNET_DHCP6 -B 2001:a61:2a0a:e906:20e:cff:febc:7262 -p /var/run/dpinger_MNET_DHCP6.pid -u /var/run/dpinger_MNET_DHCP6.sock -s 1s -l 4s -t 60s -d 0 2a02:2e0:3fe:1001:302::
DETACHING the cable between DSL-MoDem and OPNsense FW now: . # date: Mon Mar 18 15:04:56 CET 2024
0:00.04 /usr/local/bin/dpinger -f -S -r 0 -i MNET_DHCP6 -B 2001:a61:2a0a:e906:20e:cff:febc:7262 -p /var/run/dpinger_MNET_DHCP6.pid -u /var/run/dpinger_MNET_DHCP6.sock -s 1s -l 4s -t 60s -d 0 2a02:2e0:3fe:1001:302::
. # pluginctl monitor
Setting up gateway monitors...done.
RESULT: as before: . . . "faulty" IPv6 default route . . . no IPv4 default route
REBOOT: . # date: Mon Mar 18 15:11:07 CET 20
. # ps auxwww | grep dpinger
root 38220 0.0 0.0 13340 2508 - Is 15:13 0:00.01 /usr/local/bin/dpinger -f -S -r 0 -i Fb_GWv4 -B 192.168.8.99 -p /var/run/dpinger_Fb_GWv4.pid -u /var/run/dpinger_Fb_GWv4.sock -s 1s -l 4s -t 60s -d 0 192.168.178.1
Just discovered that a reboot seems to reset "System: Log Files: General"
- REBOOT
- DETACH
- Save the Log: --> system__003.log
- RESET to re-gain a connection via Fallback
- Save the Log: --> system__004.log
HTH !
It's very hard to follow again. Attached files are not helpful either :(
? I just provided copies of the results of the commands you requested; the attached files are the "System: Log Files: General" you requested in first place.
{ Will be back in an hour, approx. }
I know, but I said I'd like to keep this focused, and we use GitHub to record all information in text. Attaching files and posting other metrics while working on a diagnosis is detrimental to the support flow.
{ Sorry - TelCo took longer }
Understand. - I'd be happy to grep the parts of your interests and cite them 'in-line'.
M.Net contracts supply: IPv4 (fixed) + IPv6 (dynamic)
Telekom contracts supply: IPv4 (fixed) + IPv6 prefix (fixed)
Thus I ran the following comparison using
Interfaces: [Test_IPv6]
Identifier: opt5
Device: em1
IPv6 Configuration Type: Track Interface
IPv6 Interface: MNet
COMPARISON:
A) Interfaces: [MNet] : Request only an IPv6 prefix: [ ]
pppoe1: No inet6 2001: ... assigned
em1: inet6 2001: ... assigned <----- difference
Internet6:
default fe80::46ec:ceff:fe UG pppoe1
2001:AAA:BBBB:... link#6 U em1
... /usr/local/bin/dpinger -f -S -r 0 -i MNET_PPPOE ...
... /usr/local/bin/dpinger -f -S -r 0 -i Fb_GWv4 ...
B) Interfaces: [MNet] : Request only an IPv6 prefix: [x]
pppoe1: No inet6 2001: ... assigned
em1: No inet6 2001: ... assigned <----- difference
Internet6:
default fe80::c242:d0ff:fe UG pppoe1
2001:aaa:bbbb:cccc localhost USB lo0 <----- in addition
2001:aaa:bbbb:cccc link#6 U em1
2001:aaa:bbbb:cccc link#6 UHS lo0 <----- in addition
redirector.heise.d fe80::xxxx:yyyy:zz UGHS pppoe1 <----- in addition
... /usr/local/bin/dpinger -f -S -r 0 -i MNET_PPPOE ...
... /usr/local/bin/dpinger -f -S -r 0 -i MNET_DHCP6 ... <----- in addition
... /usr/local/bin/dpinger -f -S -r 0 -i Fb_GWv4 -B ...
Probably unrelated:
Concerning https://docs.opnsense.org/manual/radvd.html: In my case, no "» Services » Router Advertisements" submenu is being offered.
.# ps aux | grep radvd
... /usr/local/sbin/radvd -p /var/run/radvd.pid -C /var/etc/radvd.conf -m syslog
.# cat /var/etc/radvd.conf
# Automatically generated, do not edit
# Generated RADVD config for dhcp6 assignment from wan on opt5
interface em1 {
AdvSendAdvert on;
AdvLinkMTU 1492;
AdvManagedFlag on;
AdvOtherConfigFlag on;
prefix 2001:a61:aaaa:bbbb::/64 {
DeprecatePrefix on;
AdvOnLink on;
AdvAutonomous on;
};
RDNSS 2001:a61:2a0e:5b06:20e:cff:febc:7262 { };
DNSSL maknit-sendling.de { };
};
In both cases, "wieistmeineip.de" states (translated from German):
Your IPv4 address is: xxx.xxx.xxx.xxx
Your IPv6 address is: not available <----- !
Test IPv4: OK
Test IPv6: failed
Test Dual Stack: OK
Digging /var/log/system (excerpt):
/usr/local/etc/rc.newwanipv6: Failed to detect IP for interface wan
/usr/local/etc/rc.newwanipv6:
list ($ip) = interfaces_primary_address6($interface);
if (!is_ipaddr($ip)) {
log_msg("Failed to detect IP for interface {$interface}", LOG_INFO);
return;
}
Following the path: finally leading to #7202
Furthermore:
/system_gateways_edit.php: The command '/sbin/route delete -inet6 '2a02:2e0:3fe:1001:302::'' returned exit code '1', the output was 'route: route has not been found delete host 2a02:2e0:3fe:1001:302:: fib 0: not in table'
ERGO:
1.) Tick "Interfaces: [MNet]" -> "Request only an IPv6 prefix"
2.) "System: Gateways: Single" -> Remove the manual "Monitor IP" entry
grep /var/log/system: neither of the two entries above ("Failed to detect IP" and the failing "route delete") has shown up since.
But, DETACHING the cable between DSL-MoDem and OPNsense FW again:
/usr/local/etc/rc.newwanip: Failed to detect IP for interface wan
/usr/local/etc/rc.newwanipv6: Failed to detect IP for interface wan
unfortunately
MNET_DHCP6 (active) MNet IPv6 102 (upstream) 0.0 ms 0.0 ms 100.0 % Offline
still survives, as well as its corresponding routing entries
Internet6:
default fe80::c242:d0ff:fe UG pppoe1
2001:a61:2a12:6400 localhost USB lo0
2001:a61:2a12:6406 link#6 U em1
2001:a61:2a12:6406 link#6 UHS lo0
blocking DNS via IPv4. All three dpinger still running.
# pluginctl monitor
does not change the situation!
REBOOT:
yields a FW working IPv4 perfectly across its Fallback gateway, with
MNET_DHCP6 (active) MNet IPv6 102 (upstream) ~ ~ ~ Offline
and only one dpinger on Fb_GWv4, as to be expected.
RE-CONNECT:
Instantaneously and flawlessly delivers full connection again,
not another reboot needed,
three dpingers running,
MNET_DHCP6 (active) MNet IPv6 102 (upstream) fe80::c242:d0ff:fe91:abc0 fe80::c242:d0ff:fe91:abc0 6.2 ms 2.5 ms 0.0 % Online
The core problem still remains: Not only detect a broken dynamic IPv6 connection, but also handle the broken MNET_DHCP6 gateway as defunct and in consequence remove the obsolete IPv6 default route as well as the also obsolete IPv6 assignments.
Notabene:
As long as the broken MNET_DHCP6 keeps blocking, even an IPv6-capable Fallback would not remedy the situation, regardless of whether that Fallback provides "fixed" or "dynamic" IPv6.
The situation might be different if the primary connection provides "fixed" IPv6. Unfortunately, ATM, I do not have such a test bed available.
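The behaviour asked for above can be stated as a small selection rule: per address family, the default route should follow the online gateway with the lowest priority value, and a family with no online gateway should have no default route. The sketch below only models that expectation with the priorities from this report (101, 102, 111). It is not OPNsense's actual implementation, and the `Gateway` class and `pick_default` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gateway:
    name: str
    family: str    # "inet" for IPv4, "inet6" for IPv6
    priority: int  # lower value is preferred
    online: bool


def pick_default(gateways: list[Gateway], family: str) -> Optional[Gateway]:
    """Pick the preferred online gateway of a family, or None if all are down."""
    candidates = [g for g in gateways if g.family == family and g.online]
    return min(candidates, key=lambda g: g.priority, default=None)


# State after detaching the cable between DSL modem and firewall.
after_detach = [
    Gateway("MNET_PPPOE", "inet", 101, online=False),
    Gateway("MNET_DHCP6", "inet6", 102, online=False),
    Gateway("FB_GWv4", "inet", 111, online=True),
]

ipv4 = pick_default(after_detach, "inet")
ipv6 = pick_default(after_detach, "inet6")
print("IPv4 default via:", ipv4.name if ipv4 else "none")
print("IPv6 default via:", ipv6.name if ipv6 else "none")
```

Under this rule the detached state would end up with an IPv4 default via FB_GWv4 and no IPv6 default, which is the opposite of what is observed without a reboot.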
@emesterhazy: Thank you
@fichtner: You have marked #7202 for Milestone 24.7. Will an update for BE take till 24.10 then?
Even worse:
Expecting a solid workaround, I restricted the setup completely to
the (fixed) IPv4 connection only:
Interfaces: [MNet] : IPv6 Configuration Type: None
Interfaces: [Test_IPv6] : IPv6 Configuration Type: None
and consequently deleted the MNET_DHCP6 Gateway as well.
Alas: after DETACHING the cable between DSL-MoDem and OPNsense FW, the broken MNET_PPPOE IPv4 default route got removed, but, again, no default route was created at all; thus Fb_GWv4 (active), although correctly promoted to highest priority, remains useless.
Neither "pluginctl monitor" nor waiting patiently for 7 minutes helps.
Only after a REBOOT, the expected default route via Fb_GWv4 is available.
Very irritating: in Lobby: Dashboard: Interfaces, "MNet" is still displayed as ^ (up)!
Interfaces: Overview: -> MNet interface (wan, pppoe1):
Status up
PPPoE up
Eureka! A very minimalist workaround, still needing manual intervention, but avoiding any need to reboot and any need to hack via ssh:
A) "Broken Primary Connection, switch to Fallback Connection"
A.0) Remove any broken IPv6 assignments from Internet6 Routing tables:
Interfaces: Overview: MNet interface (wan, pppoe1): -> DHCP: "DHCPv6 up": -> "Release"
A.1) Create new default routing across Fallback Interface:
System: Gateways: Single: -> Fb_GWv4 (active) -> "Edit": just -> "Save"
System: Gateways: Single: -> "Apply changes"
B) "Primary Connection working again, switch-back from Fallback Connection"
B.1) Bring back MNET_PPPOE and re-create missing IPv6 (default) routing:
Interfaces: MNet: just -> "Save"
Interfaces: MNet: -> "Apply changes"
Tested with Intel® Ethernet Server Adapters I350 - T{2, 4} v2:
- via detaching the cable between DSL-MoDem and OPNsense FW
- via re-attaching the cable between DSL-MoDem and OPNsense FW
You can repeat this ad libitum, if you like, without any reboot at all. Do not be irritated if pages like "wieistmeineip.de" need some time to recognize the new situation.
If no IPv6 is involved, (A.0) should be superfluous.
If dynamic IPv6 is involved, remember the prerequisites from above:
1.) Tick "Interfaces: [MNet]" -> "Request only an IPv6 prefix"
2.) "System: Gateways: Single" -> do not enter any (global) "Monitor IP" manually.
Both documentations should be upgraded to include these two non-obvious hints; concerning (1.), this was already suggested by @emesterhazy in #7202.
@fichtner: Franco, I hope this helps to further zero in on the real cause.
Thanks, and kind regards Manfred
Postscriptum:
AFAICS, in the case of dynamic IPv6 where the provider grants an IPv6 prefix only, the "error" message cited above,
"The required MNET_DHCP6 IPv6 interface address could not be found", is nothing to wonder about:
There is none, according to contract.
@AdSchellevis: I have been convinced of OPNsense BE for years now and recommend it for its reliability and continuous development. Please understand that I humbly question labeling the non-functioning of an essential firewall capability as "Community Support", especially with respect to the "Business Edition". Kind regards from Munich, respectfully yours, Manfred
@Manfred-Knick I don't mind changing or removing the label, but currently there's not enough relevant data to label this as something other than support. Sometimes strange timing issues occur due to hardware issues, (very) slow modules or half-functional network cards. When needed we do offer commercial support as well to debug specific issues.
A workaround for strange hardware related issues is sometimes to add a start hook (https://docs.opnsense.org/development/backend/autorun.html#syshook) which triggers an ifconfig XXX down && ifconfig XXX up.
If https://github.com/opnsense/core/issues/7202 includes a fix (some code has been pushed), it should be possible to try with the master branch using the latest community release (and switch to development). When this fixes it, changes will eventually flow into the business edition as well.
Further investigating multiple (IPv4) Fallback Gateways with descending GW priorities { 111 , 121 , ... }: applying (A) works for the first fallback. But after "pulling the plug" on the first fallback GW (prio 111) as well, the following GW (prio 121, ...) does not get any default route, and (A.1) does not help any more.
System: Gateways: Single: -> <GW> -> "Edit": "Mark Gateway as Down"
that is, only 'marking' without 'pulling',
I discovered that although MNET_PPPOE and MNET_DHCP6 had been marked as Down,
it was still possible to "ping -6" e.g. heise.de or dns.isp.t-ipnet.de across a gateway in status Offline (forced)!
These working pings only stopped after physically 'pulling the plug'.
Sorry, Ad, denouncing tested, reliably-working server-grade equipment will not hold any water at all.
Confirmation:
Utilizing multiple gateways in order to access different (static) net segments ( System: Routes: Configuration )
is easy and works reliably.
HTH All the best
This issue has been automatically timed-out (after 180 days of inactivity).
For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.
If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.