[ BE 23.10.2, 23.10.3, 24.4.2 ] Automatic fail-over to a Fallback Gateway fails [WORKAROUND]

Open Manfred-Knick opened this issue 1 year ago • 27 comments

Original title was: [ BE 23.10.2 ] "Dual Stack" IPv4 + (dynamic) IPv6: Automatic fail-over to a Fallback Gateway fails

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [x] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
  • [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

Describe the bug

OPNsense Business Edition 23.10.2 : "Dual Stack" IPv4 + (dynamic) IPv6: Automatic fail-over to another IPv4 Fallback Gateway fails; Re-Boot needed

To Reproduce

Primary Network:

[ FttB ]   M-Net Premium IP MGA „100 Mbit/s"
[ VDSL ]   PPPoE via VLAN tag 40
"Dual Stack":
    IPv4: fixed
    IPv6: dynamic

providing two gateways:
MNET_PPPOE (IPv4)
MNET_DHCP6 (IPv6)

Secondary Network:

[ W-Lan Router ]
[ VDSL ] "1 und 1"  (via Deutsche Telekom)
"DS-Lite"

providing one "Fallback" gateway:
FB_GWv4 (IPv4)

The configuration was created following

  • "Practical_OPNsense__4th_2023"
  • https://docs.opnsense.org/

o) Boot - fully connected:

Lobby:Dashboard:
MNET     ^ (up)
Fallback ^ (up)

System:Gateways:Single
MNET_PPPOE (IPv4) (active) 101 (upstream) Online
MNET_DHCP6 (IPv6) (active) 102 (upstream) Offline   <--- !!!
FB_GWv4 (IPv4)    (active) 111 (upstream) Online

via ssh: Routing:
Default >>> MNET_PPPOE (IPv4)
Default >>> MNET_DHCP6 (IPv6)

Notabene:
default            fe80::aaaa:bbbb:cc UG       pppoe1

via ssh: dpinger:
/usr/local/bin/dpinger -f -S -r 0 -i Fb_GWv4 ...
/usr/local/bin/dpinger -f -S -r 0 -i MNET_PPPOE ...

but no dpinger directed at                          <--- !!!
Monitor-IP for MNET_DHCP6 (IPv6)                    <--- !!!

o) Disconnect DSL-MoDem <---> OPNsense-FW

System:Gateways:Single
MNET_DHCP6 (IPv6) (active) 102 (upstream) Offline   <--- !!!
FB_GWv4 (IPv4)    (active) 111 (upstream) Online
MNET_PPPOE (IPv4) defunct (upstream)      Offline

via ssh: Routing:
Default >>> MNET_DHCP6 (IPv6)                       <--- !!!
==> DNS lookup (IPv4) fails

All of the "usual suspects" solutions suggested in earlier forum threads fail; MNET_DHCP6 does not change to "defunct".

o) Reboot helps - but strange:

Lobby:Dashboard:
MNET ^ (up)                                         <--- !!!

via ssh: Routing:
No Default >>> IPv6
Default >>> Fallback (IPv4)

o) Re-connect:

Lobby:Dashboard:
MNET ^ (up)

System:Gateways:Single
MNET_PPPOE (IPv4) (active) 101 (upstream) Online
MNET_DHCP6 (IPv6) (active) 102 (upstream) Offline   <--- !!!
FB_GWv4 (IPv4)    (active) 111 (upstream) Online

Restore <- Fallback successful:
Default >>> MNET_PPPOE (IPv4)
Default >>> MNET_DHCP6 (IPv6)

o) Reboot:

Same as beginning - as to be expected:

|: . . . :|  ad lib

Expected behavior

In order to secure the (remote) connection, automatic fail-over should not need a Reboot to function properly.

Describe alternatives you considered

Screenshots

Relevant log files

Additional context

Environment

OPNsense Business Edition 23.10.2-amd64

Manfred-Knick avatar Mar 17 '24 15:03 Manfred-Knick

Might be relevant:

dpinger fails to restart after a PPPoE link recycle https://forum.opnsense.org/index.php?topic=37948.msg186016#msg186016

dpinger restart required to recover WAN https://forum.opnsense.org/index.php?action=post;quote=191750;topic=39149.0

dpinger : -> issues: none https://github.com/dennypage/dpinger

system: adjust to dpinger reality in latency/loss handing #6231

The IPv6 gateway is reported as offline if the WAN interface has a ULA address. #6939

Manfred-Knick avatar Mar 18 '24 10:03 Manfred-Knick

Hint: To make the above work at least via reboot, I had to configure fixed (*) addresses in
. . . System: Settings: General: "DNS Servers"
and disable
. . . "DNS server options" -> "Allow DNS server list to be overridden by DHCP/PPP on WAN"

(*) dns0.eu, primary and secondary, IPv4 and IPv6:
. . . 193.110.81.0
. . . 185.253.5.0
. . . 2a0f:fc80::
. . . 2a0f:fc81::

Manfred-Knick avatar Mar 18 '24 11:03 Manfred-Knick

What's the actual issue? I don't see any logs. It's probably not finding an address to monitor from?

fichtner avatar Mar 18 '24 11:03 fichtner

Core Issue: Problem with (dynamic) IPv6 gateway handling, esp. default route handling

As described, changing from completely working to a broken primary connection:

Name: MNET_DHCP6 (active)               <----- ! it should not be any more
Interface: MNet
Protocol: IPv6
Priority: 192 (upstream)                <----- ! This should be "defunct" !
Gateway: -
Monitor IP: 2a02:2e0:3fe:1001:302::     <----- not available any more
RTT: 0.0 ms
RTTd: 0.0 ms
Loss: 100 %
Status: Offline
Description: Interface MNET_DHCP6 Gateway

This surviving gateway keeps its default IPv6 route:
default            fe80::46ec:ceff:fe UG       pppoe1

Name: MNET_PPPOE : Priority: defunct (upstream)

The broken IPv4 default route has been correctly removed.

Name: Fb_GWv4 (active) Priority: 11 (upstream) Status: Online

The corresponding IPv4 default route does not get created! <----- !

Thus, for all IPv4 based local networks, there is "no way out".

Only a REBOOT creates the correct IPv4 default route via the fallback gateway in place of the removed one, yielding a working Fallback (IPv4) connection.
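The fail-over behaviour expected in this comment can be sketched as follows. This is a minimal illustration, not OPNsense's implementation: the `Gateway` type and `pick_default` helper are hypothetical, and it assumes that a lower priority number wins and that only online gateways are eligible. The gateway names and priorities are the ones from the report.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gateway:
    name: str
    family: str    # "inet" for IPv4, "inet6" for IPv6
    priority: int  # assumption: lower number means higher priority
    online: bool


def pick_default(gateways: List[Gateway], family: str) -> Optional[Gateway]:
    """Return the online gateway with the best priority for one address family."""
    candidates = [g for g in gateways if g.family == family and g.online]
    return min(candidates, key=lambda g: g.priority, default=None)


# State after the DSL modem has been disconnected (values from the report).
gateways = [
    Gateway("MNET_PPPOE", "inet", 101, online=False),
    Gateway("MNET_DHCP6", "inet6", 102, online=False),
    Gateway("Fb_GWv4", "inet", 111, online=True),
]

v4 = pick_default(gateways, "inet")
v6 = pick_default(gateways, "inet6")
print("IPv4 default via:", v4.name if v4 else "none")
print("IPv6 default via:", v6.name if v6 else "none")
```

Under these assumptions the IPv4 default route would move to Fb_GWv4 and the IPv6 default route would be withdrawn, which is the outcome the report only observes after a reboot.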

Manfred-Knick avatar Mar 18 '24 12:03 Manfred-Knick

Logs: Please be so kind as to specify which ones you need.

Manfred-Knick avatar Mar 18 '24 12:03 Manfred-Knick

Even after the REBOOT:

Name: MNET_DHCP6 (active)     <----- ! it should not be !
Priority: 192 (upstream)      <----- ! This should be "defunct" !

Manfred-Knick avatar Mar 18 '24 12:03 Manfred-Knick

System: Log Files: General would probably be a good start.

fichtner avatar Mar 18 '24 12:03 fichtner

grep MNET_PPPOE /var/log/gateways/latest.log | grep -v "sendto error:"
grep MNET_DHCP6 /var/log/gateways/latest.log | grep -v "sendto error:"

2024-03-18T12:52:28+01:00

Both monitors had correctly detected
. . . Alarm: none -> loss
. . . Alarm: loss -> down

Manfred-Knick avatar Mar 18 '24 12:03 Manfred-Knick

Let's switch to a more productive and streamlined effort and start with the log please.

fichtner avatar Mar 18 '24 13:03 fichtner

! Mid-Air collision !

Not wanting to copy'n'paste that here - how am I supposed to extract all that info displayed from your web page into a file, presumably in order to attach that with [ @ ] ?

Relief: Just found the tiny little "download selection" button at the very end ;-)

Manfred-Knick avatar Mar 18 '24 13:03 Manfred-Knick

Just need these lines to begin with:

# opnsense-log | grep skipping

Cheers, Franco

fichtner avatar Mar 18 '24 13:03 fichtner

opnsense-log | grep skipping

<12>1 2024-03-18T12:11:56+01:00 scrat.maknit-sendling.de opnsense-business 26800 - [meta sequenceId="78"] /interfaces.php: The required MNET_DHCP6 IPv6 interface address could not be found, skipping.

<12>1 2024-03-18T13:15:57+01:00 scrat.maknit-sendling.de opnsense-business 311 - [meta sequenceId="326"] /usr/local/etc/rc.bootup: The required MNET_DHCP6 IPv6 interface address could not be found, skipping.

<12>1 2024-03-18T13:15:57+01:00 scrat.maknit-sendling.de opnsense-business 311 - [meta sequenceId="329"] /usr/local/etc/rc.bootup: The required MNET_PPPOE IPv4 interface address could not be found, skipping.
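To see at a glance which gateway monitors were skipped, the messages above can be extracted with a small filter. This is a sketch only: the regular expression is derived from the three log lines quoted here, and the helper name `skipped_monitors` is hypothetical.

```python
import re

# Pattern taken from the wording of the log lines quoted in this thread.
SKIP_RE = re.compile(
    r"The required (?P<gateway>\S+) (?P<family>IPv[46]) interface address "
    r"could not be found, skipping\."
)


def skipped_monitors(log_text):
    """Return (gateway, family) pairs for every 'skipping' message in the log."""
    return [(m.group("gateway"), m.group("family")) for m in SKIP_RE.finditer(log_text)]


sample = """\
/interfaces.php: The required MNET_DHCP6 IPv6 interface address could not be found, skipping.
/usr/local/etc/rc.bootup: The required MNET_DHCP6 IPv6 interface address could not be found, skipping.
/usr/local/etc/rc.bootup: The required MNET_PPPOE IPv4 interface address could not be found, skipping.
"""

for gateway, family in skipped_monitors(sample):
    print(gateway, family)
```

The output of `opnsense-log` could be piped into such a filter instead of the inline sample.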

system.log

Manfred-Knick avatar Mar 18 '24 13:03 Manfred-Knick

Ok that's the issue... now when you have a running dpinger for MNET_DHCP6 what does it use?

# ps auxwww | grep MNET_DHCP6

Also, rerunning the monitor init probably fixes it temporarily?

# pluginctl monitor

It should bring back the missing monitor. With this out of the way we could start looking for a reason.

fichtner avatar Mar 18 '24 13:03 fichtner

Cleared the log and re-booted with DSL-MoDem attached, for a clean start, IPv6 working:

. # ping -6 heise.de     <--- Monitor-IP
PING6(56=40+8+8 bytes) 2001:a61:2a0a:e906:20e:cff:febc:7262 --> 2a02:2e0:3fe:1001:302::
16 bytes from 2a02:2e0:3fe:1001:302::, icmp_seq=1 hlim=58 time=10.823 ms

0:00.00 /usr/local/bin/dpinger -f -S -r 0 -i MNET_DHCP6 -B 2001:a61:2a0a:e906:20e:cff:febc:7262 -p /var/run/dpinger_MNET_DHCP6.pid -u /var/run/dpinger_MNET_DHCP6.sock -s 1s -l 4s -t 60s -d 0 2a02:2e0:3fe:1001:302::

DETACHING the cable between DSL-MoDem and OPNsense FW now:
. # date: Mon Mar 18 15:04:56 CET 2024

0:00.04 /usr/local/bin/dpinger -f -S -r 0 -i MNET_DHCP6 -B 2001:a61:2a0a:e906:20e:cff:febc:7262 -p /var/run/dpinger_MNET_DHCP6.pid -u /var/run/dpinger_MNET_DHCP6.sock -s 1s -l 4s -t 60s -d 0 2a02:2e0:3fe:1001:302::

. # pluginctl monitor
Setting up gateway monitors...done.

RESULT: as before:
. . . "faulty" IPv6 default route
. . . no IPv4 default route

REBOOT: . # date: Mon Mar 18 15:11:07 CET 20

. # ps auxwww | grep dpinger
root 38220 0.0 0.0 13340 2508 - Is 15:13 0:00.01 /usr/local/bin/dpinger -f -S -r 0 -i Fb_GWv4 -B 192.168.8.99 -p /var/run/dpinger_Fb_GWv4.pid -u /var/run/dpinger_Fb_GWv4.sock -s 1s -l 4s -t 60s -d 0 192.168.178.1
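The missing monitors can also be spotted mechanically by comparing the configured gateways with the dpinger processes that are actually running. The sketch below assumes the dpinger argument layout visible in the `ps` lines of this thread (`-i` identifier, `-B` bind address, destination address last). The helper names `parse_dpinger` and `missing_monitors` are hypothetical.

```python
def parse_dpinger(cmdline):
    """Extract gateway name, bind address and monitor IP from one ps line."""
    tokens = cmdline.split()
    starts = [i for i, tok in enumerate(tokens) if tok.endswith("/dpinger")]
    if not starts:
        return None  # not a dpinger process line
    args = tokens[starts[0] + 1:]

    def value_of(flag):
        # The flag must be followed by a value, so ignore it in last position.
        if flag in args[:-1]:
            return args[args.index(flag) + 1]
        return None

    if value_of("-i") is None:
        return None
    return {"gateway": value_of("-i"), "bind": value_of("-B"), "monitor": args[-1]}


def missing_monitors(expected, ps_text):
    """Return the expected gateway names that have no running dpinger."""
    running = set()
    for line in ps_text.splitlines():
        info = parse_dpinger(line)
        if info:
            running.add(info["gateway"])
    return sorted(set(expected) - running)


# The single dpinger line reported after the reboot.
ps_after_reboot = (
    "root 38220 0.0 0.0 13340 2508 - Is 15:13 0:00.01 /usr/local/bin/dpinger "
    "-f -S -r 0 -i Fb_GWv4 -B 192.168.8.99 -p /var/run/dpinger_Fb_GWv4.pid "
    "-u /var/run/dpinger_Fb_GWv4.sock -s 1s -l 4s -t 60s -d 0 192.168.178.1\n"
)

print(missing_monitors(["MNET_PPPOE", "MNET_DHCP6", "Fb_GWv4"], ps_after_reboot))
```

For the post-reboot state shown above, this reports MNET_DHCP6 and MNET_PPPOE as unmonitored.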

system__002.log

Manfred-Knick avatar Mar 18 '24 14:03 Manfred-Knick

Just discovered that a reboot seems to reset "System: Log Files: General"

  • REBOOT
  • DETACH
  • Save the Log: --> system__003.log
  • RESET to re-gain a connection via Fallback
  • Save the Log: --> system__004.log

system__003.log

system__004.log

HTH !

Manfred-Knick avatar Mar 18 '24 14:03 Manfred-Knick

It's very hard to follow again. Attached files are not helpful either :(

fichtner avatar Mar 18 '24 14:03 fichtner

? I just provided copies of the results of the commands you requested; the attached files are the "System: Log Files: General" you requested in first place.

{ Will be back in an hour, approx. }

Manfred-Knick avatar Mar 18 '24 14:03 Manfred-Knick

I know, but I said I'd like to keep this focused, and we use GitHub to record all information in text. Attaching files and posting other metrics while working on a diagnosis is detrimental to the support flow.

fichtner avatar Mar 18 '24 14:03 fichtner

{ Sorry - TelCo took longer }

Understood. I'd be happy to grep the parts of interest to you and quote them inline.

Manfred-Knick avatar Mar 18 '24 16:03 Manfred-Knick

M.Net contracts supply:
. . . IPv4 (fixed) + IPv6 (dynamic)
Telekom contracts supply:
. . . IPv4 (fixed) + IPv6 prefix (fixed)

Thus I ran the following comparison using

Interfaces: [Test_IPv6]
Identifier: opt5
Device: em1
IPv6 Configuration Type: Track Interface
IPv6 Interface: MNet

COMPARISON:

A) Interfaces: [MNet] : Request only an IPv6 prefix: [ ]

pppoe1: No inet6 2001: ... assigned
em1:       inet6 2001: ... assigned                     <----- difference

Internet6:
default            fe80::46ec:ceff:fe UG       pppoe1
2001:AAA:BBBB:...  link#6             U           em1

... /usr/local/bin/dpinger -f -S -r 0 -i MNET_PPPOE ...
... /usr/local/bin/dpinger -f -S -r 0 -i Fb_GWv4    ...

B) Interfaces: [MNet] : Request only an IPv6 prefix: [x]

pppoe1: No inet6 2001: ... assigned
em1:    No inet6 2001: ... assigned                     <----- difference

Internet6:
default            fe80::c242:d0ff:fe UG       pppoe1
2001:aaa:bbbb:cccc localhost          USB         lo0   <----- in addition
2001:aaa:bbbb:cccc link#6             U           em1
2001:aaa:bbbb:cccc link#6             UHS         lo0   <----- in addition
redirector.heise.d fe80::xxxx:yyyy:zz UGHS     pppoe1   <----- in addition

... /usr/local/bin/dpinger -f -S -r 0 -i MNET_PPPOE ...
... /usr/local/bin/dpinger -f -S -r 0 -i MNET_DHCP6 ... <----- in addition
... /usr/local/bin/dpinger -f -S -r 0 -i Fb_GWv4 -B ...

Probably unrelated:

Concerning https://docs.opnsense.org/manual/radvd.html: In my case, no "» Services » Router Advertisements" submenu is being offered.

.# ps aux | grep radvd
... /usr/local/sbin/radvd -p /var/run/radvd.pid -C /var/etc/radvd.conf -m syslog

.# cat /var/etc/radvd.conf

# Automatically generated, do not edit
# Generated RADVD config for dhcp6 assignment from wan on opt5
interface em1 {
	AdvSendAdvert on;
	AdvLinkMTU 1492;
	AdvManagedFlag on;
	AdvOtherConfigFlag on;
	prefix 2001:a61:aaaa:bbbb::/64 {
		DeprecatePrefix on;
		AdvOnLink on;
		AdvAutonomous on;
	};
	RDNSS 2001:a61:2a0e:5b06:20e:cff:febc:7262 { };
	DNSSL maknit-sendling.de { };
};

In both cases, "wieistmeineip.de" states:

Your IPv4 address is: 	xxx.xxx.xxx.xxx
Your IPv6 address is: 	not available     <----- !
Test IPv4: 	OK
Test IPv6: 	failed
Test Dual Stack: 	OK

Manfred-Knick avatar Mar 18 '24 19:03 Manfred-Knick

Digging /var/log/system (excerpt):
/usr/local/etc/rc.newwanipv6: Failed to detect IP for interface wan

/usr/local/etc/rc.newwanipv6:

list ($ip) = interfaces_primary_address6($interface);
if (!is_ipaddr($ip)) {
    log_msg("Failed to detect IP for interface {$interface}", LOG_INFO);
    return;
}

Following the path: finally leading to #7202

Furthermore:
/system_gateways_edit.php: The command '/sbin/route delete -inet6 '2a02:2e0:3fe:1001:302::'' returned exit code '1', the output was 'route: route has not been found delete host 2a02:2e0:3fe:1001:302:: fib 0: not in table'

ERGO:
1.) Tick "Interfaces: [MNet]" -> "Request only an IPv6 prefix"
2.) "System: Gateways: Single" -> Remove the manual "Monitor IP" entry

grep /var/log/system: neither of the two entries above ("Failed" and "delete") has shown up again so far.

But, DETACHING the cable between DSL-MoDem and OPNsense FW again:
/usr/local/etc/rc.newwanip: Failed to detect IP for interface wan
/usr/local/etc/rc.newwanipv6: Failed to detect IP for interface wan
Unfortunately,
MNET_DHCP6 (active) MNet IPv6 102 (upstream) 0.0 ms 0.0 ms 100.0 % Offline
still survives, as well as its corresponding routing entries

Internet6:
default            fe80::c242:d0ff:fe UG       pppoe1
2001:a61:2a12:6400 localhost          USB         lo0
2001:a61:2a12:6406 link#6             U           em1
2001:a61:2a12:6406 link#6             UHS         lo0

blocking DNS via IPv4. All three dpinger still running.
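A stale default route like the one in the table above can be picked out of the routing table text with a few lines of parsing. This is a sketch under assumptions: the column order (destination, gateway, flags, interface) is taken from the excerpts quoted in this thread, the helper names are hypothetical, and the set of "down" interfaces has to be supplied by the caller.

```python
def default_routes(table_text):
    """Return all 'default' entries of a routing table excerpt."""
    routes = []
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "default":
            routes.append({"gateway": fields[1], "flags": fields[2], "netif": fields[3]})
    return routes


def stale_defaults(table_text, down_netifs):
    """Return default routes that still point at an interface known to be down."""
    return [r for r in default_routes(table_text) if r["netif"] in down_netifs]


# Routing table excerpt quoted above (the gateway column is truncated in the source).
internet6 = """\
Internet6:
default            fe80::c242:d0ff:fe UG       pppoe1
2001:a61:2a12:6400 localhost          USB         lo0
2001:a61:2a12:6406 link#6             U           em1
2001:a61:2a12:6406 link#6             UHS         lo0
"""

for route in stale_defaults(internet6, down_netifs={"pppoe1"}):
    print("stale default route via", route["gateway"], "on", route["netif"])
```

With the cable detached, pppoe1 is effectively down, so the surviving IPv6 default route is flagged as stale.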

# pluginctl monitor does not change the situation!

REBOOT: yields a FW handling IPv4 perfectly across its Fallback gateway, with
MNET_DHCP6 (active) MNet IPv6 102 (upstream) ~ ~ ~ Offline
and only one dpinger on Fb_GWv4, as expected.

RE-CONNECT: Instantly and flawlessly restores the full connection, with no further reboot needed and three dpingers running:
MNET_DHCP6 (active) MNet IPv6 102 (upstream) fe80::c242:d0ff:fe91:abc0 fe80::c242:d0ff:fe91:abc0 6.2 ms 2.5 ms 0.0 % Online

The core problem remains: it is not enough to detect a broken dynamic IPv6 connection. The broken MNET_DHCP6 gateway must also be handled as defunct, and consequently the obsolete IPv6 default route and the equally obsolete IPv6 assignments must be removed.

Manfred-Knick avatar Mar 19 '24 00:03 Manfred-Knick

Notabene:

As long as the broken MNET_DHCP6 keeps blocking, even an IPv6-capable Fallback would not remedy the situation, regardless of whether its IPv6 is "fixed" or "dynamic".

The situation might be different if the primary connection provides "fixed" IPv6. Unfortunately, ATM, I do not have such a test bed available.

@emesterhazy: Thank you

@fichtner: You have marked #7202 for Milestone 24.7. Will an update for BE take till 24.10 then?

Manfred-Knick avatar Mar 19 '24 11:03 Manfred-Knick

Even worse:

Hoping for a solid workaround, I restricted the setup to the (fixed) IPv4 connection only:
Interfaces: [MNet] : IPv6 Configuration Type: None
Interfaces: [Test_IPv6] : IPv6 Configuration Type: None
and, for consistency, deleted the MNET_DHCP6 gateway as well.

Alas: After DETACHING the cable between DSL-MoDem and OPNsense FW, the broken MNET_PPPOE IPv4 default route was removed but, again, no new default route was created at all; thus Fb_GWv4 (active), although correctly promoted to highest priority, remains useless.

Neither "pluginctl monitor" nor waiting patiently for 7 minutes helps.

Only after a REBOOT, the expected default route via Fb_GWv4 is available.

Very irritating:
Lobby: Dashboard: Interfaces: "MNet" is still displayed as ^ (up) !
Interfaces: Overview: -> MNet interface (wan, pppoe1):
. . . Status up
. . . PPPoE up

Manfred-Knick avatar Mar 19 '24 11:03 Manfred-Knick

Eureka! A very minimalist workaround, still needing manual intervention, but
. . . circumventing any need to reboot,
. . . circumventing any need to hack via ssh:

A) "Broken Primary Connection, switch to Fallback Connection"

A.0) Remove any broken IPv6 assignments from the Internet6 routing tables:
Interfaces: Overview: MNet interface (wan, pppoe1): -> DHCP: "DHCPv6 up": -> "Release"

A.1) Create a new default route across the Fallback interface:
System: Gateways: Single: -> Fb_GWv4 (active) -> "Edit": just -> "Save"
System: Gateways: Single: -> "Apply changes"

B) "Primary Connection working again, switch-back from Fallback Connection"

B.1) Bring back MNET_PPPOE and re-create the missing IPv6 (default) routing:
Interfaces: MNet: just -> "Save"
Interfaces: MNet: -> "Apply changes"

Tested with Intel® Ethernet Server Adapters I350 - T{2, 4} v2 :
. . . via detaching the cable between DSL-MoDem and OPNsense FW
. . . via re-attaching the cable between DSL-MoDem and OPNsense FW
You can repeat this ad libitum, if you like, without any reboot at all. Do not be irritated if pages like "wieistmeineip.de" need some time to recognize the new situation.

If no IPv6 is involved, (A.0) should be superfluous.

If dynamic IPv6 is involved, remember the prerequisites from above:
1.) Tick "Interfaces: [MNet]" -> "Request only an IPv6 prefix"
2.) "System: Gateways: Single" -> do not enter any (global) "Monitor IP" manually.

Both sets of documentation should be updated to include these two non-obvious hints; concerning (1.), this was already suggested by @emesterhazy in #7202.

@fichtner: Franco, I hope this helps to further zero in on the real cause.

Thanks, and kind regards Manfred

Postscriptum: AFAICS, in the case of dynamic IPv6 where the provider grants an IPv6 prefix only, the "error" message cited above, "The required MNET_DHCP6 IPv6 interface address could not be found", is not surprising: by contract, there is none.

Manfred-Knick avatar Mar 19 '24 19:03 Manfred-Knick

@AdSchellevis: Having been truly convinced of OPNsense BE for years now, and recommending its use for its reliability and continuous development, please understand that I humbly question labeling the non-functioning of an essential firewall capability as "Community Support", especially with respect to the "Business Edition". Kind regards from Munich, respectfully yours, Manfred

Manfred-Knick avatar Mar 20 '24 08:03 Manfred-Knick

@Manfred-Knick I don't mind changing or removing the label, but currently there's not enough relevant data to label this as anything other than support. Sometimes strange timing issues occur due to hardware issues, (very) slow modules or half-functional network cards. When needed, we also offer commercial support to debug specific issues.

A workaround for strange hardware related issues is sometimes to add a start hook (https://docs.opnsense.org/development/backend/autorun.html#syshook) which triggers an ifconfig XXX down && ifconfig XXX up.

If https://github.com/opnsense/core/issues/7202 includes a fix (some code has been pushed), it should be possible to try with the master branch using the latest community release (and switch to development). When this fixes it, changes will eventually flow into the business edition as well.

AdSchellevis avatar Mar 20 '24 08:03 AdSchellevis

Further investigation with multiple (IPv4) Fallback Gateways with descending GW priorities { 111 , 121 , ... }: applying (A) works for the first fallback. But after "pulling the plug" on the first fallback GW (prio 111) as well, the next GW (prio 121, ...) does not get any default route, and (A.1) no longer helps.

Using System: Gateways: Single: -> <GW> -> "Edit": "Mark Gateway as Down", i.e. only 'marking' without 'pulling', I discovered that although MNET_PPPOE and MNET_DHCP6 had been marked as Down, it was still possible to "ping -6" e.g. heise.de or dns.isp.t-ipnet.de across a gateway in status Offline (forced)! These pings only stopped working after physically 'pulling the plug'.

Sorry, Ad, denouncing tested, reliably-working server-grade equipment will not hold any water at all.

Confirmation: Utilizing multiple gateways in order to access different (static) net segments ( System: Routes: Configuration ) is easy and works reliably.

HTH All the best

Manfred-Knick avatar Mar 31 '24 13:03 Manfred-Knick

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.

OPNsense-bot avatar Sep 13 '24 14:09 OPNsense-bot