IPsec very unstable in IPv6 when going from 24.1.8 to 24.1.9_4
Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
- [x] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
- [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue
Describe the bug
Six VPN tunnels between 3 sites, for both IPv4 and IPv6, have been up and working flawlessly for many weeks on 24.1.8.
Problems started after upgrading to 24.1.9_4 this morning. No changes were made to any IPsec or firewall setting, but the update seems to create a mismatch between address families in the firewall rule for ISAKMP (port 500) for IPv6, mixing IPv4 with IPv6 peers; see Relevant log files below.
To Reproduce
Occurs after boot or after a restart under "Status Overview". It is impossible to get all 3 IPv6 tunnels open at once, and if they do open, they close again after a short while.
IPv4 tunnels are not affected.
Expected behavior
Tunnels open for both IPv4 and IPv6, sometimes needing a restart for IPv6 after boot, but once open they stay open like they did in 24.1.8.
Describe alternatives you considered
I am considering reverting to 24.1.8 if possible, but have not done so yet. I tried re-applying the firewall rules by saving them again, without removing and re-adding them.
Screenshots
Not applicable.
Relevant log files
There were error(s) loading the rules: /tmp/rules.debug:131: no routing address with matching address family found. - The line in question reads [131]: pass in log on igc0 reply-to ( igc0 2.242.xxx.x ) proto udp from {2a07:3aa1:x::xx} to {any} port {500} keep state label "00eff9b1ada77af37818877b66bca707" # IPsec: Site1_Site2_IPV6
Additional context
Not that I can think of, no!
Environment
Software version used and hardware type if relevant, e.g.:
Topton 16GB DDR4 256GB NVMe, N100 i226-V DDR5 OPNsense 24.1.9_4 (amd64). Intel® N100 (4 cores, 4 threads) Network Intel® i226-V
doesn't look like anything changed here in this version, but the rule suggests a protocol mismatch between the legacy phase 1 entry and the remote gateway address.
https://github.com/opnsense/core/blob/57184b24e6591be09d438a19351b2cc8ded749c0/src/etc/inc/plugins.inc.d/ipsec.inc#L272
When using dynamic hostnames and the automatic rules don't function as expected, you could also consider using manual rules and disabling the automatic rules (see IPsec advanced settings).
Still, I don't understand why this is happening now, since nothing in the configuration has changed. There is no dynamic entry for IPv6, only static /128 peers on WAN and static /64 subnets on LAN. There is one dynamic host name for IPv4, which is an A record in DNS; could that have an effect?
The ISAKMP port 500 firewall rule is one of 10 manual WAN rules for IPsec. There are also 16 automatic WAN rules.
All 3 sites are dual stack; 2 sites have native IPv6, whereas the third goes through the HE tunnel broker. It is the IPv4 peer on the tunnel broker site that is dynamic.
best check the phase one settings first, my assumption would be that remote host is a dns entry which doesn't resolve on ipv4 (anymore?)
The single dynamic host resolved and the IPv4 tunnels are all open. I have reached all 3 OPNsense routers all the time through local subnets in IPv4.
Still I removed the dynamic host and entered the IPv4-address directly, but it makes no difference. It also looks like all IPv6 tunnels are open in status overview currently, but ping doesn't work and one router already dropped WAN IPv6 address, so I rebooted it again.
Strange indeed!
Right now, after a second boot on one router, I can ping between all 3 sites. Let's leave it for a while and see if it lasts this time...
No, I am afraid it broke again, and one native router lost its IPv6 WAN peer. The other native router seems to lose its WAN IPv6 peer too, but it takes longer before it is dropped.
how is Site1_Site2_IPV6 configured? specifically Remote gateway and Internet Protocol are relevant in this case.
The GUI clips are from the router that throws the exception above.
I split Phase 1 into two parts, A and B, due to its size. Both peers are WAN /128 IPv6 addresses. Phase 2 is the IPv6 LAN /64 subnet. The other site is the same but with swapped peers and subnets. This has been the same all along and worked well in 24.1.8. The only change I made was replacing "Peer IP" with the actual IP address of the remote gateway, but that was after it became unstable, and using the actual IP or "Peer IP" made no difference.
the Site1_Site2_IPV6 is the tunnel triggering the error (and preventing the ruleset from loading)
There are similar exceptions on both native sites:
There were error(s) loading the rules: /tmp/rules.debug:144: no routing address with matching address family found. - The line in question reads [144]: pass in log on igc0 reply-to ( igc0 WAN IPv4 address site 2) proto udp from {WAN IPv6 address site 1} to {any} port {500} keep state label "298e495261c4df8933254da07a2a6412" # IPsec: Site1_Site2_IPV6
There were error(s) loading the rules: /tmp/rules.debug:131: no routing address with matching address family found. - The line in question reads [131]: pass in log on igc0 reply-to ( igc0 WAN IPv4 address site 1 ) proto udp from {WAN IPv6 address site 2} to {any} port {500} keep state label "00eff9b1ada77af37818877b66bca707" # IPsec: Site1_Site2_IPV6
There is no exception on the tunnel broker site!
Is the commit a modification to source code?
yes, but most likely to prevent misconfigurations from generating faulty rules (which prevent the firewall from loading the configuration). The (exact) settings of Site1_Site2_IPV6 matter in this case...
Yeah OK, but in what way should I change settings that obviously worked before? Internet protocol has to be IPv6 and Remote gateway has to be the opposite peer, the WAN IPv6 address, right?
By the way, both native sites have now lost the WAN IPv6 address used for the tunnels. These are normally visible in the bottom right corner of the dashboard. The native sites have leases from different ISPs, so the loss is hardly caused by the ISP end.
One IPv6 address is locked to a DUID in agreement with the ISP, and the one at the other site has never changed anyway.
I have never seen this "loss of WAN IP" behavior in 24.1.8. There could be a slight delay in getting IPv6 after boot, but once there it was solid as a rock in 24.1.8.
as mentioned earlier, code seems to be the same on the IPsec part; sharing the settings of Site1_Site2_IPV6 might help debug your issue. If the rules.debug errors are causing your issue, a misconfiguration is likely. Very often issues like these are caused by the reboot after the upgrade triggering an issue that was already there.
Okay, I believe I have rebooted many times before though, but you never know of course.
Shall I take a backup XML, clean it of sensitive info and upload it here?
a screenshot of the Site1_Site2_IPV6 phase one settings might already be enough.
Well, "Site1_Site2_IPV6" is actually identical to the description "Ekas_Skrea_IPv6" already pasted above, but with mostly hidden IPs and a hidden PSK. Both Phase 1 and Phase 2 are shown; is that not good enough?
I would expect internet protocol being set to ipv4.... hence the question
Yeah OK, but the 3x IPv4 VPN-tunnels between the 3 sites are already working just fine.
The intention is to also have 3x IPv6 VPN tunnels. This is required for domain controllers and servers in the same domain, present on all 3 sites, to work correctly when running dual stack.
I thought I must use IPv6 internet protocol when building IPv6 VPN-tunnels to enable local access between subnets, but maybe I got that wrong?
🤷♂️
Just to avoid misunderstandings the 3x IPv4 + 3x IPv6 VPN-tunnels all worked just fine in 24.1.8 with the current settings. I have not changed any settings or added anything to the configuration.
I need to correct the IPv4 reference in the exceptions that I wrote above for both native sites. These are actually not the WAN IPv4 addresses of my routers, but instead the IPv4 gateway address at the ISP, paired with the WAN IPv6 address of the opposite site:
Exception site 1: There were error(s) loading the rules: /tmp/rules.debug:131: no routing address with matching address family found. - The line in question reads [131]: pass in log on igc0 reply-to ( igc0 ISP gateway IPv4 address site 1 ) proto udp from {WAN IPv6 address site 2} to {any} port {500} keep state label "00eff9b1ada77af37818877b66bca707" # IPsec: Site1_Site2_IPV6
Exception site 2: There were error(s) loading the rules: /tmp/rules.debug:144: no routing address with matching address family found. - The line in question reads [144]: pass in log on igc0 reply-to ( igc0 ISP gateway IPv4 address site 2) proto udp from {WAN IPv6 address site 1} to {any} port {500} keep state label "298e495261c4df8933254da07a2a6412" # IPsec: Site1_Site2_IPV6
Exception site 3: There is no exception on the tunnel broker site and no entries in the general firewall log! The firewall rules are applied to opt1 instead of wan.
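The mismatch reported in the exceptions above (an IPv4 reply-to gateway paired with an IPv6 source) can be checked mechanically. What follows is a minimal Python sketch, not OPNsense code: the helper name `family_mismatch` and the regular expression are hypothetical, and the addresses are documentation-range stand-ins for the masked ones in the log lines.

```python
import ipaddress
import re

# Hypothetical pattern: pulls the reply-to gateway and the source address out of
# a pf rule line shaped like the ones quoted in the exceptions above.
RULE_RE = re.compile(
    r"reply-to\s*\(\s*\S+\s+(?P<gateway>\S+)\s*\).*?from\s*\{(?P<source>[^}]+)\}"
)


def family_mismatch(rule: str) -> bool:
    """Return True when the reply-to gateway and the source differ in IP version."""
    match = RULE_RE.search(rule)
    if match is None:
        raise ValueError("rule has no reply-to/from pair")
    gateway = ipaddress.ip_address(match.group("gateway"))
    source = ipaddress.ip_address(match.group("source").strip())
    return gateway.version != source.version


# Assumed example addresses (192.0.2.0/24 and 2001:db8::/32 are documentation ranges).
bad_rule = (
    "pass in log on igc0 reply-to ( igc0 192.0.2.1 ) proto udp "
    "from {2001:db8::10} to {any} port {500} keep state"
)
good_rule = (
    "pass in log on igc0 reply-to ( igc0 2001:db8::1 ) proto udp "
    "from {2001:db8::10} to {any} port {500} keep state"
)

print(family_mismatch(bad_rule))   # True: IPv4 gateway with IPv6 source
print(family_mismatch(good_rule))  # False: both are IPv6
```

Running a check like this over the lines of /tmp/rules.debug would only flag rules of the shape shown here; it does not explain why the rule generator picked an IPv4 gateway.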
There are a couple of circumstances that may be important:
-The exceptions existed long before 24.1.9_4 and have been in the general firewall log for 3 months, so they may not be related to the instability at all, or the error had no impact in earlier versions.
-There are no ISP gateway IPv4 addresses in any configuration file; these come from the ISP broadcast.
-There are 4 VPN tunnels configured on each site, 2x IPv4 and 2x IPv6.
-There are no IPv4 entries in Phase 1 or Phase 2 of IPv6 tunnels, just
The XML tag in the text was rendered weirdly in the last bullet, about ISAKMP for IPv6: ipprotocol is inet6, port is 500, source and destination are any...
I tried defining source and destination IP addresses for the IPv6 ISAKMP firewall rule as well as for the other IPv6 WAN firewall rules. I also tried using IPv6 gateways other than the default, e.g. loopback ::1 and the link-local ISP gateway address, in the firewall rule GUI. The exception was still thrown and no VPN tunnel opened under these settings. I reverted to the old firewall settings.
It is possible to open the tunnels, with some manual restarts in Status Overview after boot, and ping between IPv6 subnets works for a while. The exception is probably only causing one rule not to load, and has been there over several versions.
The tunnels fail again when the public WAN IPv6 address is dropped after a few minutes, which seems like the most important problem to solve to keep the tunnels alive.
Reverting opnsense and dhcp6c to 24.1.8, as described in the community, made the WAN IPv6 address persist and finally opened the IPv6 VPN tunnels permanently.
The exception about the faulty firewall rule that was not loaded ("There were error(s) loading the rules: /tmp/rules.debug:131: no routing address with matching address family found...") had no impact. Still, it is strange that OPNsense is trying to apply such a rule.
It is tricky to get all 3 tunnels open, and several reboots are required to get MTU 1480 on the Hurricane Electric interface. I heard that GIF defaults to 1280. The maximum ping payload is 1295 when 1480 is applied, due to losses in the tunnels, and only 1095 when 1280 is applied. A 1095 payload is too little for IPv6 packets to travel between a native IPv6 site and a tunnel broker site, and RPC traffic fails for domain controllers. Communication works when the payload is at 1295. Tunnels between two native IPv6 sites always work fine.
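The MTU and payload figures above imply the same per-packet overhead in both cases. A small Python sketch of that arithmetic follows; the 40-byte IPv6 header and 8-byte ICMPv6 echo header are standard sizes, while attributing the remainder to tunnel encapsulation is an assumption rather than something measured in this thread.

```python
# Figures reported above: (interface MTU, largest ping payload that passed).
observations = [(1480, 1295), (1280, 1095)]

IPV6_HEADER = 40   # fixed IPv6 header size in bytes
ICMPV6_HEADER = 8  # ICMPv6 echo request header size in bytes


def overhead(mtu: int, payload: int) -> int:
    """Bytes between the interface MTU and the largest working ping payload."""
    return mtu - payload


for mtu, payload in observations:
    total = overhead(mtu, payload)
    # Assumption: whatever is left after the ping's own headers is
    # encapsulation overhead from the tunnels.
    remainder = total - IPV6_HEADER - ICMPV6_HEADER
    print(f"MTU {mtu}: {total} bytes overhead, {remainder} beyond IPv6+ICMPv6 headers")
```

Both observations give 185 bytes, so the usable payload appears to track the interface MTU one to one.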
I understood there will be no hotfix for 24.1.9 to correct the bug with the lost WAN IPv6 address. Maybe all the problems described here will be solved in the next release?
This issue has been automatically timed-out (after 180 days of inactivity).
For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.
If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.