Enter Persistent CARP Maintenance Mode doesn't do anything

Open IB-Rahn opened this issue 2 years ago • 7 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [x] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
  • [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

Describe the bug

When clicking the button under Interfaces: Virtual IPs: Status: "Enter Persistent CARP Maintenance Mode", the actual advskew doesn't change. The master still has an advskew of 0. "Current CARP demotion level" displays 240 in the WebGUI. Normal failover (e.g. the firewall goes down) works immediately.

To Reproduce

Steps to reproduce the behavior:

  1. Go to CARP Status
  2. Click on Enter Persistent CARP Maintenance Mode
  3. See current carp demotion changing from 0 to 240
  4. Master stays on the first FW and ifconfig | grep carp outputs advskew is still 0

Expected behavior

CARP Master should move from first FW to second FW where advskew is 100.

Describe alternatives you considered

I already recreated every CARP VIP, since this installation has been through some upgrades.

Screenshots

If applicable, add screenshots to help explain your problem.

Relevant log files

Output from net.inet.carp when the button is untoggled:

sysctl net.inet.carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 240
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.dscp: 56
net.inet.carp.allow:

Output from net.inet.carp when the button is toggled:

sysctl net.inet.carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 240
net.inet.carp.demotion: 240
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.dscp: 56
net.inet.carp.allow: 1
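To make the difference between the two outputs explicit, here is a minimal Python sketch that parses both and reports which values changed. The helper names `parse_sysctl` and `changed_keys` are made up for illustration, and the samples are abbreviated copies of the output above.

```python
def parse_sysctl(text: str) -> dict[str, str]:
    """Turn 'name: value' lines, as printed by sysctl, into a dict."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            values[name.strip()] = value.strip()
    return values


def changed_keys(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (old, new)} for every key whose value differs."""
    names = sorted(set(before) | set(after))
    return {n: (before.get(n), after.get(n))
            for n in names if before.get(n) != after.get(n)}


# Abbreviated samples taken from the two outputs in this report.
UNTOGGLED = """\
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.demotion: 0
net.inet.carp.preempt: 1
"""
TOGGLED = """\
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.demotion: 240
net.inet.carp.preempt: 1
"""

diff = changed_keys(parse_sysctl(UNTOGGLED), parse_sysctl(TOGGLED))
print(diff)
```

On these samples only net.inet.carp.demotion differs, which matches what the WebGUI shows.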

Output from ifconfig when the button is toggled and untoggled:

ifconfig | grep carp
carp: MASTER vhid 10 advbase 1 advskew 0
carp: MASTER vhid 110 advbase 1 advskew 0
carp: MASTER vhid 111 advbase 1 advskew 0
carp: MASTER vhid 112 advbase 1 advskew 0
carp: MASTER vhid 170 advbase 1 advskew 0
carp: MASTER vhid 171 advbase 1 advskew 0
carp: MASTER vhid 172 advbase 1 advskew 0
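For comparing this output between the two firewalls, a small parser can help. The following is a sketch only: the regular expression is written against the line format shown above, and `parse_carp_lines` is a hypothetical helper name. Since the advskew printed here is identical in both states, the state column (MASTER or BACKUP) is the field worth watching.

```python
import re

# Matches lines such as: "carp: MASTER vhid 10 advbase 1 advskew 0"
CARP_LINE = re.compile(
    r"carp:\s+(?P<state>\w+)\s+vhid\s+(?P<vhid>\d+)"
    r"\s+advbase\s+(?P<advbase>\d+)\s+advskew\s+(?P<advskew>\d+)"
)


def parse_carp_lines(text: str) -> list[dict]:
    """Extract state, vhid, advbase and advskew from `ifconfig | grep carp` output."""
    return [
        {
            "state": m["state"],
            "vhid": int(m["vhid"]),
            "advbase": int(m["advbase"]),
            "advskew": int(m["advskew"]),
        }
        for m in CARP_LINE.finditer(text)
    ]


SAMPLE = """\
carp: MASTER vhid 10 advbase 1 advskew 0
carp: MASTER vhid 110 advbase 1 advskew 0
"""

vips = parse_carp_lines(SAMPLE)
# List the VHIDs for which this node still holds the MASTER role.
masters = [v["vhid"] for v in vips if v["state"] == "MASTER"]
print(masters)
```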

Additional context

In the past I tried fixing this problem with a patch from #3671 and it worked before. But this was when we still had bare metal FW and since we moved to a virtual environment this issue has been swept under the rug.

Environment

Software version used and hardware type if relevant, e.g.:

Both FW are virtualized under Hyper-V
OPNsense 22.7_4-amd64
FreeBSD 13.1-RELEASE
OpenSSL 1.1.1q 5 Jul 2022
Intel® Xeon™ E3-1220V6 3.0Ghz Quad Core
Hyper-V Network Adapter

IB-Rahn, Aug 04 '22 11:08

When both Systems are fine both have demotion of 0, so why is this a problem just increasing demotion?

mimugmail, Aug 04 '22 16:08

When both Systems are fine both have demotion of 0, so why is this a problem just increasing demotion?

They both are fine and have a demotion of 0 but I can't make a controlled switchover when e.g. I want to update the master fw.

IB-Rahn, Aug 05 '22 11:08

Usually you have both at 0, and since FW1 has a skew of 0 it's the preferred master. The update process looks like this:

  • Update of FW2
  • When up again, check services, features, states
  • Wait a couple of minutes so FW2 receives most open states
  • Set FW1 in maintenance mode, demotion gets +240 so FW2 is forced master (will survive a reboot)
  • See how network runs on new firmware, up to you, for 10 minutes or 2 days
  • No complaints? Update FW1 and, when it is on the same version as FW2, wait again a couple of minutes for states
  • Leave maintenance mode on FW1

This has proved to be quite stable for a very long time, for quite a number of customers.

mimugmail, Aug 05 '22 14:08

That's how I've read about a smooth update process before and also how I want to do it. But the problem is that at point 4 (Set FW1 in maintenance mode) nothing happens. I see the demotion value of 240 in the WebGUI, but the actual advskew doesn't change. So FW1 will always be master. It switches over when I reboot it anyway because the interfaces are going down, but this isn't a controlled way like you described.

IB-Rahn, Aug 08 '22 12:08

This behaviour has nothing to do with the advskew; to me it looks like the switch is doing IGMP snooping.

mimugmail, Aug 08 '22 12:08

I can't find anything about the behaviour of the button in the OPNsense manual. But from the pfSense manual I got this:

The next button toggles CARP maintenance mode. In maintenance mode the VIP configuration remains on the interfaces and a node participating in CARP demotes itself naturally by increasing the advertising frequency skew of its VIPs to the maximum value, 254. This allows other CARP nodes to take over the MASTER role naturally.

Sets the skew of all VIPs to 254 and sets the maintenance mode flag in the firewall configuration. If this flag is present in the configuration at boot time, the node will remain in maintenance mode.

That's why I assumed the button just changes the advskew to make the switchover. But your guess is that FW1 can't reach FW2 because of IGMP snooping?

IB-Rahn, Aug 08 '22 13:08

Then pfSense might have changed that functionality at some point after the fork, a couple of weeks/months/years ago. Changes at pfSense aren't tracked here. Personally I never had an issue with just setting a higher demotion.

So, when you set FW1 to maintenance mode, run tcpdump on both firewalls on the specific interface. If FW2 still receives CARP packets and doesn't send any itself, then FW2 seems to have a problem (a demotion value higher than 0) for an unspecified reason (=reboot), or CARP was temporarily disabled on FW2. If FW2 does send out CARP packets but FW1 doesn't receive them, the switch is intercepting them instead of flooding them out (IGMP snooping).
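The reasoning above can be written down as a small decision function. This is only a sketch of the cases named in this comment: the function name `diagnose`, its three boolean inputs (what you observe in tcpdump on each firewall), and the returned texts are invented for illustration.

```python
def diagnose(fw2_receives_fw1: bool, fw2_sends: bool, fw1_receives_fw2: bool) -> str:
    """Map tcpdump observations, taken while FW1 is in maintenance mode, to a likely cause."""
    if fw2_sends and not fw1_receives_fw2:
        # FW2 advertises, but the packets never arrive at FW1.
        return "switch drops CARP multicast (check IGMP snooping)"
    if fw2_receives_fw1 and not fw2_sends:
        # FW2 hears the demoted FW1 but does not try to take over.
        return "FW2 problem (demotion above 0, or CARP disabled on FW2)"
    return "inconclusive, collect more captures"


# Example: FW2 sees FW1's advertisements and stays silent.
print(diagnose(fw2_receives_fw1=True, fw2_sends=False, fw1_receives_fw2=False))
```

Any combination not covered by the comment falls through to "inconclusive" rather than guessing at a cause.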

mimugmail, Aug 08 '22 13:08

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.

OPNsense-bot, Jan 31 '23 11:01

Hi @mimugmail, let me know if I should re-open a new ticket for this.

We are halfway through setting up HA using CARP and we encountered an issue: we set up a CARP virtual IP while maintenance mode was ON (persistent), and when we rebooted, the firewall thought it was master (so maintenance mode was OFF).

Does this mean that Persistent CARP Maintenance Mode doesn't survive a reboot? Or was it a bug? We're using OPNsense 22.10.2-amd64 business edition.

Nono-m0le, Mar 09 '23 10:03

@Nono-m0le that would be odd, persistent stores its state in the configuration (virtualip_carp_maintenancemode), you should see a change in the config history of this event as well. To my knowledge there haven't been bugs on this subject for a long time and it's quite frequently used by many.

https://github.com/opnsense/core/blob/stable/22.7/src/www/carp_status.php#L37-L55

AdSchellevis, Mar 09 '23 11:03

thanks @AdSchellevis

So here is the status: maintenance mode was ON (<virtualip_carp_maintenancemode>1</virtualip_carp_maintenancemode>) and still was after the reboot (no change in the config history). BUT the virtual IPs we configured on this instance (the only one configured so far) took the master role. We assumed that with maintenance mode ON it shouldn't do that (i.e. not use / enable the virtual IP), which it did anyway.
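For anyone wanting to check the flag in an exported configuration file, here is a hedged Python sketch. The element name comes from this thread; where exactly it sits in the file and how its value is interpreted are assumptions, so the code searches the whole tree and treats an empty value or "0" as unset. `maintenance_mode_enabled` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def maintenance_mode_enabled(config_xml: str) -> bool:
    """Return True if virtualip_carp_maintenancemode is present with a non-empty, non-zero value."""
    root = ET.fromstring(config_xml)
    # Assumption: the exact position is not confirmed here, so search all descendants.
    node = root.find(".//virtualip_carp_maintenancemode")
    if node is None:
        return False
    return (node.text or "").strip() not in ("", "0")


# Minimal stand-in for a configuration export, not a real config file.
SAMPLE = """<opnsense>
  <virtualip_carp_maintenancemode>1</virtualip_carp_maintenancemode>
</opnsense>"""

print(maintenance_mode_enabled(SAMPLE))
```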

Nono-m0le, Mar 09 '23 11:03

@Nono-m0le that's usually a switching issue, not related to the firewall

AdSchellevis, Mar 09 '23 11:03

@AdSchellevis can you please explain further what you mean by that?

I'm just wondering why it's possible that the Virtual IPs are in "MASTER"-mode if the "CARP Maintenance Mode" is active. In my opinion, this should never be possible.

Nono-m0le, Mar 09 '23 11:03

@Nono-m0le Maintenance mode just demotes the node, which makes it backup if there is another, more important node. It does not "force" anything to backup, nor does the CARP protocol support this (as it's not needed for a functional setup). Most issues have to do with misconfigured (virtual) switches. It's certainly not a bug.
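A rough model of what "demotes" means may help here. The sketch below rests on assumptions about FreeBSD's CARP implementation as I understand it: the advertised skew is the configured advskew plus the global demotion counter, capped at 240, and the node with the shortest advertisement interval (advbase + skew/256 seconds) is preferred. The function names are invented, and the model assumes all nodes can hear each other's advertisements.

```python
CARP_MAXSKEW = 240  # assumed cap on the advertised skew


def effective_advskew(configured: int, demotion: int) -> int:
    """Skew a node advertises: configured advskew plus demotion, clamped to 0..240."""
    return max(0, min(CARP_MAXSKEW, configured + demotion))


def advert_interval(advbase: int, configured: int, demotion: int = 0) -> float:
    """Advertisement interval in seconds; a shorter interval means higher priority."""
    return advbase + effective_advskew(configured, demotion) / 256


def preferred_master(nodes: dict[str, tuple[int, int, int]]) -> str:
    """nodes maps a name to (advbase, advskew, demotion); return the preferred node."""
    return min(nodes, key=lambda name: advert_interval(*nodes[name]))


# Values from this thread: FW1 has advskew 0, FW2 has advskew 100.
print(preferred_master({"FW1": (1, 0, 0), "FW2": (1, 100, 0)}))
# FW1 in maintenance mode (demotion 240), FW2 reachable.
print(preferred_master({"FW1": (1, 0, 240), "FW2": (1, 100, 0)}))
# FW1 in maintenance mode with no other node, or none it can hear.
print(preferred_master({"FW1": (1, 0, 240)}))
```

Under these assumptions the configured advskew never changes, which would explain why ifconfig keeps showing 0, and a demoted node with no reachable peer still ends up as master.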

AdSchellevis, Mar 09 '23 11:03

I'm having a similar issue that might be switch related (or virtual switch related), as I'm running two VMs with OPNsense on a Nutanix cluster. Unfortunately, changing into and out of maintenance mode does nothing. I've been following several threads on this one and this thread is the most likely match; however, there was no explanation as to "what" misconfiguration to look for (real or virtual switch) and how that misconfiguration might affect the problem.

What I do know is that the Nutanix cluster relies on Open vSwitch under the hood, prior to connections to actual switches. Doing an omping between two VMs in the cluster results in unicast success and multicast ping failure. I don't know what else to look for beyond that.

mdella-nutanix, Dec 13 '23 15:12