net/frr: OSPF CARP interface costs don't survive a service restart
Important notices Before you add a new report, we ask you kindly to acknowledge the following:
- [x] I have read the contributing guide lines at https://github.com/opnsense/plugins/blob/master/CONTRIBUTING.md
- [x] I have searched the existing issues, open and closed, and I'm convinced that mine is new.
- [x] The title contains the plugin to which this issue belongs
Describe the bug The FRR Plugin has multiple ways to interact with carp.
- CARP failover mode
- CARP demote
- Influence interface cost based on CARP
This issue is about the third one: influencing the interface cost based on CARP.
If you choose a CARP VIP for an OSPF interface to depend on, it works as long as the FRR daemons keep running. If a daemon on the node in CARP state BACKUP stops and starts again, it comes up with the normal interface cost, not the demoted one. Both firewalls then advertise the same path cost, which causes routing problems. To get it working again, you have to move the VIPs; after that, the costs are corrected on both firewalls.
Also, if you manually trigger python3 /usr/local/opnsense/scripts/frr/carp_event_handler, the correct costs are applied.
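To make the expected behavior concrete, here is a minimal sketch of the cost selection described above. It is not the plugin's actual carp_event_handler code. The function names, the normal cost of 10, and the choice to treat every non-MASTER state as demoted are assumptions for illustration; the only thing taken from FRR itself is the `ip ospf cost` interface command issued through vtysh.

```python
import subprocess


def select_ospf_cost(carp_state: str, normal_cost: int, demoted_cost: int) -> int:
    """Return the OSPF cost a node should use for the given CARP state.

    Assumption: only MASTER keeps the normal cost; BACKUP and INIT are demoted.
    """
    return normal_cost if carp_state.upper() == "MASTER" else demoted_cost


def build_vtysh_command(interface: str, cost: int) -> list[str]:
    """Build the vtysh invocation that sets the OSPF cost on one interface."""
    return [
        "vtysh",
        "-c", "configure terminal",
        "-c", f"interface {interface}",
        "-c", f"ip ospf cost {cost}",
    ]


def apply_cost(interface: str, carp_state: str, normal_cost: int, demoted_cost: int) -> None:
    """Apply the selected cost; requires a running FRR with vtysh available."""
    cost = select_ospf_cost(carp_state, normal_cost, demoted_cost)
    subprocess.run(build_vtysh_command(interface, cost), check=True)
```

With this sketch, a restarted daemon on the BACKUP node would need `apply_cost("vtnet2", "BACKUP", 10, 1000)` to run again after ospfd has loaded its configuration, which is the step that appears to be missing or overridden in the behavior reported here.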
To Reproduce Steps to reproduce the behavior:
- Use two OPNsense firewalls in HA
- Create an interface in OSPF and set a CARP VIP to track [depend on (carp)]; also set the interface cost and the demoted cost
- Move the VIPs to the second firewall (the first firewall, now in CARP state BACKUP, gets the demoted cost)
- Reboot the first firewall, which is now in CARP state BACKUP
- After the reboot, the backup firewall has the default interface cost and not the demoted one
Expected behavior The firewall in state carp-backup should get the demoted costs after a reboot (or a service restart) and not the default ones.
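When checking the result after the reboot, it helps to confirm which CARP state the node is actually in. The following is a small sketch that extracts the state per VHID from `ifconfig` output, assuming the FreeBSD line format `carp: STATE vhid N ...`. The sample text and the function name are hypothetical and not taken from the affected system.

```python
import re

# Matches lines such as "\tcarp: BACKUP vhid 1 advbase 1 advskew 100"
CARP_LINE = re.compile(r"^\s*carp:\s+(?P<state>\w+)\s+vhid\s+(?P<vhid>\d+)", re.MULTILINE)


def carp_states(ifconfig_output: str) -> dict[int, str]:
    """Map each VHID found in the ifconfig output to its CARP state."""
    return {
        int(match.group("vhid")): match.group("state")
        for match in CARP_LINE.finditer(ifconfig_output)
    }


# Hypothetical sample output for illustration only.
SAMPLE = (
    "vtnet2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\tcarp: BACKUP vhid 1 advbase 1 advskew 100\n"
)

if __name__ == "__main__":
    print(carp_states(SAMPLE))
```

If the reported state is BACKUP while the OSPF interface still shows the default cost, the node is in the situation this issue describes.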
Screenshots none
Relevant log files
Error on system startup:
>>> Error in start script '50-frr'
Additional context Interestingly, this log line never appears.
Environment OPNsense 25.1.6_4 (amd64). Virtual testing appliance
The error on system startup is not the cause. It comes from this startup hook, which I think can be removed, but it doesn't matter for this issue.
~~The error for this issue is, that the override of start_postcmd is not working anymore.~~
Maybe someone has an idea how to solve this in an elegant way?
@AndyX90 I suppose you mean the script parts inside postcmd; when I place an echo at that spot, it does output on start.
@AdSchellevis You are totally right, the override works. I did the same test, but I must have overlooked the log entry. In its current form, "Starting CARP event handler now" never appears, but carp_event_handler is fired 9 times at startup on my side.
Also, the relevant log line is present in the FRR log: ospfd demote interface vtnet2 (cost 1000). But it has no effect. Maybe it gets overridden afterwards.
2025-05-15T17:50:05+02:00 fw.localdomain zebra 97570 - [QS0NJ-H5QKJ] Zebra final shutdown
2025-05-15T17:50:28+02:00 fw.localdomain frr_carp 58103 - FRR received carp configuration event
2025-05-15T17:50:29+02:00 fw.localdomain frr_carp 47538 - FRR received carp configuration event.
2025-05-15T17:50:29+02:00 fw.localdomain frr_carp 47538 - FRR trigger OspfdEventHandler event.
2025-05-15T17:50:29+02:00 fw.localdomain ospfd 47422 - [YWPB2-VEAQY] ASBR[default:Status:1]: Update
2025-05-15T17:50:29+02:00 fw.localdomain zebra 46576 - [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-05-15T17:50:29+02:00 fw.localdomain ospfd 47422 - [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-05-15T17:50:30+02:00 fw.localdomain ospfd 47422 - [S5PCG-77H23] Packet[DD]: Neighbor 192.168.122.92 Negotiation done (Master).
2025-05-15T17:50:30+02:00 fw.localdomain frr_carp 58103 - FRR trigger OspfdEventHandler event.
2025-05-15T17:50:30+02:00 fw.localdomain frr_carp 58103 - ospfd demote interface vtnet2 (cost 1000).
2025-05-15T17:50:31+02:00 fw.localdomain frr_carp 51073 - FRR received carp configuration event.
2025-05-15T17:50:31+02:00 fw.localdomain frr_carp 51073 - FRR trigger OspfdEventHandler event.
2025-05-15T17:50:31+02:00 fw.localdomain frr_carp 56837 - FRR received carp configuration event.
2025-05-15T17:50:31+02:00 fw.localdomain frr_carp 56837 - FRR trigger OspfdEventHandler event.
2025-05-15T17:50:32+02:00 fw.localdomain frr_carp 64780 - FRR received carp configuration event.
2025-05-15T17:50:32+02:00 fw.localdomain frr_carp 64780 - FRR trigger OspfdEventHandler event.
2025-05-15T17:50:32+02:00 fw.localdomain frr_carp 70775 - FRR received carp configuration event.
2025-05-15T17:50:32+02:00 fw.localdomain frr_carp 70775 - FRR trigger OspfdEventHandler event.
2025-05-15T17:50:32+02:00 fw.localdomain frr_carp 76358 - FRR received carp configuration event.
2025-05-15T17:50:32+02:00 fw.localdomain frr_carp 76358 - FRR trigger OspfdEventHandler event.
2025-05-15T17:50:32+02:00 fw.localdomain frr_carp 82511 - FRR received carp configuration event.
2025-05-15T17:50:32+02:00 fw.localdomain frr_carp 82511 - FRR trigger OspfdEventHandler event.
2025-05-15T17:50:32+02:00 fw.localdomain frr_carp 89237 - FRR received carp configuration event.
2025-05-15T17:50:32+02:00 fw.localdomain frr_carp 89237 - FRR trigger OspfdEventHandler event.
2025-05-15T17:50:34+02:00 fw.localdomain watchfrr 37123 - [QDG3Y-BY5TN] zebra state -> up : connect succeeded
2025-05-15T17:50:34+02:00 fw.localdomain watchfrr 37123 - [QDG3Y-BY5TN] ospfd state -> up : connect succeeded
2025-05-15T17:50:34+02:00 fw.localdomain watchfrr 37123 - [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
2025-05-15T17:50:34+02:00 fw.localdomain kernel - <118>WARNING: Old rc.d/watchfrr detected, this file must be deleted
2025-05-15T17:50:34+02:00 fw.localdomain kernel - <118>Checking intergrated config...
2025-05-15T17:50:34+02:00 fw.localdomain kernel - <118>watchfrr already running? (pid=37123).
2025-05-15T17:50:34+02:00 fw.localdomain kernel - <118>>>> Error in start script '50-frr'
2025-05-15T17:50:34+02:00 fw.localdomain zebra 46576 - [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-05-15T17:50:34+02:00 fw.localdomain ospfd 47422 - [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-05-15T17:50:34+02:00 fw.localdomain ospfd 47422 - [JPMW2-G68GC] Zebra[Redistribute]: distribute-list update timer fired!
2025-05-15T17:50:34+02:00 fw.localdomain zebra 46576 - [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-05-15T17:50:34+02:00 fw.localdomain ospfd 47422 - [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
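To check the two observations about this log (how often the handler fires, and whether ospfd reads its configuration again after the demote), a rough sketch like the following can help. It assumes one log entry per line in the `timestamp host process pid - message` layout shown above; the function names are hypothetical. It only establishes ordering in the log and does not prove that the later configuration read is what resets the cost.

```python
import re

# timestamp, host, process, pid, then "-" and the message
ENTRY = re.compile(
    r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+(?P<proc>\S+)\s+(?P<pid>\d+)\s+-\s*(?P<msg>.*)$"
)


def parse_entries(lines):
    """Parse log lines into dicts; lines without a numeric PID (e.g. kernel) are skipped."""
    entries = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def handler_invocations(entries):
    """Count distinct frr_carp PIDs, i.e. separate handler runs."""
    return len({e["pid"] for e in entries if e["proc"] == "frr_carp"})


def config_reads_after_demote(entries):
    """Return ospfd 'Configuration Read in' entries logged after the last demote line."""
    demote_positions = [i for i, e in enumerate(entries) if "demote interface" in e["msg"]]
    if not demote_positions:
        return []
    last = demote_positions[-1]
    return [
        e for e in entries[last + 1:]
        if e["proc"] == "ospfd" and "Configuration Read in" in e["msg"]
    ]
```

Run against the excerpt above, `handler_invocations` should report nine distinct frr_carp PIDs, consistent with the count mentioned earlier, and `config_reads_after_demote` should be non-empty because ospfd logs a configuration read at 17:50:34, after the demote at 17:50:30.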
To summarize: our CARP startup logic dates from a time when watchfrr was not used and the routing daemons were started directly through rc. Nowadays watchfrr is enabled by default (and recommended) and manages the daemons. More information on the actual startup behavior: https://cgit.freebsd.org/ports/tree/net/frr8/files/frr.in#n15 https://cgit.freebsd.org/ports/tree/net/frr8/files/watchfrr.in#n25
I also noticed two unrelated cosmetic issues, by the way:
- A failing startup hook, introduced in https://github.com/opnsense/plugins/commit/f26a704f9ee033e542f609dc716a2119374543df. I think this can be removed nowadays?
- FRR complains about an old watchfrr file
For reference regarding the old watchfrr file, it should be removed automatically when the setup.sh is called:
https://github.com/opnsense/plugins/pull/4552
Probably not related to the original issue, but I am also seeing Error in start script '50-frr' and WARNING: Old rc.d/watchfrr detected, this file must be deleted. I am not using CARP, just OSPF between a few routers.
This issue has been automatically timed-out (after 180 days of inactivity).
For more information about the policies for this repository, please read https://github.com/opnsense/plugins/blob/master/CONTRIBUTING.md for further details.
If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.