zebra: zebra core with v6 RA

Open soumyar-roy opened this issue 7 months ago • 1 comments

The following core dump and backtrace were seen in internal code:

Program terminated with signal SIGSEGV, Segmentation fault.
[Current thread is 1 (Thread 0x7fcd750c9540 (LWP 30999))]
(gdb) bt
0 0x00007fcd7596feec in ?? () from /lib/x86_64-linux-gnu/libc.so.6
1 0x00007fcd75920fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
2 0x00007fcd75d008dc in core_handler (signo=11, siginfo=0x7ffd92dcb4f0, context=) at ../lib/sigevent.c:261
3
4 process_rtadv (arg=0x560287b66120) at ../zebra/rtadv.c:511
5 0x00007fcd75d1fa37 in wheel_timer_thread (t=) at ../lib/wheel.c:42
6 0x00007fcd75d13681 in event_call (thread=thread@entry=0x7ffd92dcbb60) at ../lib/event.c:2034
7 0x00007fcd75cbcb00 in frr_run (master=0x56028789ce00) at ../lib/libfrr.c:1242
8 0x0000560272e3945d in main (argc=14, argv=0x7ffd92dcbe88) at ../zebra/main.c:584
(gdb)

Path to the crash (from a different occurrence):

1) Interface uplink_2 is added to the wheel timer for the first time, at the end of rtadv_start_interface_events():
2025-06-07T05:01:23.802459+00:00 mlx-5600-33 zebra[229165]: [SEY8W-2M6VH] debug rtadv_start_interface_events, loc 2>>>>>::ifp::0x55a3281b1990::uplink_2

About once per second, the wheel timer processes interface uplink_2. Log from process_rtadv():
2025-06-07T05:01:29.870749+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2
2025-06-07T05:01:30.870767+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2
2025-06-07T05:01:31.870783+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2
2025-06-07T05:01:32.870794+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2
2025-06-07T05:01:33.870809+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2
2025-06-07T05:01:34.870836+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2

2) Now the same interface, uplink_2, is added to the wheel timer a second time, in rtadv_start_interface_events():

if (adv_if != NULL) {
	rtadv_send_packet(zvrf->rtadv.sock, zif->ifp, RA_ENABLE);
	wheel_add_item(zrouter.ra_wheel, zif->ifp); /* <<< duplicate gets added */
	return; /* Already added */
}

2025-06-07T05:03:44.642871+00:00 mlx-5600-33 zebra[229165]: [G63V5-AKC5D] debug in rtadv_start_interface_events, loc 1 >>>>>::ifp::0x55a3281b1990::uplink_2

Now, about once per second, the wheel timer processes interface uplink_2 twice back to back, which shows that there are indeed duplicate entries for uplink_2 in the wheel timer. Log from process_rtadv():
2025-06-07T05:03:44.878999+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2
2025-06-07T05:03:44.879076+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2
2025-06-07T05:03:45.879096+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2
2025-06-07T05:03:45.879169+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2
2025-06-07T05:03:46.879187+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2
2025-06-07T05:03:46.879240+00:00 mlx-5600-33 zebra[229165]: [H1EZX-2D8SA] debug <<>>>>>>::ifp::0x55a3281b1990::uplink_2

3) Now suppose interface uplink_2 is shut down or removed. Only one instance of the interface is removed from the wheel timer; the other stays there.
4) The memory for interface uplink_2 is freed.
5) When the wheel timer next tries to process uplink_2, it crashes.
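The add/add/remove imbalance in the steps above can be replayed with a toy model. This is only a sketch: `toy_wheel`, `toy_wheel_add`, `toy_wheel_remove`, `toy_wheel_refs` and `toy_replay` are hypothetical names and not the FRR wheel library. It assumes that an add does not check for duplicates and that a remove drops a single matching entry, as described in this report.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_WHEEL_MAX 8

/* Toy stand-in for one timer wheel slot: a flat array of item pointers. */
struct toy_wheel {
	void *items[TOY_WHEEL_MAX];
	size_t count;
};

/* Append without checking for duplicates. Returns 0 on success, -1 if full. */
int toy_wheel_add(struct toy_wheel *w, void *item)
{
	if (w->count == TOY_WHEEL_MAX)
		return -1;
	w->items[w->count++] = item;
	return 0;
}

/* Remove only the first matching entry. Returns 1 if removed, 0 if not found. */
int toy_wheel_remove(struct toy_wheel *w, void *item)
{
	for (size_t i = 0; i < w->count; i++) {
		if (w->items[i] == item) {
			/* Fill the hole with the last entry and shrink the array. */
			w->items[i] = w->items[--w->count];
			return 1;
		}
	}
	return 0;
}

/* Count how many entries still point at the given item. */
size_t toy_wheel_refs(const struct toy_wheel *w, const void *item)
{
	size_t n = 0;

	for (size_t i = 0; i < w->count; i++)
		if (w->items[i] == item)
			n++;
	return n;
}

/* Replay steps 1-3 and return how many entries are left for the interface. */
size_t toy_replay(void)
{
	struct toy_wheel w = { .count = 0 };
	int ifp = 0; /* stands in for the interface object */

	toy_wheel_add(&w, &ifp);    /* step 1: first add */
	toy_wheel_add(&w, &ifp);    /* step 2: duplicate add */
	toy_wheel_remove(&w, &ifp); /* step 3: single remove on shutdown */
	return toy_wheel_refs(&w, &ifp);
}
```

In this model `toy_replay()` leaves one entry behind after the remove. If the object were then freed (step 4), that entry would be a dangling pointer by the time the timer fires (step 5).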

soumyar-roy avatar Jun 10 '25 21:06 soumyar-roy

ci:rerun

soumyar-roy avatar Jun 11 '25 20:06 soumyar-roy

Are we sure this is right now? We did have the question about double-adds in an earlier round of this work.

True, there was a concern about double adds, but there was no practical way to prove it before. Even now, the trigger/behavior is slightly different in upstream FRR (in particular, on a network manager restart we don't get calls to if_up()/if_down() in upstream FRR, but we do in internal code), and I could not reproduce exactly the same signature in upstream FRR, where adds and deletes are balanced out by other triggers. But if there is any path that calls rtadv_start_interface_events() with adv_if != NULL, it should cause the issue in FRR too. The current fix should eliminate any trigger of this kind, known or unknown, that could cause this crash in the future.

Also, I was previously modifying the wheel library to provide an option to check whether an item already exists before adding it. We decided not to add that code because of the performance cost of a linear list walk in the wheel timer.
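For illustration only, one way to avoid a list walk on every add is to keep the "already queued" state on the item itself rather than in the wheel. The sketch below is not the fix that was merged and is not FRR code: `guard_if`, `guard_wheel`, `guard_add`, `guard_remove` and the `on_wheel` flag are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GUARD_WHEEL_MAX 8

/* Hypothetical interface object carrying its own "queued" flag. */
struct guard_if {
	const char *name;
	bool on_wheel;
};

/* Toy wheel slot: a flat array of interface pointers. */
struct guard_wheel {
	struct guard_if *items[GUARD_WHEEL_MAX];
	size_t count;
};

/* Add the interface only if it is not already queued: an O(1) check. */
bool guard_add(struct guard_wheel *w, struct guard_if *ifp)
{
	if (ifp->on_wheel || w->count == GUARD_WHEEL_MAX)
		return false;
	w->items[w->count++] = ifp;
	ifp->on_wheel = true;
	return true;
}

/* Remove the single entry for the interface and clear its flag. */
bool guard_remove(struct guard_wheel *w, struct guard_if *ifp)
{
	if (!ifp->on_wheel)
		return false;
	for (size_t i = 0; i < w->count; i++) {
		if (w->items[i] == ifp) {
			/* Fill the hole with the last entry and shrink the array. */
			w->items[i] = w->items[--w->count];
			ifp->on_wheel = false;
			return true;
		}
	}
	return false;
}
```

With this guard a second `guard_add()` for the same interface is a no-op, so a single `guard_remove()` leaves nothing behind. The cost is that the caller must keep the flag in sync on every add and remove path.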

soumyar-roy avatar Jun 16 '25 18:06 soumyar-roy

@Mergifyio backport dev/10.4

mjstapp avatar Jul 08 '25 14:07 mjstapp

backport dev/10.4

✅ Backports have been created

mergify[bot] avatar Jul 08 '25 14:07 mergify[bot]