frr icon indicating copy to clipboard operation
frr copied to clipboard

FRR 9.1 Thread Starvation Logs

Open SwimGeek opened this issue 1 year ago • 4 comments

Describe the bug

This is more of a feature suggestion than a bug report.

I noticed 'Thread Starvation' warnings in FRR logs after updating BGP local-pref values. vtysh locks up / becomes unresponsive.

Two ideas:

  1. Would it be possible to to update the Thread Starvation warnings to make it more obvious where the problem is. I suspect it's related to RPKI.

  2. Related to earlier Issue #15033 - I was wondering if it makes sense to do the RPKI processing in a different thread, so more CPU cores can be used.

Log entries included below.

  • [x] Did you check if this is a duplicate issue?
  • [ ] Did you test it on the latest FRRouting/frr master branch?

To Reproduce

I updated a route-map and set a local-pref value.

Expected behavior

I was not expecting to see warning messages in the logs as this is a fairly basic change. vtysh should ideally not lock up.

Versions

  • OS Version: Debian 11
  • Kernel: 5.10.205
  • FRR Version: 9.1 - with 8 core 2.3GHz xeon CPU

Additional context

Log messages below:

Jan 14 05:26:50 cpt-ter-r1 bgpd[766]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x55fdb6a16200 arg=0x0 timer  r=-14.603    bgp_sync_label_manager() &bm->t_bgp_sync_label_manager from ../bgpd/bgp_labelpool.c:576} was scheduled to pop greater than 4s ago
Jan 14 05:27:05 cpt-ter-r1 bgpd[766]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x55fe32d6c470 arg=0x55fdb68b18a0 timer  r=-8.733     (bgp_generate_updgrp_packets)() (&connection->t_generate_updgrp_packets) from ../bgpd/bgp_io.c:152} was scheduled to pop greater than 4s ago
Jan 14 05:27:25 cpt-ter-r1 bgpd[766]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x55fe32d98ba0 arg=0x55fdda28f1c0 timer  r=-10.125    update_subgroup_merge_check_thread_cb() &subgrp->t_merge_check from ../bgpd/bgp_updgrp.c:1463} was scheduled to pop greater than 4s ago
Jan 14 05:28:35 cpt-ter-r1 bgpd[766]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x7f8d648784b0 arg=0x55fdc5c62da0 timer  r=-47.636    (bgp_generate_updgrp_packets)() (&connection->t_generate_updgrp_packets) from ../bgpd/bgp_io.c:152} was scheduled to pop greater than 4s ago
Jan 14 05:58:26 cpt-ter-r1 bgpd[766]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x7f8d641a7650 arg=0x55fddf559f70 timer  r=-15.115    (bgp_generate_updgrp_packets)() (&connection->t_generate_updgrp_packets) from ../bgpd/bgp_fsm.c:962} was scheduled to pop greater than 4s ago

SwimGeek avatar Jan 14 '24 04:01 SwimGeek

Unfortunately with the way the event system works there is no good way to figure out what is causing the issue without developer intervention. show thread cpu before and a show thread cpu afterwords. Then a developer can spend some time figuring out what is going on. Additionally an operator can gather perf data when these errors are seen and we can then figure out what is causing theissue to be addressed

donaldsharp avatar Jan 16 '24 16:01 donaldsharp

Hi

And the idea to move some of the RPKI load to extra cpu core?

SwimGeek avatar Jan 16 '24 16:01 SwimGeek

Using FRR 9.1, I was also observing thread starvation for t_generate_updgrp_packets on an initial update to the neighbor:

2024/02/07 14:06:56 BGP: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x55fdad4713a0 arg=0x55fda6fb5780 timer  r=-14.845    (bgp_generate_updgrp_packets)() (&connection->t_generate_updgrp_packets) from ../bgpd/bgp_fsm.c:965} was scheduled to pop greater than 4s ago

I found that this problem was introduced by a7a7fa57fe1f7a018438433d6b55d2c633b1fe82 which changes bgp_advertise_add() from O(1) time complexity to O(n). Addition of new bgp_advertise items to the list started to require iteration to the list tail. The more items were added, the more time it took. Then I discovered that bgp_advertise_add() looked differently in master because of b2e0c12d723a6464f67491ceb912225f35c9427a. With that patch, bgp_advertise_add() works O(1) time again because an optimized version of the list is used now. The tail is stored and no iteration to find it is no longer needed.

When I applied b2e0c12d723a6464f67491ceb912225f35c9427a, I stopped observing the problem. So I think that change needs to be released soon.

achernavin22 avatar Feb 09 '24 08:02 achernavin22

It will be released under 10.0.

ton31337 avatar Feb 09 '24 10:02 ton31337

I have a very simple BGP setup with 1 Full table, 2 Cymru Bogons feed and 1 iBGP sessions, both for IPv4 and IPv6. In FRR 9.1 it is not possible to get this working. I end up with the Thread Starvation errors or crashes as mentioned above. When using b2e0c12 it works. Is there really no way this will be merged in 9.1 to make it usable?

roedie avatar Mar 26 '24 12:03 roedie

I've asked to backport https://github.com/FRRouting/frr/pull/14554 to the 9.0 and 9.1 branches to make the list fix available there.

mjstapp avatar Mar 26 '24 15:03 mjstapp