
pim6d: CPU usage increases too much with many multicast streams

Pengwei-Chen-1 opened this issue 8 months ago • 6 comments

Description


In our environment, we use IPv6 multicast:

  1. The multicast sender is in VM1, with IPv6 PIM enabled.
  2. The switch also has IGMPv3, MLDv2, IPv6 PIM, and OSPFv3 enabled.
  3. The multicast receiver joins (S,G): (2001:72:101::94:16, ff35:94::1).

And the problems are:

  1. When we start only a few multicast streams with small payloads, it works well.
  2. But after we start 1000 multicast streams with 160-byte payloads sent every 20 ms, the CPU usage rises to about 15%.
  3. With the same number of streams and the same payload size over IPv4 there is no problem; CPU usage stays below 1%.
  4. Our production payloads are larger than this, and the CPU usage is then even higher.
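For scale, the traffic described above works out to tens of thousands of packets per second. A quick back-of-the-envelope sketch (assuming "20 ms frequency" means one packet per stream every 20 ms, which is my reading of the report, not a confirmed detail):

```python
# Aggregate load implied by the report: 1000 (S,G) streams,
# 160-byte payloads, one packet per stream every 20 ms (assumption).
streams = 1000
payload_bytes = 160
interval_s = 0.020

pps_per_stream = 1 / interval_s                        # 50 packets/s per stream
total_pps = int(streams * pps_per_stream)              # aggregate packet rate
payload_mbit_s = total_pps * payload_bytes * 8 / 1e6   # payload bandwidth

print(total_pps, payload_mbit_s)  # → 50000 64.0
```

So roughly 50,000 packets/s of multicast traffic, around 64 Mbit/s of payload, which pimd apparently handles in under 1% CPU while pim6d needs about 15%.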


Version

FRRouting 10.0.1 (mcptt-cp) on Linux(5.14.0-427.57.1.el9_4.x86_64).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
    '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sbindir=/usr/lib/frr' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-static' '--disable-werror' '--enable-multipath=256' '--enable-vtysh' '--enable-ospfclient' '--enable-ospfapi' '--enable-rtadv' '--enable-ldpd' '--enable-pimd' '--enable-pim6d' '--enable-pbrd' '--enable-nhrpd' '--enable-eigrpd' '--enable-babeld' '--enable-vrrpd' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-fpm' '--enable-watchfrr' '--disable-bgp-vnc' '--enable-isisd' '--enable-rpki' '--enable-bfdd' '--enable-pathd' '--enable-snmp' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig' 'CC=gcc' 'CXX=g++' 'LT_SYS_LIBRARY_PATH=/usr/lib64:'

How to reproduce

  1. Set up FRR with PIMv6.
  2. Start joining the IPv6 (S,G) streams.
  3. Check the CPU usage with the top command.
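For reference, step 1 would correspond to a minimal frr.conf fragment along these lines (a sketch only; the interface name "eth0" is a placeholder, not taken from the report):

```
! Hypothetical minimal PIMv6/MLD interface configuration;
! "eth0" is a placeholder interface name.
interface eth0
 ipv6 pim
 ipv6 mld
!
```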

Expected behavior

The pim6d CPU usage should be similar to that of the pimd process.

Actual behavior

The pim6d CPU usage is too high.

Additional context

I reported this bug before in https://github.com/FRRouting/frr/issues/16071, but after we upgraded the OS kernel to "5.14.0-427.57.1.el9_4.x86_64", the error log no longer reproduces. The CPU usage, however, is still abnormal.

Checklist

  • [x] I have searched the open issues for this bug.
  • [x] I have not included sensitive information in this report.

Pengwei-Chen-1 avatar Mar 20 '25 05:03 Pengwei-Chen-1

Can we get a flamegraph of what pim6d is doing at the time its usage is high? https://github.com/FRRouting/frr/wiki/Perf-Recording

donaldsharp avatar Mar 20 '25 16:03 donaldsharp

I installed perf and frr-debuginfo, then ran the commands below to generate the files:

perf record -g --call-graph=dwarf -p 21969 -- sleep 10
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > pim6d_debug_flamegraph.svg

perf.zip

Pengwei-Chen-1 avatar Mar 24 '25 08:03 Pengwei-Chen-1

I just installed the kernel-debuginfo-5.14.0-427.57.1.el9_4.x86_64.rpm package and tested again:

perf record -g --call-graph=dwarf -p 777 -- sleep 30
perf script > out.perf
./stackcollapse-perf.pl out.perf | ./flamegraph.pl > pim6d_debug_flamegraph.svg
mv perf.data perf_ipv6_with_debug.data
tar -czvf /tmp/perf_data_2025_03_25_02_ipv6.tar.gz perf_ipv6_with_debug.data out.perf pim6d_debug_flamegraph.svg

perf_data_2025_03_25_02_ipv6.tar.gz

Pengwei-Chen-1 avatar Mar 25 '25 01:03 Pengwei-Chen-1

Hi @donaldsharp, did you check the perf data? If you have any findings, please let me know. Thanks!

Pengwei-Chen-1 avatar Apr 01 '25 02:04 Pengwei-Chen-1

Hi @donaldsharp, is there any other data you need that would help with the issue?

mruprich avatar May 06 '25 08:05 mruprich

@Pengwei-Chen-1 Hi, did you by any chance try to produce the flamegraph for IPv4 as well? I know the code is different, but the network and multicast handling could be similar, and the difference between the two graphs could help pinpoint the issue.

mruprich avatar May 19 '25 09:05 mruprich

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.

github-actions[bot] avatar Nov 16 '25 02:11 github-actions[bot]

This issue will be automatically closed in the specified period unless there is further activity.

frrbot[bot] avatar Nov 16 '25 02:11 frrbot[bot]

Hi @donaldsharp, do you have any tips on what debugging steps could be done next here?

mruprich avatar Nov 18 '25 06:11 mruprich

This issue will no longer be automatically closed.

frrbot[bot] avatar Nov 18 '25 06:11 frrbot[bot]