frr icon indicating copy to clipboard operation
frr copied to clipboard

nhrpd: fixes core dump on shutdown

Open dleroy opened this issue 9 months ago • 11 comments

When nhrpd is shutdown via nhrp_request_stop() the shutdown sequence was not handling the case where there are active shortcut routes installed. The zebra client and shortcut rib were being cleaned up before vrf_terminate() had an opportunity to delete the active routes.

core.txt

dleroy avatar Apr 29 '24 16:04 dleroy

would be nice to have a topotest for this

louberger avatar Apr 29 '24 17:04 louberger

Having a coredump inside the commit would be helpful too.

ton31337 avatar Apr 29 '24 19:04 ton31337

Having a coredump inside the commit would be helpful too.

core.txt link has the backtrace. Next time I'll include in line.

dleroy avatar Apr 29 '24 20:04 dleroy

Having a coredump inside the commit would be helpful too.

core.txt link has the backtrace. Next time I'll include in line.

Why not this time?

ton31337 avatar Apr 30 '24 03:04 ton31337

@Mergifyio backport stable/10.0 stable/9.1 stable/9.0 stable/8.5

ton31337 avatar Apr 30 '24 11:04 ton31337

backport stable/10.0 stable/9.1 stable/9.0 stable/8.5

✅ Backports have been created

mergify[bot] avatar Apr 30 '24 12:04 mergify[bot]

Having a coredump inside the commit would be helpful too.

core.txt link has the backtrace. Next time I'll include in line.

Why not this time?

Hi Donatas, I'd be happy to add coredump and associated nhrpd. I looked for a previous pull request example of best practices how to do that and couldn't find another PR with a coredump. Can you suggest how FRR community would like that done. Thanks.

dleroy avatar Apr 30 '24 15:04 dleroy

The zebra client and shortcut rib were being cleaned up before vrf_terminate() had an opportunity to delete the active routes.

(This is non-blocking re. this PR) — we may have a more general problem here with ordering shutdown operations. vrf_init is called before any NHRP initialization (not counting nhrp_error_init), and normally the shutdown calls should be in reverse order to init calls. If we need to clear out routes first, don't do that, and that causes problems… something is wrong with our setup of operations.

eqvinox avatar May 02 '24 08:05 eqvinox

The topotest added by this commit is failing on Ubuntu 18.04 arm8 because 'iptables' is not installed on the image. The test leverages 'iptables' in order to be able to test nhrp shortcuts

2024/05/02-00:05:51 2024-05-02 00:05:51,225 INFO: topo: input: iptables -A FORWARD -i r1-gre0 -o r1-gre0 -m hashlimit --hashlimit-upto 4/minute --hashlimit-burst 1 --hashlimit-mode srcip,dstip --hashlimit-srcmask 24 --hashlimit-dstmask 24 --hashlimit-name loglimit-0 -j NFLOG --nflog-group 1 --nflog-range 128 2024/05/02-00:05:51 2024-05-02 00:05:51,237 WARNING: r1: Router(r1): proc failed: rc 127 pid 51424 2024/05/02-00:05:51 args: /usr/bin/nsenter --mount=/proc/50823/ns/mnt --net=/proc/50823/ns/net --uts=/proc/50823/ns/uts -F /bin/bash -c iptables -A FORWARD -i r1-gre0 -o r1-gre0 -m hashlimit --hashlimit-upto 4/minute --hashlimit-burst 1 --hashlimit-mode srcip,dstip --hashlimit-srcmask 24 --hashlimit-dstmask 24 --hashlimit-name loglimit-0 -j NFLOG --nflog-group 1 --nflog-range 128 2024/05/02-00:05:51 stdout: /bin/bash: iptables: command not found 2024/05/02-00:05:51 stderr: empty 2024/05/02-00:05:51 2024-05-02 00:05:51,238 INFO: topo: output: /bin/bash: iptables: command not found

dleroy avatar May 06 '24 14:05 dleroy

If you are adding usage of iptables and it is not present you should skip the test on that platform. See the bgp_as_allow_in/test_bgp_as_allow_in.py script and look at setup_module and add a similiar test to the kernel version and pytest.skip(...) if its not available.

donaldsharp avatar May 17 '24 14:05 donaldsharp

If you are adding usage of iptables and it is not present you should skip the test on that platform. See the bgp_as_allow_in/test_bgp_as_allow_in.py script and look at setup_module and add a similiar test to the kernel version and pytest.skip(...) if its not available.

Thanks Donald. I pushed the fix to ignore unsupported platforms already. Now just trying to get all the unrelated basics tests to pass. i.e.

🚫 ospf6_ecmp_inter_area.test_ospf6_ecmp_inter_area test_ecmp_inter_area

E AssertionError: 'r1' wrong number of route nexthops assert [1, 1, 1, 1, 1, 1, 2, 2, 2] == [1, 1, 1, 1, 1, 2, 2, 2, 2] At index 5 diff: 1 != 2 Full diff: - [1, 1, 1, 1, 1, 2, 2, 2, 2] ? --- + [1, 1, 1, 1, 1, 1, 2, 2, 2] ? +++

dleroy avatar May 17 '24 14:05 dleroy

@mergifyio backport stable/8.4

Jafaral avatar May 30 '24 23:05 Jafaral

backport stable/8.4

✅ Backports have been created

mergify[bot] avatar May 30 '24 23:05 mergify[bot]