frr icon indicating copy to clipboard operation
frr copied to clipboard

isisd: Fix memory leaks when the transition of neighbor state from non-UP to DOWN

Open zhou-run opened this issue 1 year ago • 18 comments

When receiving a hello packet, if the neighbor state transitions directly from a non-ISIS_ADJ_UP state (such as ISIS_ADJ_INITIALIZING) to ISIS_ADJ_DOWN state, the neighbor entry cannot be deleted. If the neighbor is removed or the neighbor's System ID changes, it may result in memory leakage in the neighbor entry.

Test Scenario: LAN link between Router A and Router B is established. Router A does not configure neighbor authentication, while Router B is configured with neighbor authentication. When the neighbor entry on Router B ages out, the neighbor state on Router A transitions to INIT. If Router B is then removed, the neighbor state on Router A transitions to DOWN and persists.

Signed-off-by: zhou-run [email protected]

zhou-run avatar Apr 09 '24 13:04 zhou-run

Does the failed test make sense, or in other words, should the neighbor not be deleted when transitioning from UP to DOWN state? https://ci1.netdef.org/browse/FRR-PULLREQ3-TOPO5U18ARM8-2673 https://ci1.netdef.org/browse/FRR-PULLREQ3-TOPO8U18AMD64-2673 https://ci1.netdef.org/browse/FRR-PULLREQ3-ASAN7-2673 https://ci1.netdef.org/browse/FRR-PULLREQ3-TOPO1DEB10AMD64-2673

zhou-run avatar Apr 10 '24 07:04 zhou-run

Does the failed test make sense, or in other words, should the neighbor not be deleted when transitioning from UP to DOWN state? https://ci1.netdef.org/browse/FRR-PULLREQ3-TOPO5U18ARM8-2673 https://ci1.netdef.org/browse/FRR-PULLREQ3-TOPO8U18AMD64-2673 https://ci1.netdef.org/browse/FRR-PULLREQ3-ASAN7-2673 https://ci1.netdef.org/browse/FRR-PULLREQ3-TOPO1DEB10AMD64-2673

I re-run the failed test to be sure it is not a false positive.

Now, regarding your question, the neighbor is created when you add it in the configuration and when IS-IS Hello messages are received on the given interface and are deleted when you remove it in the configuration. I think, we must keep the context when the interface goes from UP to DOWN to speed up the process when the interface goes from DOWN to UP. In particular, when an interface (frequent when optical connector SFP reach end of life) flap (going Down, then Up, then Down ...), if you remove the context, the code will delete dynamic memory allocation and then recreate it and so on. This could decrease performance and at the end the risk of a memory corruption. So, I think it is preferable to keep the context if the interface goes DOWN and only remove the context when the neighbor is removed from the configuration

odd22 avatar Apr 16 '24 15:04 odd22

Does the failed test make sense, or in other words, should the neighbor not be deleted when transitioning from UP to DOWN state? https://ci1.netdef.org/browse/FRR-PULLREQ3-TOPO5U18ARM8-2673 https://ci1.netdef.org/browse/FRR-PULLREQ3-TOPO8U18AMD64-2673 https://ci1.netdef.org/browse/FRR-PULLREQ3-ASAN7-2673 https://ci1.netdef.org/browse/FRR-PULLREQ3-TOPO1DEB10AMD64-2673

I re-run the failed test to be sure it is not a false positive.

Now, regarding your question, the neighbor is created when you add it in the configuration and when IS-IS Hello messages are received on the given interface and are deleted when you remove it in the configuration. I think, we must keep the context when the interface goes from UP to DOWN to speed up the process when the interface goes from DOWN to UP. In particular, when an interface (frequent when optical connector SFP reach end of life) flap (going Down, then Up, then Down ...), if you remove the context, the code will delete dynamic memory allocation and then recreate it and so on. This could decrease performance and at the end the risk of a memory corruption. So, I think it is preferable to keep the context if the interface goes DOWN and only remove the context when the neighbor is removed from the configuration

Does 'remove the context when the neighbor is removed from the configuration' refer to the command 'clear isis neighbor [WORD]'?

zhou-run avatar Apr 17 '24 09:04 zhou-run

Does 'remove the context when the neighbor is removed from the configuration' refer to the command 'clear isis neighbor [WORD]'? No. I'm referring to CLI no isis router WORD at the interface level which delete the IS-IS configuration on a given interface.

odd22 avatar Apr 30 '24 13:04 odd22

this says there are linter problems, but it doesn't show that the are ... let's try rerunning ci to see if we can find them

riw777 avatar May 07 '24 15:05 riw777

ci:rerun

riw777 avatar May 07 '24 15:05 riw777

Does 'remove the context when the neighbor is removed from the configuration' refer to the command 'clear isis neighbor [WORD]'? No. I'm referring to CLI no isis router WORD at the interface level which delete the IS-IS configuration on a given interface.

Does this mean that when the IS-IS configuration is not removed on the specified interface, we do not need to consider potential memory leaks caused by frequent changes in the System ID due to neighbor updates? Also, if neighbors are not deleted, should the neighbor state transition from DOWN to UP again also ensure that it goes through the INITIALIZING state?

zhou-run avatar May 13 '24 05:05 zhou-run

@Mergifyio backport stable/10.0

ton31337 avatar May 17 '24 06:05 ton31337

backport stable/10.0

✅ Backports have been created

mergify[bot] avatar May 17 '24 06:05 mergify[bot]

@frrbot rereview

ton31337 avatar May 17 '24 07:05 ton31337

Squash to a single commit all 3 commits.

Should I create a new Pull Request to merge the commits? Because I can't find this Pull Request in my repository, it is in the FRRouting/frr repository, but I can't find the "Merge pull request" button.

zhou-run avatar May 18 '24 01:05 zhou-run

Squash to a single commit all 3 commits.

Done.

zhou-run avatar May 31 '24 03:05 zhou-run

Does 'remove the context when the neighbor is removed from the configuration' refer to the command 'clear isis neighbor [WORD]'? No. I'm referring to CLI no isis router WORD at the interface level which delete the IS-IS configuration on a given interface.

Does this mean that when the IS-IS configuration is not removed on the specified interface, we do not need to consider potential memory leaks caused by frequent changes in the System ID due to neighbor updates? Also, if neighbors are not deleted, should the neighbor state transition from DOWN to UP again also ensure that it goes through the INITIALIZING state?

are we still waiting on an answer to this before pushing?

should this be labeled a bugfix? it seems like it should ...

riw777 avatar Jun 04 '24 11:06 riw777

No. I'm referring to CLI no isis router WORD at the interface level which delete the IS-IS configuration on a given interface.

Does this mean that when the IS-IS configuration is not removed on the specified interface, we do not need to consider potential memory leaks caused by frequent changes in the System ID due to neighbor updates? Also, if neighbors are not deleted, should the neighbor state transition from DOWN to UP again also ensure that it goes through the INITIALIZING state?

I tried the scenario of a P2P link. If the link goes DOWN without deleting the neighbor table entry, the new neighbor cannot be established because the old neighbor is still occupying the slot, unless ISIS is deleted and reconfigured. @odd22 @riw777

zhou-run avatar Jun 06 '24 01:06 zhou-run

I tried the scenario of a P2P link. If the link goes DOWN without deleting the neighbor table entry, the new neighbor cannot be established because the old neighbor is still occupying the slot, unless ISIS is deleted and reconfigured. @odd22 @riw777

OK. So, in this case, we must delete the neighbor context to be sure that the adjacency could goes UP when the link goes UP again. Just to be sure, when you perform the test, you shutdown the interface and then no shut the interface ? You don't try to change the neighbor or the configuration of the interface ?

odd22 avatar Jun 11 '24 14:06 odd22

I just perform a test on one of my router: set the interface down with ìp link set down dev ethX` check that we lost the IS-IS adjacency and thus the neighbor. Then, set the link up again and we get back the IS-IS neighbor. I perform the test on a standard FRR. Not compiled with your patch. So, the original code handle correctly link failure when interface is P2P. So, in case of the state is not UP, I think you are right and neighbor should be removed to handle the memory leak and also the case where the adjacency is UP and goes down, then UP again even if it consume some CPU/Memory allocation in case of intermittent link failure.

odd22 avatar Jun 11 '24 14:06 odd22

I tried the scenario of a P2P link. If the link goes DOWN without deleting the neighbor table entry, the new neighbor cannot be established because the old neighbor is still occupying the slot, unless ISIS is deleted and reconfigured. @odd22 @riw777

OK. So, in this case, we must delete the neighbor context to be sure that the adjacency could goes UP when the link goes UP again. Just to be sure, when you perform the test, you shutdown the interface and then no shut the interface ? You don't try to change the neighbor or the configuration of the interface ?

This is an exceptional scenario. I did not shutdown the interface and then no shut the interface, but due to failed adjacency authentication, the neighbor became unidirectional. After the IS table entry for the neighbor aged out, it relearned the INIT entry for the neighbor. At this point, when I modified the neighbor's IS net, it resulted in the new neighbor unable to establish adjacency.

zhou-run avatar Jun 12 '24 09:06 zhou-run

what requested change are we waiting on here?

riw777 avatar Jun 24 '24 21:06 riw777

what requested change are we waiting on here?

I don't know either. @ton31337

zhou-run avatar Jun 25 '24 01:06 zhou-run