frr icon indicating copy to clipboard operation
frr copied to clipboard

isisd: fix crash when deactivating ISIS adjacency on the interface.

Open zhou-run opened this issue 9 months ago • 8 comments

  1. When the command "no <ip|ipv6> router isis WORD" is executed on the interface, it invokes list_delete_all_node to iterate and release the memory of all nodes in the cirtcuit->u.bc.adjdb[1] linked list. However, the nodes are not unlinked during this traversal process, leading to the call of *list->del to delete the data of the linked list nodes.
  2. For ISIS, deleting the data of the linked list nodes is done by calling isis_delete_adj. Subsequently, isis_level2_adj_up will be called to iterate and query the cirtcuit->u.bc.adjdb[1] linked list. If there are many neighbors on this interface, accessing the memory of the released linked list nodes may occur.
  3. Not limited to ISIS, if the linked list is not unlinked during the deletion of all nodes in process 1, *list->del should not be allowed to iterate through the list again.

Signed-off-by: zhou-run [email protected]

zhou-run avatar May 11 '24 07:05 zhou-run

Can you show the backtrace?

In my opinion the problem is in isis_circuit_up() when we create a list we don't specify circuit->u.bc.adjdb[0]->del = isis_delete_adj;.

The issue lies in isis_circuit_down(), not isis_circuit_up(). The backtrace is as follows:

(gdb) bt
#0  0x00007f7d6e541fe1 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f7d6e63188c in core_handler (signo=11, siginfo=0x7ffc0ced2630, context=<optimized out>) at lib/sigevent.c:262
#2  <signal handler called>
#3  0x00005647f5b11568 in isis_level2_adj_up (area=area@entry=0x5647f7c89830) at isisd/isis_lsp.c:423
#4  0x00005647f5b14073 in isis_reset_attach_bit (adj=0x5647f7cad690) at isisd/isis_lsp.c:474
#5  lsp_handle_adj_state_change (adj=0x5647f7cad690) at isisd/isis_lsp.c:2162
#6  0x00005647f5b53675 in hook_call_isis_adj_state_change_hook (adj=adj@entry=0x5647f7cad690) at isisd/isis_adjacency.c:152
#7  0x00005647f5b536f3 in isis_delete_adj (arg=0x5647f7cad690) at isisd/isis_adjacency.c:167
#8  0x00007f7d6e5fe003 in list_delete_all_node (list=0x5647f7c88060) at lib/linklist.c:316
#9  0x00007f7d6e5fe069 in list_delete (list=list@entry=0x5647f7c84708) at lib/linklist.c:326
#10 0x00005647f5b0872e in isis_circuit_down (circuit=0x5647f7c84620) at isisd/isis_circuit.c:835
#11 0x00005647f5b09f81 in isis_csm_state_change (event=event@entry=IF_DOWN_FROM_Z, circuit=circuit@entry=0x5647f7c84620, arg=arg@entry=0x5647f7c7f7a0)
    at isisd/isis_csm.c:196
#12 0x00005647f5b083b0 in isis_circuit_disable (circuit=0x5647f7c84620) at isisd/isis_circuit.c:100
#13 isis_circuit_del (circuit=0x5647f7c84620) at isisd/isis_circuit.c:200
#14 0x00005647f5b434f5 in lib_interface_isis_destroy (args=<optimized out>) at isisd/isis_nb_config.c:2612
#15 0x00007f7d6e61347a in nb_callback_destroy (errmsg_len=2, errmsg=0x7ffc0ced38d0 "", dnode=0x5647f7c948f0, event=NB_EV_APPLY, nb_node=<optimized out>,
    context=<optimized out>) at lib/northbound.c:1131
#16 nb_callback_configuration (context=<optimized out>, event=event@entry=NB_EV_APPLY, change=change@entry=0x5647f7cb6680, errmsg=errmsg@entry=0x7ffc0ced38d0 "",
    errmsg_len=errmsg_len@entry=8192) at lib/northbound.c:1356
#17 0x00007f7d6e6138b7 in nb_transaction_process (errmsg_len=8192, errmsg=0x7ffc0ced38d0 "", transaction=0x5647f7c94080, event=NB_EV_APPLY)
    at lib/northbound.c:1473
#18 nb_candidate_commit_apply (transaction=0x5647f7c94080, save_transaction=save_transaction@entry=true, transaction_id=transaction_id@entry=0x0,
    errmsg=errmsg@entry=0x7ffc0ced38d0 "", errmsg_len=errmsg_len@entry=8192) at lib/northbound.c:906
#19 0x00007f7d6e61403d in nb_candidate_commit (context=context@entry=0x7ffc0ced38c0, candidate=<optimized out>, save_transaction=save_transaction@entry=true,
    comment=comment@entry=0x0, transaction_id=transaction_id@entry=0x0, errmsg=errmsg@entry=0x7ffc0ced38d0 "", errmsg_len=8192) at lib/northbound.c:938
#20 0x00007f7d6e616ec9 in nb_cli_classic_commit (vty=0x5647f7cae160) at lib/northbound_cli.c:64
#21 0x00007f7d6e6176a8 in nb_cli_apply_changes (vty=0x5647f7cae160, xpath_base_fmt=<optimized out>) at lib/northbound_cli.c:268
#22 0x00007f7d6e5d918e in cmd_execute_command_real (vline=vline@entry=0x5647f7cae140, vty=vty@entry=0x5647f7cae160, cmd=cmd@entry=0x0, up_level=up_level@entry=0,
    filter=FILTER_RELAXED) at lib/command.c:971
#23 0x00007f7d6e5d951d in cmd_execute_command (vline=vline@entry=0x5647f7cae140, vty=vty@entry=0x5647f7cae160, cmd=cmd@entry=0x0, vtysh=vtysh@entry=0)
    at lib/command.c:1030
#24 0x00007f7d6e5d9770 in cmd_execute (vty=vty@entry=0x5647f7cae160, cmd=cmd@entry=0x5647f7cb48a0 "no ip router isis 10", matched=matched@entry=0x0,
    vtysh=vtysh@entry=0) at lib/command.c:1198
#25 0x00007f7d6e6485e6 in vty_command (vty=vty@entry=0x5647f7cae160, buf=0x5647f7cb48a0 "no ip router isis 10") at lib/vty.c:483
#26 0x00007f7d6e648d01 in vty_execute (vty=vty@entry=0x5647f7cae160) at lib/vty.c:1246
#27 0x00007f7d6e64ba40 in vtysh_read (thread=<optimized out>) at lib/vty.c:2090
#28 0x00007f7d6e64348d in thread_call (thread=thread@entry=0x7ffc0ced8310) at lib/thread.c:1958
#29 0x00007f7d6e5fd4a8 in frr_run (master=0x5647f79a43d0) at lib/libfrr.c:1184
#30 0x00005647f5b050f3 in main (argc=5, argv=<optimized out>, envp=<optimized out>) at isisd/isis_main.c:273
(gdb) f 3
#3  0x00005647f5b11568 in isis_level2_adj_up (area=area@entry=0x5647f7c89830) at isisd/isis_lsp.c:423
423     isisd/isis_lsp.c: No such file or directory.
(gdb) p node
$1 = (struct listnode *) 0x110
(gdb) f 8
#8  0x00007f7d6e5fe003 in list_delete_all_node (list=0x5647f7c88060) at lib/linklist.c:316
316     lib/linklist.c: No such file or directory.
(gdb) p list->head->data
$2 = (void *) 0x5647f7cabf20
(gdb) p list->head->next->data
$3 = (void *) 0x5647f7c9bb60
(gdb) p list->head->next->next->data
Cannot access memory at address 0x120
(gdb) p list->head->next->next
$4 = (struct listnode *) 0x110

The backtrace provided above pertains to version 8.2.2, but it seems that the same issue exists in the code of the master branch as well.

zhou-run avatar May 13 '24 02:05 zhou-run

The issue lies in isis_circuit_down(), not isis_circuit_up().

Yes, but the list needs to have a deletion hook when it's created. So it might be an issue in timing (or not).

ton31337 avatar May 13 '24 04:05 ton31337

The issue lies in isis_circuit_down(), not isis_circuit_up().

Yes, but the list needs to have a deletion hook when it's created. So it might be an issue in timing (or not).

In addition to the list deletion operation list_delete, if only a specific node in the linked list is to be deleted (as in the implementation of isis_adj_state_change), the function that should be called is listnode_delete or list_delete_node. This does not call *list->del, and the caller is responsible for deleting the data associated with the node. Therefore, it should be fine to create the linked list without adding a deletion hook.

zhou-run avatar May 13 '24 06:05 zhou-run

Someone clearly understanding the IS-IS code should take a look here. /cc @odd22 @louis-6wind @pguibert6WIND maybe?

ton31337 avatar May 17 '24 06:05 ton31337

I'll take a look at it

louis-6wind avatar May 17 '24 08:05 louis-6wind

The problem is that we are building a temporary link-list adj_list containing the up adjacencies. The temporary list is deleted and their nodes are freed but still referenced in u.bc.adjdb[X]. This is a heap-after-free issue.

There's no need to use a temporary link-list. The problem should be fixed in all places where isis_adj_build_up_list() is used to build a temporary list.

Deleting the temporary link-list adj_list only removes the temporary nodes of the temporary link-list, which are separate from the actual nodes of the real link-list u.bc.adjdb[X] . Even though the data stored in the nodes node->data is the same memory, the temporary link-list adj_list does not have an operation to delete the node->data. Why would deleting the temporary link-list adj_list affect the real link-list u.bc.adjdb[X] ? On the contrary, the actual operation of deleting node->data is performed when deleting the real link-list u.bc.adjdb[X] because it points to the delete operation isis_delete_adj() . I believe this is where the heap-after-free issue lies.

zhou-run avatar May 17 '24 11:05 zhou-run

Deleting the temporary link-list adj_list only removes the temporary nodes of the temporary link-list, which are separate from the actual nodes of the real link-list u.bc.adjdb[X] . Even though the data stored in the nodes node->data is the same memory, the temporary link-list adj_list does not have an operation to delete the node->data. Why would deleting the temporary link-list adj_list affect the real link-list u.bc.adjdb[X] ?

you are right. I have misinterpreted the backtrace.

I have just loaded frr-8.2.2 tag to interpret the backtrace. I see that the issue is because in list_delete_all_node(), the (*list->del)hook (hook_call_isis_adj_state_change_hook here ) is parsing the link-list that is being deleted via isis_level2_adj_up() -> for (ALL_LIST_ELEMENTS_RO(adjdb, node, adj))

void list_delete_all_node(struct list *list)
{
[...]
	for (node = list->head; node; node = next) {
		next = node->next;
		if (*list->del)
			(*list->del)(node->data);
		listnode_free(list, node);
	}
bool isis_level2_adj_up(struct isis_area *area)
{
[...]
	for (ALL_LIST_ELEMENTS_RO(area->circuit_list, cnode, circuit)) {
		if (circuit->circ_type == CIRCUIT_T_BROADCAST) {
			adjdb = circuit->u.bc.adjdb[1];
[...]
			for (ALL_LIST_ELEMENTS_RO(adjdb, node, adj)) {
				if (adj->level != ISIS_ADJ_LEVEL1
				    && adj->adj_state == ISIS_ADJ_UP)
					return true;

This is not a case of heap-use-after but I am not sure we should allow accessing a list being deleted.

louis-6wind avatar May 17 '24 12:05 louis-6wind

The issue is in lsp_handle_adj_state_change()

static int lsp_handle_adj_state_change(struct isis_adjacency *adj)
{
	lsp_regenerate_schedule(adj->circuit->area, IS_LEVEL_1 | IS_LEVEL_2, 0);

	/* when an adjacency state changes determine if we need to
	 * change attach_bits in other area's LSPs
	 */
	isis_reset_attach_bit(adj);

	return 0;
}

lsp_regenerate_schedule() delays the regeneration of the LSP so that it is done once the u.bc.adjdb[X] adjacency list is fully destroyed. isis_reset_attach_bit() should be called during the LSP generation process, not in lsp_handle_adj_state_change() directly. This call was introduced by https://github.com/FRRouting/frr/pull/7998. @lynne-volta could you help @zhou-run to rework this.

louis-6wind avatar May 17 '24 13:05 louis-6wind

lsp_regenerate_schedule() delays the regeneration of the LSP so that it is done once the u.bc.adjdb[X] adjacency list is fully destroyed. isis_reset_attach_bit() should be called during the LSP generation process, not in lsp_handle_adj_state_change() directly. This call was introduced by #7998. @lynne-volta could you help @zhou-run to rework this.

After the modification in #9024, I don't see the significance of isis_reset_attach_bit() because lsp_handle_adj_state_change() unconditionally calls lsp_regenerate_schedule. @louis-6wind @idryzhov

zhou-run avatar May 28 '24 03:05 zhou-run

lsp_regenerate_schedule() delays the regeneration of the LSP so that it is done once the u.bc.adjdb[X] adjacency list is fully destroyed. isis_reset_attach_bit() should be called during the LSP generation process, not in lsp_handle_adj_state_change() directly. This call was introduced by #7998. @lynne-volta could you help @zhou-run to rework this.

After the modification in #9024, I don't see the significance of isis_reset_attach_bit() because lsp_handle_adj_state_change() unconditionally calls lsp_regenerate_schedule. @louis-6wind @idryzhov

Yes isis_reset_attach_bit() is useless. The function and its call can be removed. This is the fix

louis-6wind avatar May 28 '24 08:05 louis-6wind

lsp_regenerate_schedule() delays the regeneration of the LSP so that it is done once the u.bc.adjdb[X] adjacency list is fully destroyed. isis_reset_attach_bit() should be called during the LSP generation process, not in lsp_handle_adj_state_change() directly. This call was introduced by #7998. @lynne-volta could you help @zhou-run to rework this.

After the modification in #9024, I don't see the significance of isis_reset_attach_bit() because lsp_handle_adj_state_change() unconditionally calls lsp_regenerate_schedule. @louis-6wind @idryzhov

Yes isis_reset_attach_bit() is useless. The function and its call can be removed. This is the fix

This is the root cause, and I will fix it. Additionally, in my view, for the sake of safety, it should also detach the link when traversing the linked list to remove a node.

zhou-run avatar May 28 '24 08:05 zhou-run

Please merge both fixes

louis-6wind avatar May 28 '24 09:05 louis-6wind

Please merge both fixes

Done.

zhou-run avatar May 28 '24 09:05 zhou-run