otp Fix net_kernel crash

We encountered the following crash of the net_kernel process in the wild:

2023-12-18T10:29:56.719469+00:00 [error] Garbage collecting distribution entry for node '...' in state: pending connect
2023-12-18T10:29:57.914445+00:00 [error] Generic server net_kernel terminating. Reason: {bad_return_value,{'EXIT',{badarg,[{erts_internal,abort_pending_connection,['....',{3208,#Ref<0.2415753947.1844838460.100366>}],[]},{net_kernel,pending_nodedown,5,[{file,"net_kernel.erl"},{line,1144}]},{net_kernel,conn_own_exit,3,[{file,"net_kernel.erl"},{line,1054}]},{net_kernel,do_handle_exit,3,[{file,"net_kernel.erl"},{line,1021}]},{net_kernel,handle_exit,3,[{file,"net_kernel.erl"},{line,1016}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,689}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,765}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}}}. Last message: {'EXIT',<0.15226.108>,shutdown}. State: ....

I don't yet have the full picture of the race condition between the try_delete_dist_entry function in erl_node_tables.c and net_kernel:handle_info. One thing is clear, though: net_kernel:handle_exit tries to ignore this error (and similar ones) by catching all exceptions, but fails to do so. It hands the caught exception back to gen_server as the callback's return value, and gen_server rejects it with bad_return_value.
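A minimal sketch of the failure mode described above, assuming handle_exit wraps its work in a bare `catch` expression. The module catch_demo, the event atoms, and the map-based state are hypothetical stand-ins, not the actual net_kernel code:

```erlang
-module(catch_demo).
-export([handle_exit/2]).

%% Hypothetical stand-in for net_kernel:handle_exit/3.
%% A bare `catch` returns a thrown value unchanged, but turns a
%% runtime error into the tuple {'EXIT', {Reason, Stacktrace}}.
handle_exit(Event, State) ->
    catch do_handle_exit(Event, State).

%% Non-local return via throw: `catch` yields the thrown tuple as is.
do_handle_exit(early_return, State) ->
    throw({noreply, State#{handled => true}});
%% Runtime error, standing in for the badarg raised by
%% erts_internal:abort_pending_connection/2 in the log above.
do_handle_exit(bad_call, _State) ->
    erlang:error(badarg);
do_handle_exit(_Other, State) ->
    {noreply, State}.
```

If a gen_server callback returned the result of catch_demo:handle_exit(bad_call, State), the {'EXIT', {badarg, _}} tuple would not be a valid callback return value, so gen_server would terminate with {bad_return_value, {'EXIT', {badarg, _}}}, which has the same shape as the reason in the log.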

This crash of the net_kernel process eventually escalates to kernel_sup and brings down the whole node.

This is a simple fix for the original workaround.

Dec 18 '23 16:12 ieQu1

With what OTP version did you encounter that crash?

Dec 19 '23 10:12 sverker

> With what OTP version did you encounter that crash?

OTP 24.3 [erts-12.3.2.2]. The related code hasn't been changed for a very long time, though, so all recent releases are affected.

Dec 19 '23 11:12 ieQu1

Was about to give this year-old PR some love but found it faulty.

The functions called in do_handle_exit() use throw() to return sensible {noreply,NewState} or {stop,_,_} tuples from handle_info(). This fix will catch that throw and, in the noreply case, lose the NewState.
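The difference can be sketched as follows, assuming the fix catches every exception class and falls back to the old state. The module throw_demo and both wrapper functions are hypothetical illustrations, not the PR's actual diff:

```erlang
-module(throw_demo).
-export([naive/2, preserving/2]).

%% Assumed shape of a catch-all fix: every exception class, including
%% throw, is swallowed and the OLD state is returned.
naive(Event, State) ->
    try do_handle_exit(Event, State)
    catch
        _Class:_Reason -> {noreply, State}
    end.

%% Alternative: pass thrown result tuples through untouched and let
%% real errors propagate, so the process crashes instead of carrying on.
preserving(Event, State) ->
    try do_handle_exit(Event, State)
    catch
        throw:Result -> Result
    end.

%% Hypothetical helper that uses throw for a non-local return.
do_handle_exit(early_return, State) ->
    throw({noreply, State#{handled => true}});
do_handle_exit(bad_call, _State) ->
    erlang:error(badarg);
do_handle_exit(_Other, State) ->
    {noreply, State}.
```

With an empty map as the state, throw_demo:naive(early_return, #{}) returns {noreply, #{}} and drops the update, while throw_demo:preserving(early_return, #{}) returns {noreply, #{handled => true}}. Whether errors such as badarg should then crash the process or be handled explicitly is a separate design decision.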

And in the case of an error, I'm not sure it is safe to let net_kernel catch it and survive. It could be better to let the node restart instead of continuing to run with an inconsistent net_kernel state.

Feb 12 '25 20:02 sverker