Fix net_kernel crash
We encountered the following crash of the net_kernel process in the wild:
2023-12-18T10:29:56.719469+00:00 [error] Garbage collecting distribution entry for node '...' in state: pending connect
2023-12-18T10:29:57.914445+00:00 [error] Generic server net_kernel terminating. Reason: {bad_return_value,{'EXIT',{badarg,[{erts_internal,abort_pending_connection,['....',{3208,#Ref<0.2415753947.1844838460.100366>}],[]},{net_kernel,pending_nodedown,5,[{file,"net_kernel.erl"},{line,1144}]},{net_kernel,conn_own_exit,3,[{file,"net_kernel.erl"},{line,1054}]},{net_kernel,do_handle_exit,3,[{file,"net_kernel.erl"},{line,1021}]},{net_kernel,handle_exit,3,[{file,"net_kernel.erl"},{line,1016}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,689}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,765}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}}}. Last message: {'EXIT',<0.15226.108>,shutdown}. State: ....
Right now I don't have the full picture of the mechanism behind the race condition between the try_delete_dist_entry function in erl_node_tables.c and net_kernel:handle_info.
One thing is clear though: the net_kernel:handle_exit function tries to ignore this and similar errors by catching all exceptions, but fails to do so. It passes the caught exception on to gen_server as its return value, and gen_server rejects it with bad_return_value.
This crash of the net_kernel process eventually escalates to kernel_sup and brings down the whole node.
This is a simple fix for the original workaround.
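To illustrate the failure mode described above, here is a minimal sketch. It is not the net_kernel source: the module catch_demo and the functions do_work/2 and handle/2 are hypothetical stand-ins. It only shows how a bare `catch` turns an error into an `{'EXIT', _}` value, which a gen_server callback would then hand back as an invalid return value.

```erlang
-module(catch_demo).
-export([run/0]).

%% Hypothetical stand-in for the work done while handling an 'EXIT' message.
do_work(ok, State)          -> {noreply, State};
do_work(throw_reply, State) -> throw({noreply, State});
do_work(crash, _State)      -> erlang:error(badarg).

%% A bare `catch` returns thrown terms as-is, but converts an error
%% into {'EXIT', {Reason, Stacktrace}} instead of suppressing it.
handle(Kind, State) ->
    catch do_work(Kind, State).

run() ->
    {noreply, s} = handle(ok, s),
    {noreply, s} = handle(throw_reply, s),
    %% Not a valid gen_server callback result; gen_server would stop
    %% with {bad_return_value, {'EXIT', ...}} if it received this.
    {'EXIT', {badarg, _Stack}} = handle(crash, s),
    ok.
```

Calling `catch_demo:run()` exercises all three cases; only the last one yields the `{'EXIT', _}` tuple seen in the crash report.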
With what OTP version did you encounter that crash?
OTP 24.3 [erts-12.3.2.2]. The related code hasn't been changed for a very long time, though, so all recent releases are affected.
Was about to give this year-old PR some love but found it faulty.
The functions called in do_handle_exit() use throw() to return sensible {noreply,NewState} or {stop,_,_} tuples from handle_info(). This fix will catch that throw and, in the noreply case, lose the NewState.
And in case of an error, I'm not sure it is safe to let net_kernel catch it and survive. It could be better to let the node restart instead of continuing to live with an inconsistent net_kernel state.
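The concern can be sketched as follows. This is an illustration under assumptions, not the code of this PR or of net_kernel: the module throw_demo and the functions step/2, lossy/2 and preserving/2 are hypothetical. It contrasts a catch-all clause, which discards a thrown state, with a clause that matches only the throw class and lets errors propagate.

```erlang
-module(throw_demo).
-export([run/0]).

%% Hypothetical helper that reports its result by throwing an updated state.
step(update, State) -> throw({noreply, State#{updated => true}});
step(crash, _State) -> erlang:error(badarg).

%% Catch-all: swallows the throw (and any error) and returns the OLD state.
lossy(Kind, State) ->
    try step(Kind, State)
    catch _:_ -> {noreply, State}
    end.

%% Class-aware: a throw is treated as a return value; errors are not caught.
preserving(Kind, State) ->
    try step(Kind, State)
    catch throw:Result -> Result
    end.

run() ->
    S0 = #{updated => false},
    {noreply, #{updated := false}} = lossy(update, S0),      % new state lost
    {noreply, #{updated := true}}  = preserving(update, S0), % new state kept
    {noreply, #{updated := false}} = lossy(crash, S0),       % error hidden
    ok = try preserving(crash, S0), unexpected
         catch error:badarg -> ok                            % error propagates
         end,
    ok.
```

Whether an error should propagate and restart the node, as in preserving/2, or be swallowed, as in lossy/2, is exactly the open question above; the sketch only makes the two behaviours visible.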