otp
otp copied to clipboard
The net_kernel:setopts call can deadlock the net_kernel process.
Describe the bug
The net_kernel:setopts(new, ...)
can deadlock the net_kernel
process permanently. This happens if the call is made concurrently with an incoming dist connection handshake.
To Reproduce With the official docker container:
$ docker run --name otpbug -h otpbug -it erlang:25.0.2 /bin/bash
$ erl -sname foo
% Repeatedly call setopts(new)
> spawn_link(fun D() -> net_kernel:setopts(new, [{buffer, 10000}]), timer:sleep(10), D() end).
In another shell in the container:
$ docker exec -ti otpbug /bin/bash
% Repeatedly connect&disconnect with erl_call
$ while true; do /usr/local/lib/erlang/lib/erl_interface-5.3/bin/erl_call -n foo@otpbug -a 'erlang now []' ; echo; sleep 0.5; done
The erl_call
loop quickly gets stuck and the node no longer accepts dist connections.
Affected versions
We've seen this in our code on R22, and I've tried 24 and 25 with same result. Bug probably was there since setopts
was introduced?
Additional context
I think what happens is that net_kernel:setopts_new_1/1
is calling into pending connections. The incoming one is pending (doing handshake), but at the end of the handshake it calls into net_kernel
(dist_util:mark_nodeup/2
), so it's a deadlock.
For completeness, here's the process_info and backtrace of net_kernel
:
(foo@otpbug)2> process_info(whereis(net_kernel)).
[{registered_name,net_kernel},
{current_function,{net_kernel,call_owner,2}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,22},
{links,[<0.62.0>,<0.160.0>,#Port<0.3>,<0.63.0>,<0.58.0>]},
{dictionary,[{longnames,false},
{'$initial_call',{net_kernel,init,1}},
{'$ancestors',[net_sup,kernel_sup,<0.47.0>]}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,max},
{group_leader,<0.46.0>},
{total_heap_size,752},
{heap_size,376},
{stack_size,30},
{reductions,53487},
{garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,96}]},
{suspending,[]}]
(foo@otpbug)3> {_, X} = process_info(whereis(net_kernel), backtrace), io:format("~s", [X]).
Program counter: 0x00007f274e2781d0 (net_kernel:call_owner/2 + 344)
y(0) []
y(1) []
y(2) #Ref<0.2457213282.1092354051.122147>
0x00007f2727281eb8 Return addr 0x00007f274e071e9c (lists:foreach_1/2 + 148)
y(0) []
y(1) #Fun<net_kernel.9.28134678>
0x00007f2727281ed0 Return addr 0x00007f274e278a4c (net_kernel:setopts_new_1/1 + 684)
0x00007f2727281ed8 Return addr 0x00007f274e2696c4 (net_kernel:handle_call/3 + 1356)
y(0) {state,'foo@otpbug',shortnames,{tick,<0.63.0>,60000,4},7000,sys_dist,#{<0.160.0>=>'c17@otpbug'},#{},#{},[{listen,#Port<0.3>,<0.62.0>,{net_address,{{0,0,0,0},40037},"otpbug",tcp,inet},inet_tcp_dist}],[],0,all,#{},net_sup}
y(1) {<0.140.0>,#Ref<0.2457213282.1092354051.122143>}
0x00007f2727281ef0 Return addr 0x00007f274e1acb1c (gen_server:try_handle_call/4 + 244)
y(0) []
y(1) []
y(2) Catch 0x00007f274e1acb70 (gen_server:try_handle_call/4 + 328)
0x00007f2727281f10 Return addr 0x00007f274e1ad028 (gen_server:handle_msg/6 + 200)
y(0) []
y(1) []
y(2) []
y(3) {setopts,new,[{buffer,10000}]}
y(4) {<0.140.0>,#Ref<0.2457213282.1092354051.122143>}
y(5) infinity
y(6) net_kernel
y(7) {state,'foo@otpbug',shortnames,{tick,<0.63.0>,60000,4},7000,sys_dist,#{<0.160.0>=>'c17@otpbug'},#{},#{},[{listen,#Port<0.3>,<0.62.0>,{net_address,{{0,0,0,0},40037},"otpbug",tcp,inet},inet_tcp_dist}],[],0,all,#{},net_sup}
y(8) net_kernel
y(9) <0.58.0>
0x00007f2727281f68 Return addr 0x00007f274e094740 (proc_lib:init_p_do_apply/3 + 192)
y(0) []
y(1) []
y(2) Catch 0x00007f274e09475e (proc_lib:init_p_do_apply/3 + 222)
0x00007f2727281f88 Return addr 0x00007f274df94c30 (<terminate process normally>)
The incoming connection's backtrace:
Program counter: 0x00007f274e363dcc (dist_util:mark_nodeup/3 + 332)
y(0) 'c17@otpbug'
y(1) []
y(2) []
y(3) <0.61.0>
y(4) {hs_data,<0.61.0>,'c17@otpbug','foo@otpbug',#Port<0.31>,<0.161.0>,17297788828,[],undefined,17297788828,true,fun inet_tcp:send/2,fun inet_tcp:recv/3,#Fun<inet_tcp_dist.0.130765917>,#Fun<inet_tcp_dist.1.130765917>,#Fun<inet_tcp_dist.2.130765917>,#Fun<inet_tcp_dist.3.130765917>,#Fun<inet_tcp_dist.4.130765917>,fun inet_tcp_dist:getstat/1,normal,fun inet_tcp_dist:setopts/2,fun inet_tcp_dist:getopts/2,undefined,0,0,0,undefined,23251}
0x00007f2727266790 Return addr 0x00007f274e362e90 (dist_util:connection/1 + 464)
y(0) {24,#Ref<0.2457213282.1092485123.120891>}
y(1) []
y(2) 'c17@otpbug'
y(3) hidden
y(4) #Port<0.31>
y(5) {hs_data,<0.61.0>,'c17@otpbug','foo@otpbug',#Port<0.31>,<0.161.0>,17297788828,[],undefined,17297788828,true,fun inet_tcp:send/2,fun inet_tcp:recv/3,#Fun<inet_tcp_dist.0.130765917>,#Fun<inet_tcp_dist.1.130765917>,#Fun<inet_tcp_dist.2.130765917>,#Fun<inet_tcp_dist.3.130765917>,#Fun<inet_tcp_dist.4.130765917>,fun inet_tcp_dist:getstat/1,normal,fun inet_tcp_dist:setopts/2,fun inet_tcp_dist:getopts/2,undefined,0,0,0,undefined,23251}
0x00007f27272667c8 Return addr 0x00007f274df94c30 (<terminate process normally>)
Thanks for the detailed bug report! Yes, this bug has existed since the introduction of the setopts()
functionality. I'll begin working on a PR fixing it,
The PR #6216 should fix this issue (not very well tested yet, though). Unless something unexpected happens it will be part of the next patch.
Fixed in patches OTP 25.0.4 and OTP 24.3.4.3 which we released today.
Awesome. Thank you very much :)