otp icon indicating copy to clipboard operation
otp copied to clipboard

The net_kernel:setopts call can deadlock the net_kernel process.

Open villiros opened this issue 2 years ago • 2 comments

Describe the bug The net_kernel:setopts(new, ...) can deadlock the net_kernel process permanently. This happens if the call is made concurrently with an incoming dist connection handshake.

To Reproduce With the official docker container:

$ docker run --name otpbug -h otpbug -it erlang:25.0.2 /bin/bash
$ erl -sname foo
% Repeatedly call setopts(new)
> spawn_link(fun D() -> net_kernel:setopts(new, [{buffer, 10000}]), timer:sleep(10), D() end).

In another shell in the container:

$ docker exec -ti otpbug /bin/bash
% Repeatedly connect&disconnect with erl_call
$ while true; do /usr/local/lib/erlang/lib/erl_interface-5.3/bin/erl_call -n foo@otpbug -a 'erlang now []' ; echo; sleep 0.5; done

The erl_call loop quickly gets stuck and the node no longer accepts dist connections.

Affected versions We've seen this in our code on R22, and I've tried 24 and 25 with same result. Bug probably was there since setopts was introduced?

Additional context

I think what happens is that net_kernel:setopts_new_1/1 is calling into pending connections. The incoming one is pending (doing handshake), but at the end of the handshake it calls into net_kernel (dist_util:mark_nodeup/2), so it's a deadlock.

For completeness, here's the process_info and backtrace of net_kernel:

(foo@otpbug)2> process_info(whereis(net_kernel)).
[{registered_name,net_kernel},
 {current_function,{net_kernel,call_owner,2}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,22},
 {links,[<0.62.0>,<0.160.0>,#Port<0.3>,<0.63.0>,<0.58.0>]},
 {dictionary,[{longnames,false},
              {'$initial_call',{net_kernel,init,1}},
              {'$ancestors',[net_sup,kernel_sup,<0.47.0>]}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,max},
 {group_leader,<0.46.0>},
 {total_heap_size,752},
 {heap_size,376},
 {stack_size,30},
 {reductions,53487},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,96}]},
 {suspending,[]}]
(foo@otpbug)3> {_, X} = process_info(whereis(net_kernel), backtrace), io:format("~s", [X]).
Program counter: 0x00007f274e2781d0 (net_kernel:call_owner/2 + 344)
y(0)     []
y(1)     []
y(2)     #Ref<0.2457213282.1092354051.122147>

0x00007f2727281eb8 Return addr 0x00007f274e071e9c (lists:foreach_1/2 + 148)
y(0)     []
y(1)     #Fun<net_kernel.9.28134678>

0x00007f2727281ed0 Return addr 0x00007f274e278a4c (net_kernel:setopts_new_1/1 + 684)

0x00007f2727281ed8 Return addr 0x00007f274e2696c4 (net_kernel:handle_call/3 + 1356)
y(0)     {state,'foo@otpbug',shortnames,{tick,<0.63.0>,60000,4},7000,sys_dist,#{<0.160.0>=>'c17@otpbug'},#{},#{},[{listen,#Port<0.3>,<0.62.0>,{net_address,{{0,0,0,0},40037},"otpbug",tcp,inet},inet_tcp_dist}],[],0,all,#{},net_sup}
y(1)     {<0.140.0>,#Ref<0.2457213282.1092354051.122143>}

0x00007f2727281ef0 Return addr 0x00007f274e1acb1c (gen_server:try_handle_call/4 + 244)
y(0)     []
y(1)     []
y(2)     Catch 0x00007f274e1acb70 (gen_server:try_handle_call/4 + 328)

0x00007f2727281f10 Return addr 0x00007f274e1ad028 (gen_server:handle_msg/6 + 200)
y(0)     []
y(1)     []
y(2)     []
y(3)     {setopts,new,[{buffer,10000}]}
y(4)     {<0.140.0>,#Ref<0.2457213282.1092354051.122143>}
y(5)     infinity
y(6)     net_kernel
y(7)     {state,'foo@otpbug',shortnames,{tick,<0.63.0>,60000,4},7000,sys_dist,#{<0.160.0>=>'c17@otpbug'},#{},#{},[{listen,#Port<0.3>,<0.62.0>,{net_address,{{0,0,0,0},40037},"otpbug",tcp,inet},inet_tcp_dist}],[],0,all,#{},net_sup}
y(8)     net_kernel
y(9)     <0.58.0>

0x00007f2727281f68 Return addr 0x00007f274e094740 (proc_lib:init_p_do_apply/3 + 192)
y(0)     []
y(1)     []
y(2)     Catch 0x00007f274e09475e (proc_lib:init_p_do_apply/3 + 222)

0x00007f2727281f88 Return addr 0x00007f274df94c30 (<terminate process normally>)

The incoming connection's backtrace:

Program counter: 0x00007f274e363dcc (dist_util:mark_nodeup/3 + 332)
y(0)     'c17@otpbug'
y(1)     []
y(2)     []
y(3)     <0.61.0>
y(4)     {hs_data,<0.61.0>,'c17@otpbug','foo@otpbug',#Port<0.31>,<0.161.0>,17297788828,[],undefined,17297788828,true,fun inet_tcp:send/2,fun inet_tcp:recv/3,#Fun<inet_tcp_dist.0.130765917>,#Fun<inet_tcp_dist.1.130765917>,#Fun<inet_tcp_dist.2.130765917>,#Fun<inet_tcp_dist.3.130765917>,#Fun<inet_tcp_dist.4.130765917>,fun inet_tcp_dist:getstat/1,normal,fun inet_tcp_dist:setopts/2,fun inet_tcp_dist:getopts/2,undefined,0,0,0,undefined,23251}

0x00007f2727266790 Return addr 0x00007f274e362e90 (dist_util:connection/1 + 464)
y(0)     {24,#Ref<0.2457213282.1092485123.120891>}
y(1)     []
y(2)     'c17@otpbug'
y(3)     hidden
y(4)     #Port<0.31>
y(5)     {hs_data,<0.61.0>,'c17@otpbug','foo@otpbug',#Port<0.31>,<0.161.0>,17297788828,[],undefined,17297788828,true,fun inet_tcp:send/2,fun inet_tcp:recv/3,#Fun<inet_tcp_dist.0.130765917>,#Fun<inet_tcp_dist.1.130765917>,#Fun<inet_tcp_dist.2.130765917>,#Fun<inet_tcp_dist.3.130765917>,#Fun<inet_tcp_dist.4.130765917>,fun inet_tcp_dist:getstat/1,normal,fun inet_tcp_dist:setopts/2,fun inet_tcp_dist:getopts/2,undefined,0,0,0,undefined,23251}

0x00007f27272667c8 Return addr 0x00007f274df94c30 (<terminate process normally>)

villiros avatar Jul 04 '22 21:07 villiros

Thanks for the detailed bug report! Yes, this bug has existed since the introduction of the setopts() functionality. I'll begin working on a PR fixing it,

rickard-green avatar Aug 11 '22 13:08 rickard-green

The PR #6216 should fix this issue (not very well tested yet, though). Unless something unexpected happens it will be part of the next patch.

rickard-green avatar Aug 11 '22 23:08 rickard-green

Fixed in patches OTP 25.0.4 and OTP 24.3.4.3 which we released today.

rickard-green avatar Aug 18 '22 19:08 rickard-green

Awesome. Thank you very much :)

villiros avatar Aug 18 '22 19:08 villiros