otp icon indicating copy to clipboard operation
otp copied to clipboard

SSH: missing entry in channel cache

Open etnt opened this issue 1 year ago • 6 comments

https://github.com/erlang/otp/blob/6d4731b9b5c70686bfaea67f2cd6f4f912e6da06/lib/ssh/src/ssh_connection.erl#L1029

Look at the line above where the code try to access: Channel#channel.remote_id I seem to have a case where Channel turns up as undefined , hence causing a crash here.

I prepared the ssh_connection.erlcode like this:

handle_msg(#ssh_msg_channel_request{recipient_channel = ChannelId,
                                    request_type = "exit-signal",
                                    want_reply = false,
                                    data = Data},
           #connection{channel_cache = Cache} = Connection0, _) ->
    <<?DEC_BIN(SigName, _SigLen),
      ?BOOLEAN(_Core),
      ?DEC_BIN(Err, _ErrLen),
      ?DEC_BIN(Lang, _LangLen)>> = Data,
    Channel = ssh_client_channel:cache_lookup(Cache, ChannelId),
    io:format(">>> ~p(~p): ChannelId=~p , Channel=~p~n",[?MODULE,?LINE,ChannelId,Channel]),
    io:format(">>> ~p(~p): CHANNEL-CACHE:~n~p~n",[?MODULE,?LINE,ets:tab2list(Cache)]),
    RemoteId =  Channel#channel.remote_id,
    io:format(">>> ~p(~p): RemoteId=~p~n",[?MODULE,?LINE,RemoteId]),
    {Reply, Connection} =  reply_msg(Channel, Connection0,
                                     {exit_signal, ChannelId,
                                      binary_to_list(SigName),
                                      binary_to_list(Err),
                                      binary_to_list(Lang)}),
    CloseMsg = channel_close_msg(RemoteId),
    {[{connection_reply, CloseMsg}|Reply], Connection};

When running, I see this:

>>> ssh_connection(605): ChannelId=1 , Channel=undefined
>>> ssh_connection(606): CHANNEL-CACHE:
[{channel,"session","subsystem",<0.852.0>,undefined,0,647044,8316,32768,false,
         0,2096622,32768,false,
         {[],[]}}]

Note: the Channel is undefined , so it seems like the code is crashing at the record reference of the remote_id since we don't get the last printout of the RemoteId.

A little about my setup, which is a bit complicated...

  1. I'm setting up an SSH connection from one Erlang node to another (well not quite, see below)
  2. The connection is not "direct"; it is setup toward an external OpenSSH server and we are running a NETCONF subsystem module (that in its turn talks TCP toward the other Erlang node).
  3. We make a Long Running request over the connection on Channel(0) and then a Short Running request over the same connection using Channel(1). It is when the Short Running request is finished this happen causing the Channel(0) to also be closed thus interrupting the Long Running request.

It's kind of a bad description perhaps but anyway, this fact that the code can crash if nothing is found in the Channel Cache doesn't seem good, or?

I can add that in the body of that exit-signal message I get "PIPE" which corresponds to one of the signal names in RFC-4254.

etnt avatar Oct 11 '24 10:10 etnt

yes. code looks like it might crash there.

but I would like to understand more your case

It is when the Short Running request is finished this happen causing the Channel(0) to also be closed thus interrupting the Long Running request

Is this expected? Why this happens?

Can you collect and provide ssh_dbg traces for that situation? You can enable interesting content it with: ssh_dbg:on([channels, channel_event])

or ssh_dbg:on() to collect all dbg output (if it does not produce too much data)

u3s avatar Oct 11 '24 15:10 u3s

To answer your first question. The Long Running request over Cannel(0) should not be terminated like this. It should be able to finish on its own.

Ok, and I forgot to mention, I'm running OTP-22:

# erl
Erlang/OTP 22 [erts-10.7.2.9] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]

But that Cache lookup code looks the same as in the latest OTP, or?

Anyway, I ran ssh_dbg:on() which produced a lot of output, but here is an interesting part of it; note that internal error:

16:07:30.101501 <0.855.0> Received SSH_MSG_CHANNEL_REQUEST:                                                                                                                                                
  recipient_channel: 1                                                                                                                                                                                     
  request_type: "exit-signal"                                                                                                                                                                              
  want_reply: false                                                                                                                                                                                        
  data: <<0,0,0,4,80,73,80,69,0,0,0,0,0,0,0,0,0>>                                                                                                                                                          
                                                                                                                                                                                                           
                                                                                                                                                                                                           
16:07:30.101503 <0.855.0> Connection event result                                                                                                                                                          
{keep_state,"#data{}",                                                                                                                                                                                     
    [{next_event,internal,prepare_next_packet},                                                                                                                                                            
     {next_event,internal,                                                                                                                                                                                 
         {conn_msg,                                                                                                                                                                                        
             {ssh_msg_channel_request,1,"exit-signal",false,                                                                                                                                               
                 <<0,0,0,4,80,73,80,69,0,0,0,0,0,0,0,0,0>>}}}]}                                                                                                                                            
                                                                                                                                                                                                           
                                                                                                                                                                                                           
16:07:30.101528 <0.855.0> Connection event                                                                                                                                                                 
EventType: internal                                                                                                                                                                                        
EventContent: prepare_next_packet                                                                                                                                                                          
State: {connected,client}                                                                                                                                                                                  
                                                                                                                                                                                                           
                                                                                                                                                                                                           
16:07:30.101559 <0.855.0> Connection event result                                                                                                                                                          
keep_state_and_data                                                                                                                                                                                        
                                                                                                                                                                                                           
                                                                                                                                                                                                           
16:07:30.101563 <0.855.0> Connection event                                                                                                                                                                 
EventType: internal                                                                                                                                                                                        
EventContent: {conn_msg,                                                                                                                                                                                   
                  {ssh_msg_channel_request,1,"exit-signal",false,                                                                                                                                          
                      <<0,0,0,4,80,73,80,69,0,0,0,0,0,0,0,0,0>>}}                                                                                                                                          
State: {connected,client}                                                                                                                                                                                  
                                                                                                                                                                                                           
                                                                                                                                                                                                           
16:07:30.101804 <0.855.0> Disconnecting:                                                                                                                                                                   
 Module = ssh_connection_handler, Line = 1186, StateName = {connected,client},                                                                                                                             
 Code = 11, Reason = "By application",                                                                                                                                                                     
 DetailedText =                                                                                                                                                                                            
 "Internal error: error:{badrecord,channel}"                                                                                                                                                               
                                                                                                                                                                                                           
16:07:30.101835 <0.855.0> Going to send SSH_MSG_DISCONNECT:                                                                                                                                                
  code: 11                                                                                                                                                                                                 
  description: "By application"                                                                                                                                                                            
  language: undefined

                         
16:07:30.101843 <0.855.0> Going to send plain text bytes (shown before encryption):
<<1,0,0,0,11,0,0,0,14,66,121,32,97,112,112,108,105,99,97,116,105,111,110,0,0,0,
  0>>
                         
16:07:30.101912 <0.858.0> Server Channel (Id=0, CB=netconf_ssh_client) got info:
Info: {ssh_cm,<0.855.0>,{closed,0}}

etnt avatar Oct 11 '24 16:10 etnt

fyi: I made a quick fix, ignoring the cache miss, then my code seem to run as expected:

handle_msg(#ssh_msg_channel_request{recipient_channel = ChannelId,
                                    request_type = "exit-signal",
                                    want_reply = false,
                                    data = Data},
           #connection{channel_cache = Cache} = Connection0, _) ->
    <<?DEC_BIN(SigName, _SigLen),
      ?BOOLEAN(_Core),
      ?DEC_BIN(Err, _ErrLen),
      ?DEC_BIN(Lang, _LangLen)>> = Data,
    Channel = ssh_client_channel:cache_lookup(Cache, ChannelId),
    io:format(">>> ~p(~p): ChannelId=~p , Channel=~p~n",[?MODULE,?LINE,ChannelId,Channel]),
    io:format(">>> ~p(~p): CHANNEL-CACHE:~n~p~n",[?MODULE,?LINE,ets:tab2list(Cache)]),
    case Channel of
        #channel{remote_id = RemoteId} ->
            io:format(">>> ~p(~p): RemoteId=~p~n",[?MODULE,?LINE,RemoteId]),
            {Reply, Connection} =  reply_msg(Channel, Connection0,
                                     {exit_signal, ChannelId,
                                      binary_to_list(SigName),
                                      binary_to_list(Err),
                                      binary_to_list(Lang)}),
            CloseMsg = channel_close_msg(RemoteId),
            {[{connection_reply, CloseMsg}|Reply], Connection};
        _ ->
            %% Ignore the Cache miss!
            {[], Connection0}
    end;

etnt avatar Oct 14 '24 06:10 etnt

can you verify the behavior with current OTP-27 version (or at least OTP-26, OTP-25)? this would allow me to increase priority of the issue.

u3s avatar Oct 14 '24 06:10 u3s

can you verify the behavior with current OTP-27 version (or at least OTP-26, OTP-25)? this would allow me to increase priority of the issue.

Ok, I'll try.

etnt avatar Oct 14 '24 07:10 etnt

Ok, I have now tried it with this OTP version:

# erl
Erlang/OTP 27 [erts-15.1.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

Running with ssh_dbg:on() I see this:

09:12:55.025229 <0.975.0> TCP stream data arrived
State: {connected,client}
Socket: #Port<0.63>
TcpData:                                                                                             
Ad | Hexdump                                          | ASCII                  
---+--------------------------------------------------+-----------------
00 | 00 00 00 30 e3 c3 67 db bc 61 c6 c3 56 4c ee 0a  | ...0..g..a..VL..
10 | 79 64 ad 6e 3f a9 0d 5d 48 cf 69 75 42 48 fa b6  | yd.n?..]H.iuBH..
20 | 6d 7f 3b 47 e9 2e 83 a0 67 14 52 2c 3b 8b c1 8e  | m.;G....g.R,;...
30 |                                                  | 
---- skip 56 bytes----                                                                               
                                                                                                     
                                                  
09:12:55.025299 <0.975.0> Received plain text bytes (shown after decryption):
<<98,0,0,0,1,0,0,0,11,101,120,105,116,45,115,105,103,110,97,108,0,0,0,0,4,80,
  73,80,69,0,0,0,0,0,0,0,0,0>>
                                                  
09:12:55.025307 <0.975.0> Received SSH_MSG_CHANNEL_REQUEST:
  recipient_channel: 1
  request_type: "exit-signal"
  want_reply: false                                                                                  
  data: <<0,0,0,4,80,73,80,69,0,0,0,0,0,0,0,0,0>>                                                    
                                                                                                     
                                                                                                     
09:12:55.025364 <0.975.0> Disconnecting:                                                             
 Module = ssh_connection_handler, Line = 833, StateName = {connected,client},
 Code = 11, Reason = "By application",
 DetailedText =                                                                                      
 "Internal error: error:{badrecord,undefined}"

09:12:55.025399 <0.975.0> Going to send SSH_MSG_DISCONNECT:
  code: 11
  description: "By application"
  language: undefined


09:12:55.025405 <0.975.0> Going to send plain text bytes (shown before encryption):
<<1,0,0,0,11,0,0,0,14,66,121,32,97,112,112,108,105,99,97,116,105,111,110,0,0,0,
  0>>

09:12:55.025455 <0.980.0> Server Channel (Id=0, CB=netconf_ssh_client) got info:
Info: {ssh_cm,<0.975.0>,{closed,0}}


09:12:55.025637 <0.975.0> Connection Terminating:
Reason: {shutdown,"By application"}, StateName: {connected,client}
Peer=127.0.0.1:22, Local=127.0.0.1:57612, User="admin"

After applying this fix:

diff --git a/lib/ssh/src/ssh_connection.erl b/lib/ssh/src/ssh_connection.erl
index 12ccc66..90fb46f 100644
--- a/lib/ssh/src/ssh_connection.erl
+++ b/lib/ssh/src/ssh_connection.erl
@@ -1025,15 +1025,18 @@ handle_msg(#ssh_msg_channel_request{recipient_channel = ChannelId,
       ?BOOLEAN(_Core), 
       ?DEC_BIN(Err, _ErrLen),
       ?DEC_BIN(Lang, _LangLen)>> = Data,
-    Channel = ssh_client_channel:cache_lookup(Cache, ChannelId),
-    RemoteId =  Channel#channel.remote_id,
-    {Reply, Connection} =  reply_msg(Channel, Connection0, 
-                                    {exit_signal, ChannelId,
-                                     binary_to_list(SigName),
-                                     binary_to_list(Err),
-                                     binary_to_list(Lang)}),
-    CloseMsg = channel_close_msg(RemoteId),
-    {[{connection_reply, CloseMsg}|Reply], Connection};
+    case ssh_client_channel:cache_lookup(Cache, ChannelId) of
+         #channel{remote_id = RemoteId} = Channel ->
+           {Reply, Connection} =  reply_msg(Channel, Connection0, 
+                                            {exit_signal, ChannelId,
+                                             binary_to_list(SigName),
+                                             binary_to_list(Err),
+                                             binary_to_list(Lang)}),
+            CloseMsg = channel_close_msg(RemoteId),
+            {[{connection_reply, CloseMsg}|Reply], Connection};
+       _ ->
+           {[], Connection0}
+    end;

My long running request runs uninterrupted until its successful completion (as expected).

etnt avatar Oct 15 '24 09:10 etnt

I guess there is some race condition in your case emptying cache earlier. Can you provide a bit more of dbg out to illustrate better what happens before the error?

What are the conditions when this error happens? Is it 100% reproducible?

Thanks for code change proposal, we will test it now.

u3s avatar Oct 28 '24 07:10 u3s

Well, it isn't easy to reproduce. I've got a reproduction setup involving customer data but that only reproduce the problem when running in a Docker container (running Ubuntu 22.04) on my local Linux machine (also running Ubuntu 22.04). But when running straight on my local Linux machine, I can't reproduce it. Note that in my Docker container the fault occurs 100%, i.e always.

I tried to write a minimal example, without using our product at all, see: https://github.com/etnt/erl-ssh-channels ,but that did not reproduce the problem.

I've introduced my patch in our three forked OTP repos; we are using our own forks of OTP-22/25/27. And it passes all our, quite extensive, product tests.

I can try and produce the full output from running ssh_dbg:on() but if that contain (hints) of customer data I can't attach it here.

etnt avatar Oct 29 '24 08:10 etnt

Thanks for providing more details, I think it is enough. I was just curious because it looked like a race condition and a situation when channel is closed and removed from cache earlier.

u3s avatar Oct 29 '24 09:10 u3s

related PR is merged. fix will be released in next emergency patches for supported versions. thanks for the report!

u3s avatar Oct 31 '24 09:10 u3s