SSH: missing entry in channel cache
https://github.com/erlang/otp/blob/6d4731b9b5c70686bfaea67f2cd6f4f912e6da06/lib/ssh/src/ssh_connection.erl#L1029
Look at the line above, where the code tries to access Channel#channel.remote_id.
I seem to have a case where Channel turns up as undefined, which causes a crash here.
I instrumented the ssh_connection.erl code like this:
handle_msg(#ssh_msg_channel_request{recipient_channel = ChannelId,
                                    request_type = "exit-signal",
                                    want_reply = false,
                                    data = Data},
           #connection{channel_cache = Cache} = Connection0, _) ->
    <<?DEC_BIN(SigName, _SigLen),
      ?BOOLEAN(_Core),
      ?DEC_BIN(Err, _ErrLen),
      ?DEC_BIN(Lang, _LangLen)>> = Data,
    Channel = ssh_client_channel:cache_lookup(Cache, ChannelId),
    io:format(">>> ~p(~p): ChannelId=~p , Channel=~p~n", [?MODULE, ?LINE, ChannelId, Channel]),
    io:format(">>> ~p(~p): CHANNEL-CACHE:~n~p~n", [?MODULE, ?LINE, ets:tab2list(Cache)]),
    RemoteId = Channel#channel.remote_id,
    io:format(">>> ~p(~p): RemoteId=~p~n", [?MODULE, ?LINE, RemoteId]),
    {Reply, Connection} = reply_msg(Channel, Connection0,
                                    {exit_signal, ChannelId,
                                     binary_to_list(SigName),
                                     binary_to_list(Err),
                                     binary_to_list(Lang)}),
    CloseMsg = channel_close_msg(RemoteId),
    {[{connection_reply, CloseMsg}|Reply], Connection};
When running, I see this:
>>> ssh_connection(605): ChannelId=1 , Channel=undefined
>>> ssh_connection(606): CHANNEL-CACHE:
[{channel,"session","subsystem",<0.852.0>,undefined,0,647044,8316,32768,false,
0,2096622,32768,false,
{[],[]}}]
Note: Channel is undefined, so the code appears to crash at the record field access of remote_id, since we never get the last printout of RemoteId.
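As a minimal illustration of the failure mode (using a stand-in record defined in the shell, not the real #channel record from the ssh application): any record field access on a non-record value raises badrecord. Newer OTP releases report the offending value, older ones the expected record name, which matches the {badrecord,undefined} and {badrecord,channel} errors in the traces further down.

1> rd(chan, {remote_id}).       %% stand-in record, defined only in the shell
chan
2> Channel = undefined.         %% what cache_lookup/2 returns on a miss
undefined
3> Channel#chan.remote_id.
** exception error: {badrecord,undefined}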
A little about my setup, which is a bit complicated...
- I'm setting up an SSH connection from one Erlang node to another (well, not quite; see below).
- The connection is not "direct"; it is set up toward an external OpenSSH server, and we run a NETCONF subsystem module (which in turn talks TCP toward the other Erlang node).
- We make a Long Running request over the connection on Channel(0) and then a Short Running request over the same connection using Channel(1); see the sketch below. It is when the Short Running request finishes that this happens, causing Channel(0) to also be closed and thus interrupting the Long Running request.
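Roughly, the channel setup looks like this; a minimal sketch with placeholder host, port, and options, and "netconf" as an assumed subsystem name:

%% One SSH connection, two session channels running the subsystem.
%% All literals below are placeholders, not our real configuration.
{ok, Conn} = ssh:connect("openssh-host", 22, [{user, "admin"},
                                              {silently_accept_hosts, true}]),
{ok, Ch0}  = ssh_connection:session_channel(Conn, 5000),
success    = ssh_connection:subsystem(Conn, Ch0, "netconf", 5000),  % Long Running
{ok, Ch1}  = ssh_connection:session_channel(Conn, 5000),
success    = ssh_connection:subsystem(Conn, Ch1, "netconf", 5000).  % Short Running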
Perhaps a poor description, but in any case: the fact that the code can crash when nothing is found in the channel cache doesn't seem right, or does it?
I can add that in the body of that exit-signal message I get "PIPE", which corresponds to one of the signal names in RFC 4254.
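For reference, that payload follows the RFC 4254 "exit-signal" wire format that the ?DEC_BIN/?BOOLEAN macros above unpack. Spelled out as a plain binary match it would look like this (decode_exit_signal/1 is an illustrative helper, not part of the ssh application):

%% Sketch: decode the exit-signal payload by hand.
decode_exit_signal(<<SigLen:32, SigName:SigLen/binary,
                     _CoreDumped:8,
                     ErrLen:32, Err:ErrLen/binary,
                     LangLen:32, Lang:LangLen/binary>>) ->
    {binary_to_list(SigName), binary_to_list(Err), binary_to_list(Lang)}.

%% decode_exit_signal(<<0,0,0,4,80,73,80,69,0,0,0,0,0,0,0,0,0>>)
%% gives {"PIPE",[],[]}, i.e. signal "PIPE" with empty error and language.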
Yes, the code looks like it might crash there.
But I would like to understand your case better. You wrote:
"It is when the Short Running request finishes that this happens, causing Channel(0) to also be closed and thus interrupting the Long Running request."
Is this expected? Why does it happen?
Can you collect and provide ssh_dbg traces for that situation?
You can enable the interesting content with:
ssh_dbg:on([channels, channel_event])
or ssh_dbg:on() to collect all dbg output (if it does not produce too much data)
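A trace session might look like this (a sketch; ssh_dbg prints to standard output by default):

1> ssh_dbg:on([channels, channel_event]).
%% ... reproduce the failing request ...
2> ssh_dbg:off().    %% switch tracing off again when done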
To answer your first question: the Long Running request over Channel(0) should not be terminated like this. It should be able to finish on its own.
Ok, and I forgot to mention, I'm running OTP-22:
# erl
Erlang/OTP 22 [erts-10.7.2.9] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]
But that cache lookup code looks the same as in the latest OTP, doesn't it?
Anyway, I ran ssh_dbg:on(), which produced a lot of output, but here is an interesting part of it; note the internal error:
16:07:30.101501 <0.855.0> Received SSH_MSG_CHANNEL_REQUEST:
recipient_channel: 1
request_type: "exit-signal"
want_reply: false
data: <<0,0,0,4,80,73,80,69,0,0,0,0,0,0,0,0,0>>
16:07:30.101503 <0.855.0> Connection event result
{keep_state,"#data{}",
[{next_event,internal,prepare_next_packet},
{next_event,internal,
{conn_msg,
{ssh_msg_channel_request,1,"exit-signal",false,
<<0,0,0,4,80,73,80,69,0,0,0,0,0,0,0,0,0>>}}}]}
16:07:30.101528 <0.855.0> Connection event
EventType: internal
EventContent: prepare_next_packet
State: {connected,client}
16:07:30.101559 <0.855.0> Connection event result
keep_state_and_data
16:07:30.101563 <0.855.0> Connection event
EventType: internal
EventContent: {conn_msg,
{ssh_msg_channel_request,1,"exit-signal",false,
<<0,0,0,4,80,73,80,69,0,0,0,0,0,0,0,0,0>>}}
State: {connected,client}
16:07:30.101804 <0.855.0> Disconnecting:
Module = ssh_connection_handler, Line = 1186, StateName = {connected,client},
Code = 11, Reason = "By application",
DetailedText =
"Internal error: error:{badrecord,channel}"
16:07:30.101835 <0.855.0> Going to send SSH_MSG_DISCONNECT:
code: 11
description: "By application"
language: undefined
16:07:30.101843 <0.855.0> Going to send plain text bytes (shown before encryption):
<<1,0,0,0,11,0,0,0,14,66,121,32,97,112,112,108,105,99,97,116,105,111,110,0,0,0,
0>>
16:07:30.101912 <0.858.0> Server Channel (Id=0, CB=netconf_ssh_client) got info:
Info: {ssh_cm,<0.855.0>,{closed,0}}
FYI: I made a quick fix, ignoring the cache miss, and then my code seems to run as expected:
handle_msg(#ssh_msg_channel_request{recipient_channel = ChannelId,
                                    request_type = "exit-signal",
                                    want_reply = false,
                                    data = Data},
           #connection{channel_cache = Cache} = Connection0, _) ->
    <<?DEC_BIN(SigName, _SigLen),
      ?BOOLEAN(_Core),
      ?DEC_BIN(Err, _ErrLen),
      ?DEC_BIN(Lang, _LangLen)>> = Data,
    Channel = ssh_client_channel:cache_lookup(Cache, ChannelId),
    io:format(">>> ~p(~p): ChannelId=~p , Channel=~p~n", [?MODULE, ?LINE, ChannelId, Channel]),
    io:format(">>> ~p(~p): CHANNEL-CACHE:~n~p~n", [?MODULE, ?LINE, ets:tab2list(Cache)]),
    case Channel of
        #channel{remote_id = RemoteId} ->
            io:format(">>> ~p(~p): RemoteId=~p~n", [?MODULE, ?LINE, RemoteId]),
            {Reply, Connection} = reply_msg(Channel, Connection0,
                                            {exit_signal, ChannelId,
                                             binary_to_list(SigName),
                                             binary_to_list(Err),
                                             binary_to_list(Lang)}),
            CloseMsg = channel_close_msg(RemoteId),
            {[{connection_reply, CloseMsg}|Reply], Connection};
        _ ->
            %% Ignore the cache miss!
            {[], Connection0}
    end;
Can you verify the behavior with the current OTP-27 version (or at least OTP-26 or OTP-25)? That would allow me to increase the priority of the issue.
Ok, I'll try.
Ok, I have now tried it with this OTP version:
# erl
Erlang/OTP 27 [erts-15.1.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]
Running with ssh_dbg:on() I see this:
09:12:55.025229 <0.975.0> TCP stream data arrived
State: {connected,client}
Socket: #Port<0.63>
TcpData:
Ad | Hexdump | ASCII
---+--------------------------------------------------+-----------------
00 | 00 00 00 30 e3 c3 67 db bc 61 c6 c3 56 4c ee 0a | ...0..g..a..VL..
10 | 79 64 ad 6e 3f a9 0d 5d 48 cf 69 75 42 48 fa b6 | yd.n?..]H.iuBH..
20 | 6d 7f 3b 47 e9 2e 83 a0 67 14 52 2c 3b 8b c1 8e | m.;G....g.R,;...
30 | |
---- skip 56 bytes----
09:12:55.025299 <0.975.0> Received plain text bytes (shown after decryption):
<<98,0,0,0,1,0,0,0,11,101,120,105,116,45,115,105,103,110,97,108,0,0,0,0,4,80,
73,80,69,0,0,0,0,0,0,0,0,0>>
09:12:55.025307 <0.975.0> Received SSH_MSG_CHANNEL_REQUEST:
recipient_channel: 1
request_type: "exit-signal"
want_reply: false
data: <<0,0,0,4,80,73,80,69,0,0,0,0,0,0,0,0,0>>
09:12:55.025364 <0.975.0> Disconnecting:
Module = ssh_connection_handler, Line = 833, StateName = {connected,client},
Code = 11, Reason = "By application",
DetailedText =
"Internal error: error:{badrecord,undefined}"
09:12:55.025399 <0.975.0> Going to send SSH_MSG_DISCONNECT:
code: 11
description: "By application"
language: undefined
09:12:55.025405 <0.975.0> Going to send plain text bytes (shown before encryption):
<<1,0,0,0,11,0,0,0,14,66,121,32,97,112,112,108,105,99,97,116,105,111,110,0,0,0,
0>>
09:12:55.025455 <0.980.0> Server Channel (Id=0, CB=netconf_ssh_client) got info:
Info: {ssh_cm,<0.975.0>,{closed,0}}
09:12:55.025637 <0.975.0> Connection Terminating:
Reason: {shutdown,"By application"}, StateName: {connected,client}
Peer=127.0.0.1:22, Local=127.0.0.1:57612, User="admin"
After applying this fix:
diff --git a/lib/ssh/src/ssh_connection.erl b/lib/ssh/src/ssh_connection.erl
index 12ccc66..90fb46f 100644
--- a/lib/ssh/src/ssh_connection.erl
+++ b/lib/ssh/src/ssh_connection.erl
@@ -1025,15 +1025,18 @@ handle_msg(#ssh_msg_channel_request{recipient_channel = ChannelId,
       ?BOOLEAN(_Core),
       ?DEC_BIN(Err, _ErrLen),
       ?DEC_BIN(Lang, _LangLen)>> = Data,
-    Channel = ssh_client_channel:cache_lookup(Cache, ChannelId),
-    RemoteId = Channel#channel.remote_id,
-    {Reply, Connection} = reply_msg(Channel, Connection0,
-                                    {exit_signal, ChannelId,
-                                     binary_to_list(SigName),
-                                     binary_to_list(Err),
-                                     binary_to_list(Lang)}),
-    CloseMsg = channel_close_msg(RemoteId),
-    {[{connection_reply, CloseMsg}|Reply], Connection};
+    case ssh_client_channel:cache_lookup(Cache, ChannelId) of
+        #channel{remote_id = RemoteId} = Channel ->
+            {Reply, Connection} = reply_msg(Channel, Connection0,
+                                            {exit_signal, ChannelId,
+                                             binary_to_list(SigName),
+                                             binary_to_list(Err),
+                                             binary_to_list(Lang)}),
+            CloseMsg = channel_close_msg(RemoteId),
+            {[{connection_reply, CloseMsg}|Reply], Connection};
+        _ ->
+            {[], Connection0}
+    end;
My long running request runs uninterrupted until its successful completion (as expected).
I guess there is some race condition in your case that empties the cache earlier. Can you provide a bit more of the dbg output to better illustrate what happens before the error?
What are the conditions when this error happens? Is it 100% reproducible?
Thanks for the code change proposal, we will test it now.
Well, it isn't easy to reproduce. I've got a reproduction setup involving customer data, but it only reproduces the problem when running in a Docker container (running Ubuntu 22.04) on my local Linux machine (also running Ubuntu 22.04). When running straight on my local Linux machine, I can't reproduce it. Note that in my Docker container the fault occurs 100% of the time, i.e. always.
I tried to write a minimal example, without using our product at all, see: https://github.com/etnt/erl-ssh-channels , but that did not reproduce the problem.
I've introduced my patch in our three forked OTP repos (we are using our own forks of OTP-22/25/27), and it passes all of our quite extensive product tests.
I can try to produce the full output from running ssh_dbg:on(), but if it contains (hints of) customer data I can't attach it here.
Thanks for providing more details, I think it is enough. I was just curious, because it looked like a race condition and a situation where the channel is closed and removed from the cache earlier.
The related PR is merged. The fix will be released in the next emergency patches for supported versions. Thanks for the report!