mina-sshd
mina-sshd copied to clipboard
SshChannelNotFoundException: Received SSH_MSG_CHANNEL_WINDOW_ADJUST on unassigned channel 0 (last assigned=null)
Version
2.8.0
Bug description
We received a report from a Gerrit user at review.opendev.org who could not pull their repositories via SSH. Upon investigating, we could see a consistent error logged for them
SshChannelNotFoundException: Received SSH_MSG_CHANNEL_WINDOW_ADJUST on unassigned channel 0 (last assigned=null)
and then the connection appears to disconnect. This user was using openssh 9.1_p1-r3
This issue seems to have some history with Gerrit. It seems in [1] there was a report of high-latency connections causing a similar-but-different error after an upgrade to 1.7.0 or maybe 2.0.0
SshChannelNotFoundException: Received SSH_MSG_CHANNEL_WINDOW_ADJUST on unknown channel 0
If I'm understanding what happened here, there was no particular root cause found as to why this error was occurring. However, what happened was that a ChannelIdTrackingUnknownChannelReferenceHandler
implementation was added to sshd-contrib
and incorporated into gerrit with [2].
If I'm understanding what this does, it basically watches when channels are initalized and saves that channel number to a session variable LAST_CHANNEL_ID_KEY
. It then seems to basically ignore the ChannelIdTrackingUnknownChannelReference
error if that channel has ever been opened before (i.e. the channel raising the exception is < LAST_CHANNEL_ID_KEY, and assuming channels are opened sequentially).
I think the summary at that point might be "we don't really know why we're seeing adjustment messages for unassigned channels, but we're ok to ignore that if we know the channel was opened at some point"?
Gerrit enabled this by default, but left an undocumented escape-hatch of an enableChannelIdTracking
flag to turn it off; i.e. go back to the default state of raising an error for any messages for unassigned channels [3].
I think what we can see from this current error is that the client has sent this window adjustment message when mina seems to think no channel has ever been opened -- since the last assigned is null (on unassigned channel 0 (last assigned=null)
).
This seems quite weird, and possibly racy? I confess only a passing knowledge of the SSH protocol, but how would the remote end have thought that the channel was setup enough to send a window adjustment when the mina side appears to have never have made the call to the channelInitialized()
function here?
I should note this was again debugged to a high-latency, possibly unreliable connection. The user tried both ipv4 and ipv6 and could replicate the issue. When the switched to tethering via their phone, the problem did not occur. But it does seem to me that tcp/ip should keep what is coming across the wire in-order ...
Once identified by this user, upon inspecting the logs we noticed there were more connections exhibiting this behaviour. It seems to be heavily skewed to a few users that seem to have a lot of problems, but then we have this same message occurring once or twice for many more users.
I guess the question is -- is this a symptom of a connection that is so out of order with packet loss etc. that it can not be recovered; and this is just the first error that it happens to hit; or is this possibly some sort of race, where if this race didn't happen, the connection could be completed, even if slowly?
[1] https://issues.apache.org/jira/browse/SSHD-942 [2] https://gerrit-review.googlesource.com/c/gerrit/+/238384/9/java/com/google/gerrit/sshd/ChannelIdTrackingUnknownChannelReferenceHandler.java [3] https://gerrit-review.googlesource.com/c/gerrit/+/238384/9/java/com/google/gerrit/sshd/SshDaemon.java
Actual behavior
Connection closed unexpectedly
Expected behavior
Connection to work
Relevant log output
No response
Other information
No response
Ian: I don't know.
IIRC, openssh opens channels with an initial window size of zero, and only later, when the channel is set up, sends an initial window adjustment. That way it can be sure it doesn't get data while it is still setting up the channel. On receiving the SSH_MSG_CHANNEL_OPEN, the Apache MINA sshd server creates and registers its channel object, and then sends back a SSH_MSG_CHANNEL_OPEN_CONFIRMATION. By that time, the channel ID is known server side, and is registered in the session server side, and the openssh client should definitely not send its window adjustment before it gets that open confirmation.
Packet loss seems an unlikely cause. If the SSH_MSG_CHANNEL_OPEN were lost, the server wouldn't do anything and never send back an open confirmation, so openssh should never send a window adjustment. Likewise if the open confirmation was lost.
Could it be possible that there is a race between the server not receiving anything for some time, hitting some timeout and closing the session on that timeout, but just then getting the window adjustment after already having closed the previously opened channel? Perhaps Gerrit could also log channelClosed()
events in its ChannelIdTrackingUnknownChannelReferenceHandler
? I do notice that LAST_CHANNEL_ID_KEY
is public -- are we 100% sure that no other code removes this key? When a session is closed, the attributes are cleared; so that might be indeed an indication that the session is already closed when the window adjustment is about to be handled.
Debug logs might tell more. (And confirm or debunk that theory.)
Thanks for looking!
Could it be possible that there is a race between the server not receiving anything for some time, hitting some timeout and closing the session on that timeout, but just then getting the window adjustment after already having closed the previously opened channel? Perhaps Gerrit could also log
channelClosed()
events in itsChannelIdTrackingUnknownChannelReferenceHandler
?
Does the change https://gerrit-review.googlesource.com/c/gerrit/+/357694 look about right to you? We can run with something like this (it will take me a bit to patch and deploy it, but we are seeing it regularly, so I think we have a high chance of hitting it).
I do notice that
LAST_CHANNEL_ID_KEY
is public -- are we 100% sure that no other code removes this key?
I would say as close to 100% as possible :) Certainly there's nothing else touching it from inspecting the gerrit source I can see.
Debug logs might tell more. (And confirm or debunk that theory.)
I can turn up com.google.gerrit.sshd.ChannelIdTrackingUnknownChannelReferenceHandler: INFO
-- I think that needs the above change to be helpful and log the other events. Is there any other handler you'd suggest helpful too?
Left a +1 and comment in that Gerrit change. If we can get these log messages, that's good enough for starters.
After quite a bit of head scratching over why none of the messages in https://gerrit-review.googlesource.com/c/gerrit/+/357694 were appearing, I believe that the original Gerrit change https://gerrit-review.googlesource.com/c/gerrit/+/238384 didn't also register the unknown-reference handler as a regular channel-listener.
This is explained in the docs @ https://github.com/apache/mina-sshd/blob/master/docs/event-listeners.md#unknownchannelreferencehandler
I have prepared a gerrit change that fixes this with https://gerrit-review.googlesource.com/c/gerrit/+/358314
I think this explains the null
in SshChannelNotFoundException: Received SSH_MSG_CHANNEL_WINDOW_ADJUST on unassigned channel 0 (last assigned=null)
. The channelInitialized
callback was never made, and so the last-assigned value for the the channel was never set, and the workaround of having the channel messages ignored if the channel has been opened is never applied.
ISTM the result of this is we were probably really just seeing the original issue, just with a different log message?
I've put the debugging comments change https://gerrit-review.googlesource.com/c/gerrit/+/357694 into WIP state, as I think we probably want to try running for a while with the workaround actually applied.