sshj
sshj copied to clipboard
Fix false-alarm timeout exception when waiting for key exchange to complete
Problem
As described in https://github.com/hierynomus/sshj/issues/909, the SSHJ client may hit a timeout exception almost as soon as the main thread starts waiting for the key exchange to complete, even though the timeout has not been reached.
Why isn't this issue more widespread? I don't know :) I started hitting it when I enabled the heartbeater by setting the keepalive interval just before starting a connection.
Diagnosis
This issue is an existing race condition that doesn't actually require the heartbeater to be running, but apparently the presence of the heartbeater thread (at least on my production Ubuntu 20.04 linux systems) creates the conditions for this particular interleaving to be more likely to occur.
Based on the SSHJ logs and code, the issue can happen during the following sequence of events between the main thread and the reader thread:
[2023-11-06T21:59:52,561][DEBUG][net.schmizz.sshj.transport.KeyExchanger][sshj-Reader-localhost/127.0.0.1:22-1699307992560][KeyExchanger.handle] Received SSH_MSG_KEXINIT
[2023-11-06T21:59:52,561][DEBUG][net.schmizz.sshj.transport.KeyExchanger][sshj-Reader-localhost/127.0.0.1:22-1699307992560][KeyExchanger.startKex] Initiating key exchange
[2023-11-06T21:59:52,562][DEBUG][net.schmizz.concurrent.Promise][Test worker][Promise.tryRetrieve] Awaiting <<kex done>>
[2023-11-06T21:59:52,562][DEBUG][net.schmizz.concurrent.Promise][sshj-Reader-localhost/127.0.0.1:22-1699307992560][Promise.deliver] Setting <<kex done>> to `null`
What happens under the covers is:
- The key exchange is initiated by the reader thread when it receives
SSH_MSG_KEXINIT
from the server before the main thread has a chance to initiate that exchange. - Before the reader thread calls
done.clear()
inKeyExchanger.startKex
, the main thread also callsKeyExchanger.startKex
, skips the key exchange initiation (because it's already ongoing) and callsdone.await
with a timeout to wait for the exchange to complete. -
done.await
callsPromise.retrieve
which callsPromise.tryRetrieve
, which waits for the associated condition viacond.await
. - The reader thread calls
done.clear()
, which delivers a value ofnull
to thedone
promise. - This causes
cond.await
to wake up the main thread, andPromise.tryRetrieve
returns thisnull
value. -
Promise.retrieve
throws a timeout exception because the retrieved value isnull
, even though there wasn't a timeout.
I can consistently reproduce this issue by attaching a debugger in order to force this particular thread interleaving. I set a breakpoint at the entrance of KeyExchanger.startKex
, let both the main thread and the reader hit it, resume the reader thread (so it will initiate the exchange) but stop it right before done.clear
, resume the main thread which will pause at Promise.retrieve
, and then resume the reader so it calls done.clear
, which delivers a null
to Promise.retrieve
on the main thread and thus triggers the exception.
Solution
One option would be to add extra synchronization to KeyExchanger.startKex
to ensure that waitForDone
, which calls done.await
, is not called before done.clear
.
The other option I see, which avoids extra synchronization and its drawbacks, is to think of Promise.clear
as just the act of clearing the value to null
, without actually "delivering" that null
value to anyone waiting on that promise. This is what I am proposing in the current diff.
The only concern with doing the latter could be any side effects it may have on other uses of Promise.clear
. I inspected the code and found one other use of clear
, in UserAuthImpl.authenticate
. In that method, clear
is called before retrieve
, so the call to retrieve
will not be affected by this change in clear
. Presumably, all calls to authenticate
should happen in the same thread, but if somehow an application calls it from different threads, then this change fixes the same issue that we target in this PR for the key exchange; parallel calls to authenticate
will no longer throw a spurious timeout exception in such an unlikely use case. The only other place where Promise.clear
is called is Event.clear
, and an inspection of the uses of Event.clear
results in similar conclusions: calls to clear
cannot occur while the event is waiting on its promise.
Testing
I repeated the same procedure that reproduces the bug. With this fix applied, the main thread at Promise.retrieve
no longer receives a null
from done.clear
from the reader thread, and instead is awakened later when the reader actually delivers a non-null value via done.set
, thus avoiding the exception.