sshj
sshj copied to clipboard
False-alarm timeout exception when waiting for key exchange to complete
Occasionally, my sshj client hits a timeout exception almost as soon as the main thread starts waiting for the key exchange to complete, even though the timeout has not been reached.
Based on the logs and code, this can happen during the following sequence of events between the main thread and the reader thread:
[2023-11-06T21:59:52,561][DEBUG][net.schmizz.sshj.transport.KeyExchanger][sshj-Reader-localhost/127.0.0.1:22-1699307992560][KeyExchanger.handle] Received SSH_MSG_KEXINIT
[2023-11-06T21:59:52,561][DEBUG][net.schmizz.sshj.transport.KeyExchanger][sshj-Reader-localhost/127.0.0.1:22-1699307992560][KeyExchanger.startKex] Initiating key exchange
[2023-11-06T21:59:52,562][DEBUG][net.schmizz.concurrent.Promise][Test worker][Promise.tryRetrieve] Awaiting <<kex done>>
[2023-11-06T21:59:52,562][DEBUG][net.schmizz.concurrent.Promise][sshj-Reader-localhost/127.0.0.1:22-1699307992560][Promise.deliver] Setting <<kex done>> to `null`
This is a race condition. What happens under the covers is:
- The key exchange is initiated by the reader thread when it receives
SSH_MSG_KEXINIT
from the server before the main thread has a chance to initiate that exchange. - Before the reader thread calls
done.clear()
inKeyExchanger.startKex
, the main thread also callsKeyExchanger.startKex
, skips the key exchange initiation (because it's already ongoing) and callsdone.await
with a timeout to wait for the exchange to complete. -
done.await
callsPromise.retrieve
which callsPromise.tryRetrieve
, which waits for the associated condition viacond.await
. - The reader thread calls
done.clear()
, which delivers a value ofnull
to thedone
promise. - This causes
cond.await
to wake up the main thread, andPromise.tryRetrieve
returns thisnull
value. -
Promise.retrieve
throws a timeout exception because the retrieved value isnull
, even though there wasn't a timeout.
Please disregard the mention above from PR https://github.com/hierynomus/sshj/pull/911. I referenced the wrong issue there (I meant issue #910, not #909).
By the way, I'm looking into the code to see if I can propose a fix and raise a PR for #909.