sshj icon indicating copy to clipboard operation
sshj copied to clipboard

Fix false-alarm timeout exception when waiting for key exchange to complete

Open rasantel opened this issue 1 year ago • 0 comments

Problem

As described in https://github.com/hierynomus/sshj/issues/909, the SSHJ client may hit a timeout exception almost as soon as the main thread starts waiting for the key exchange to complete, even though the timeout has not been reached.

Why isn't this issue more widespread? I don't know :) I started hitting it when I enabled the heartbeater by setting the keepalive interval just before starting a connection.

Diagnosis

This issue is an existing race condition that doesn't actually require the heartbeater to be running, but apparently the presence of the heartbeater thread (at least on my production Ubuntu 20.04 linux systems) creates the conditions for this particular interleaving to be more likely to occur.

Based on the SSHJ logs and code, the issue can happen during the following sequence of events between the main thread and the reader thread:

[2023-11-06T21:59:52,561][DEBUG][net.schmizz.sshj.transport.KeyExchanger][sshj-Reader-localhost/127.0.0.1:22-1699307992560][KeyExchanger.handle] Received SSH_MSG_KEXINIT
[2023-11-06T21:59:52,561][DEBUG][net.schmizz.sshj.transport.KeyExchanger][sshj-Reader-localhost/127.0.0.1:22-1699307992560][KeyExchanger.startKex] Initiating key exchange
[2023-11-06T21:59:52,562][DEBUG][net.schmizz.concurrent.Promise][Test worker][Promise.tryRetrieve] Awaiting <<kex done>>
[2023-11-06T21:59:52,562][DEBUG][net.schmizz.concurrent.Promise][sshj-Reader-localhost/127.0.0.1:22-1699307992560][Promise.deliver] Setting <<kex done>> to `null`

What happens under the covers is:

  1. The key exchange is initiated by the reader thread when it receives SSH_MSG_KEXINIT from the server before the main thread has a chance to initiate that exchange.
  2. Before the reader thread calls done.clear() in KeyExchanger.startKex, the main thread also calls KeyExchanger.startKex, skips the key exchange initiation (because it's already ongoing) and calls done.await with a timeout to wait for the exchange to complete.
  3. done.await calls Promise.retrieve which calls Promise.tryRetrieve, which waits for the associated condition via cond.await.
  4. The reader thread calls done.clear(), which delivers a value of null to the done promise.
  5. This causes cond.await to wake up the main thread, and Promise.tryRetrieve returns this null value.
  6. Promise.retrieve throws a timeout exception because the retrieved value is null, even though there wasn't a timeout.

I can consistently reproduce this issue by attaching a debugger in order to force this particular thread interleaving. I set a breakpoint at the entrance of KeyExchanger.startKex, let both the main thread and the reader hit it, resume the reader thread (so it will initiate the exchange) but stop it right before done.clear, resume the main thread which will pause at Promise.retrieve, and then resume the reader so it calls done.clear, which delivers a null to Promise.retrieve on the main thread and thus triggers the exception.

Solution

One option would be to add extra synchronization to KeyExchanger.startKex to ensure that waitForDone, which calls done.await, is not called before done.clear.

The other option I see, which avoids extra synchronization and its drawbacks, is to think of Promise.clear as just the act of clearing the value to null, without actually "delivering" that null value to anyone waiting on that promise. This is what I am proposing in the current diff.

The only concern with doing the latter could be any side effects it may have on other uses of Promise.clear. I inspected the code and found one other use of clear, in UserAuthImpl.authenticate. In that method, clear is called before retrieve, so the call to retrieve will not be affected by this change in clear. Presumably, all calls to authenticate should happen in the same thread, but if somehow an application calls it from different threads, then this change fixes the same issue that we target in this PR for the key exchange; parallel calls to authenticate will no longer throw a spurious timeout exception in such an unlikely use case. The only other place where Promise.clear is called is Event.clear, and an inspection of the uses of Event.clear results in similar conclusions: calls to clear cannot occur while the event is waiting on its promise.

Testing

I repeated the same procedure that reproduces the bug. With this fix applied, the main thread at Promise.retrieve no longer receives a null from done.clear from the reader thread, and instead is awakened later when the reader actually delivers a non-null value via done.set, thus avoiding the exception.

rasantel avatar Nov 10 '23 21:11 rasantel