Implement pluggable retry logic at raw transport level
Some types of retry to consider: https://docs.microsoft.com/en-us/azure/architecture/patterns/retry
EternalTerminal implemented a BackedReader and BackedWriter: https://eternalterminal.dev/howitworks/
Current thought is to have a trait that both the raw and higher-level transport traits will require to be implemented, in the form of:
```rust
use async_trait::async_trait;
use std::io;

/// Implementors need to support the read and write halves
/// performing a reconnect separately.
#[async_trait]
pub trait Reconnectable {
    async fn reconnect(&mut self) -> io::Result<()>;
}
```
In terms of how the separate read and write halves might reconnect, I can imagine an Arc<RwLock<...>> around the underlying transport, where each half talks to a facilitator of reconnecting that hands out the read and write halves once a connect has finished, possibly using a channel. The facilitator could keep track of which half last asked for a reconnect so that it only reconnects if asked by the same half again; if the write half asks right after the read half's reconnect, it would be handed the result of that reconnect rather than triggering a second one.
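A minimal sketch of that dedup logic, assuming a hypothetical `Reconnector` facilitator and a caller-supplied connect future (all names here are made up for illustration):

```rust
use std::{future::Future, io};
use tokio::sync::Mutex;

/// Tracks a generation counter so that when both halves fail from the same
/// outage, only one reconnect actually runs; the second half just observes
/// the already-bumped generation.
pub struct Reconnector {
    generation: Mutex<u64>,
}

impl Reconnector {
    /// A half calls this with the generation it last observed, plus a future
    /// that performs the actual reconnect (e.g. swapping the underlying
    /// transport). If the other half already reconnected since `seen`, the
    /// future is dropped unused.
    pub async fn reconnect<F>(&self, seen: u64, connect: F) -> io::Result<u64>
    where
        F: Future<Output = io::Result<()>>,
    {
        let mut generation = self.generation.lock().await;
        if *generation == seen {
            connect.await?;
            *generation += 1;
        }
        Ok(*generation)
    }
}
```

Each half would store the generation returned here; getting back a newer generation than it passed in means the other half already reconnected on its behalf.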
References for what a Retryable interface could look like (a rough sketch follows this list):
- Spring's Retryable annotation
- Rust Retry crate for logic
- Tokio version of retry logic
- General purpose Rust networking library, Tower, has retry support w/ a Policy trait
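As a rough sketch of what such an interface could look like here, not taken from any of the libraries above (the `RetryPolicy` trait and `ExponentialBackoff` names are assumptions):

```rust
use std::{io, time::Duration};

pub trait RetryPolicy {
    /// Given the attempt number and the error, decide whether to retry and
    /// how long to wait before the next attempt; None means give up.
    fn should_retry(&mut self, attempt: u32, err: &io::Error) -> Option<Duration>;
}

/// Exponential backoff capped at a maximum delay and attempt count.
pub struct ExponentialBackoff {
    pub base: Duration,
    pub max_delay: Duration,
    pub max_attempts: u32,
}

impl RetryPolicy for ExponentialBackoff {
    fn should_retry(&mut self, attempt: u32, _err: &io::Error) -> Option<Duration> {
        if attempt >= self.max_attempts {
            return None;
        }
        let delay = self.base.saturating_mul(2u32.saturating_pow(attempt));
        Some(delay.min(self.max_delay))
    }
}
```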
EternalTerminal implements resume by having writers store the last 64 * 64 * 1024 bytes of data (in frames) and keep a packet counter that is incremented in lockstep with the reader, so the last N frames of data can be replayed.
Once that cap is reached, whenever a new frame is added, the oldest packets are dropped until the total stored size is back under the maximum byte size.
The server looks up which session to resume using a client id, and the writers on both sides (server and client) replay any remaining data at the same time.
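A sketch of that storage, with hypothetical names and the byte cap from the description above:

```rust
use std::collections::VecDeque;

// Byte cap taken from the EternalTerminal description above.
const MAX_STORED_BYTES: usize = 64 * 64 * 1024;

#[derive(Default)]
pub struct ReplayBuffer {
    frames: VecDeque<Vec<u8>>,
    stored_bytes: usize,
    /// Incremented once per frame, mirroring the counter kept by the reader.
    sequence: u64,
}

impl ReplayBuffer {
    /// Record a written frame, evicting the oldest frames until the total
    /// stored size is back under the cap.
    pub fn push(&mut self, frame: Vec<u8>) {
        self.stored_bytes += frame.len();
        self.frames.push_back(frame);
        self.sequence += 1;
        while self.stored_bytes > MAX_STORED_BYTES {
            match self.frames.pop_front() {
                Some(old) => self.stored_bytes -= old.len(),
                None => break,
            }
        }
    }

    /// Frames the remote has not yet seen, given the sequence counter it
    /// reports during the resume handshake.
    pub fn frames_since(&self, remote_sequence: u64) -> impl Iterator<Item = &Vec<u8>> {
        let missed = self.sequence.saturating_sub(remote_sequence) as usize;
        self.frames.iter().skip(self.frames.len().saturating_sub(missed))
    }
}
```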
To support this, we would need a handshake when transports first connect (or reconnect) to check for a matching id and exchange data. The client side is simple: the reconnect function swaps out the reader and writer for new ones (we would need to throw out the old tasks and spawn new ones) while maintaining the same client id.
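The handshake messages might look something like the following (purely hypothetical shapes, since the actual wire format isn't decided):

```rust
/// Hypothetical messages for the (re)connect handshake; the real wire
/// format would be defined by the transport's codec.
pub enum Handshake {
    /// Client announces its id (None on a fresh connection).
    Hello { client_id: Option<u64> },
    /// Server recognizes the id and reports the last frame sequence it
    /// received, so each side can replay what the other is missing.
    Resume { client_id: u64, last_received_sequence: u64 },
    /// Server does not recognize the id (e.g. its cache expired), so the
    /// client starts a fresh session.
    Fresh { client_id: u64 },
}
```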
The server would need to have some cache storage that maintains N frames per client. The tricky part, which I don't know if/how EternalTerminal handles, is invalidating that cache. I think the easiest thing to do is something similar to the shutdown logic: an async task deletes a client id from the internal storage ONLY if it has not received an active connection from that client within some duration D.
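That invalidation task could look roughly like this (assuming a shared map keyed by client id; the `Cache` alias and `spawn_invalidator` are made-up names):

```rust
use std::{collections::HashMap, sync::Arc, time::Duration};
use tokio::{sync::Mutex, time::Instant};

/// Client id mapped to the time of its last active connection; real storage
/// would also hold the per-client replay frames.
type Cache = Arc<Mutex<HashMap<u64, Instant>>>;

/// Periodically drop any client that has had no active connection within
/// `ttl`, mirroring the existing shutdown logic.
fn spawn_invalidator(cache: Cache, ttl: Duration) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(ttl / 2);
        loop {
            interval.tick().await;
            let now = Instant::now();
            cache
                .lock()
                .await
                .retain(|_, last_seen| now.duration_since(*last_seen) < ttl);
        }
    })
}
```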
This changes how transports work in that the codec used would need to support being resumable. My thought is that we don't need to change the plain/XChaCha20Poly1305 codecs, but instead provide another codec that wraps an existing codec, stores the last N frames similar to EternalTerminal, and implements a trait like Resumable that, when invoked, switches the codec to listen for a sequence counter, replays the last N frames if there is a difference, and then switches back to usual processing.
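A sketch of that wrapper, reusing the `ReplayBuffer` from the earlier sketch (the `Codec` signatures here are assumptions; the real ones would match whatever the existing codecs implement):

```rust
use std::io;

// Assumed shape of the existing codec interface.
pub trait Codec {
    fn encode(&mut self, frame: &[u8]) -> io::Result<Vec<u8>>;
    fn decode(&mut self, frame: &[u8]) -> io::Result<Vec<u8>>;
}

pub trait Resumable {
    /// Given the sequence counter reported by the remote side, return the
    /// frames it missed so they can be replayed before normal processing.
    fn resume(&mut self, remote_sequence: u64) -> Vec<Vec<u8>>;
}

/// Wraps any codec, recording encoded frames (as in the ReplayBuffer sketch
/// above) so they can be replayed after a reconnect.
pub struct ResumableCodec<C: Codec> {
    inner: C,
    buffer: ReplayBuffer,
}

impl<C: Codec> Codec for ResumableCodec<C> {
    fn encode(&mut self, frame: &[u8]) -> io::Result<Vec<u8>> {
        let encoded = self.inner.encode(frame)?;
        self.buffer.push(encoded.clone());
        Ok(encoded)
    }

    fn decode(&mut self, frame: &[u8]) -> io::Result<Vec<u8>> {
        self.inner.decode(frame)
    }
}

impl<C: Codec> Resumable for ResumableCodec<C> {
    fn resume(&mut self, remote_sequence: u64) -> Vec<Vec<u8>> {
        self.buffer.frames_since(remote_sequence).cloned().collect()
    }
}
```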
From there, the transport traits would have their own Resumable variant that can be implemented to provide reconnect. The individual implementors such as TCPTransport, UnixSocketTransport, and WindowsPipeTransport would then implement it to support swapping out their underlying sockets or read/write halves for new ones. Alternatively, if we're spawning tasks that use the readers and writers, we would need to maintain something else in the client and server that lets us recreate them.
Regardless, during reconnect, the implementation would also trigger the codec's resume behavior if the codec implements that trait. Since we cannot use specialization yet to detect this, we may just need to require Client and Server to only accept transports whose codec implements Resumable.
I think that to make this clean, the transport interfaces need to be rewritten to use tokio-style ready, try_read, and try_write methods (checking Ready::is_readable and Ready::is_writable) so we don't need to split the transports.
The reason is that this will simplify the reconnect logic. It will also reduce how many structs we have and remove the complications of managing separate read and write halves.
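The interface might end up shaped like this (a sketch only; the trait name and exact signatures are assumptions, mirroring tokio's TcpStream methods):

```rust
use async_trait::async_trait;
use std::io;
use tokio::io::{Interest, Ready};

/// A sketch of an unsplit transport interface mirroring tokio's
/// non-blocking socket methods.
#[async_trait]
pub trait Transport {
    /// Wait until the transport is ready for the given interest, returning
    /// which operations (Ready::is_readable / Ready::is_writable) can proceed.
    async fn ready(&self, interest: Interest) -> io::Result<Ready>;

    /// Non-blocking read; Err(WouldBlock) means no data is available yet.
    fn try_read(&self, buf: &mut [u8]) -> io::Result<usize>;

    /// Non-blocking write; Err(WouldBlock) means the transport is not writable.
    fn try_write(&self, buf: &[u8]) -> io::Result<usize>;
}
```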
Based on the rewrite of Transport and the removal of TypedTransport and UntypedTransport, we can pack this logic into FramedTransport. The struct will contain an incoming and an outgoing bytes buffer.
Similar to the loops described in tokio's codec documentation, try_read will continue to fill the incoming buffer in a loop until we either get a WouldBlock, an Ok(0), or a successfully-decoded frame. Likewise, for writing a frame, we will loop invoking try_write until we get a WouldBlock, an Ok(0), or have finished writing all of the bytes.
Because of the write logic above, we'd also want a flush method to ensure that the remainder of the buffered data is sent.
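The read half of that, as a sketch building on the Transport interface above (`decode` stands in for the codec and returns Ok(None) until a full frame has accumulated in the buffer):

```rust
use bytes::BytesMut;
use std::io;

/// Fill the incoming buffer until we hit WouldBlock, a closed connection
/// (Ok(0)), or a successfully-decoded frame.
fn try_read_frame(
    transport: &impl Transport,
    incoming: &mut BytesMut,
    mut decode: impl FnMut(&mut BytesMut) -> io::Result<Option<Vec<u8>>>,
) -> io::Result<Option<Vec<u8>>> {
    let mut buf = [0u8; 8192];
    loop {
        // First see if the bytes we already have form a complete frame.
        if let Some(frame) = decode(incoming)? {
            return Ok(Some(frame));
        }
        match transport.try_read(&mut buf) {
            // Ok(0) means the connection was closed by the other side.
            Ok(0) => return Err(io::ErrorKind::UnexpectedEof.into()),
            Ok(n) => incoming.extend_from_slice(&buf[..n]),
            // Nothing more to read right now; the caller awaits readiness
            // via `ready()` and tries again.
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => return Ok(None),
            Err(e) => return Err(e),
        }
    }
}
```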
All of the above has been implemented except for the actual retry logic. The reconnect method has been implemented and tested; it performs a handshake, re-authenticates, and synchronizes state.
As for the retry logic, it will live in distant-net::Client. I think having a heartbeat as an option for client/server in order to determine whether a reconnect is needed would also be good. Otherwise, the client can receive a policy that determines how often it retries once a failure (including a disconnect) is detected in the main task loop. Rather than exiting, the loop will invoke the retry logic, and the task should only exit once the client will no longer try to reconnect.
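Roughly, that loop could look like this (the Client methods here are hypothetical stand-ins; a real version would consult something like the RetryPolicy sketched earlier instead of a hard-coded backoff):

```rust
use std::{io, time::Duration};

// Hypothetical stand-ins for the real client type and its methods.
struct Client;
impl Client {
    /// Runs the main task loop until a failure or a clean shutdown.
    async fn run_until_failure(&mut self) -> io::Result<()> { Ok(()) }
    /// Handshake, re-authenticate, and synchronize state (already implemented).
    async fn reconnect(&mut self) -> io::Result<()> { Ok(()) }
}

/// Main task loop: on failure, retry instead of exiting; the task only
/// ends once we will no longer try to reconnect.
async fn run(mut client: Client, base: Duration, max_attempts: u32) {
    loop {
        if client.run_until_failure().await.is_ok() {
            break; // clean shutdown requested
        }
        let mut attempt = 0;
        loop {
            if attempt >= max_attempts {
                return; // policy exhausted: give up and let the task exit
            }
            tokio::time::sleep(base.saturating_mul(2u32.saturating_pow(attempt))).await;
            if client.reconnect().await.is_ok() {
                break; // reconnected; resume the main loop
            }
            attempt += 1;
        }
    }
}
```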
If the client is quitting, we want to cancel all futures found in our postoffice.
Also see https://thuc.space/posts/retry_strategies/