magic-wormhole-transit-relay

be more aggressive about closing TCP connections?

Open • warner opened this issue 4 years ago • 0 comments

When looking at the server, I see lingering connections all the time. These are TCP connections that have been around for days, but are not moving any traffic. It's not really a problem, but it's weird, and I'd like to understand what's going on (and if it reflects some sort of bug in the client).

The relay will close one side as soon as it receives a close from the other side, so the only way to remain in this state is for our kernel to think that both sides are still connected. TCP has a notoriously long no-new-traffic timeout (at least hours, but maybe effectively infinite), and we don't send any sort of keepalives on this channel. We can't: all the bytes are reserved for the two ends of the connection, and wormhole's Transit protocol doesn't do any keepalives (although the new Dilation protocol does). So if both sides got partitioned (say, each closed their laptop before the wormhole process exited), the server might not see the connections close for a very long time.

If the sender closes their laptop before the transfer completes, the receiver will be left hanging (the progress bar showing only a partial transfer). If/when the receiver kills the program, the receiver's socket will be closed, the server will get a FIN, and the server will shut down the sender's socket too. If the receiver instead closes their laptop before killing the program, both sides are now partitioned, and the server will see the sockets left open for a long time.

If the receiver closes their laptop first (before the transfer completes), backpressure works its way back to the sender: the server's outgoing kernel TCP buffer will fill with unacked data, so the server will pause; then the server's incoming kernel buffer will fill, and the server's TCP stack will stop ACKing inbound data; then the sender's outgoing buffer will fill, and the sender will pause. The sender will see a partial progress bar, and no further progress being made. Looking at the server, I'd see non-empty kernel buffers for the connections (which I don't think I've ever seen, at least not on connections that aren't making any progress at all). The server's kernel will retry the unacked outgoing TCP data, and when those retries are exhausted (which I think tends to take 5-10 minutes, maybe 15, but way less than the no-data-to-send case), the server will see a dropped connection, and will drop everything.
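
As a rough sanity check on that retry-timeout guess, here is a back-of-the-envelope calculation. It assumes the server is Linux with stock settings (`tcp_retries2` = 15, retransmission timeout starting at the 200 ms minimum, doubling per retry, capped at 120 s). The function name is made up for illustration, and the real figure depends on the measured RTT, so treat the result as a lower bound rather than a measurement.

```python
# Estimate how long a Linux kernel keeps retransmitting unacked data before
# declaring the connection dead. All defaults below are assumptions about a
# stock Linux configuration, not values read from the relay server.

def retransmit_give_up_seconds(retries=15, rto_min=0.2, rto_max=120.0):
    """Sum the exponential-backoff waits: the initial one plus one per retry."""
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)  # double each time, capped at rto_max
    return total


if __name__ == "__main__":
    seconds = retransmit_give_up_seconds()
    print(f"{seconds:.1f} s (~{seconds / 60:.1f} min)")  # 924.6 s (~15.4 min)
```

Under those assumptions the kernel gives up after about 15 minutes, which is at the upper end of the range guessed above.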

So I think the lingering connections I've observed must be from quiescent transfers, with no data being transferred at the time the partition happens. Or both sides are quiescent (transfer has finished) but they just forget to disconnect somehow.

Possible actions:

  • enable some TCP keepalive option, and hope it actually does something useful
  • have the server record how much data has been transferred (in each direction) on all sockets, at least temporarily, and find a way to correlate this with existing open sockets. So when I look at the server and see a lingering socket, I can find out whether the transfer hasn't started yet (zero bytes in both directions), has started/maybe-completed but is unacked (lots of bytes in one direction, zero bytes in the other), or has completed (lots of bytes in one direction, a small number for the ack in the other).
  • see if lingering relay connections are correlated with lingering mailbox connections, which might indicate clients that just forget to exit. The mailbox connections use websockets, which have their own ping/pong keepalive timeouts, so they'll tend to be closed more quickly in the event of partition
  • double-check that the new Dilation protocol sends periodic keepalives, and that it discovers partitions/laptop-closed in a timely fashion
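
For the first option, a minimal sketch of what "enable some TCP keepalive option" could look like on a plain Python socket. The helper name and the timing values are illustrative assumptions, not anything the relay currently does. The probes are generated by the kernel and carry no application data, so they don't consume any of the bytes reserved for the two ends of the connection.

```python
import socket


def enable_keepalive(sock, idle=600, interval=60, count=5):
    """Turn on kernel TCP keepalive probes for an existing TCP socket.

    idle, interval and count are illustrative values, not recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are platform-specific, so set each one
    # only when this platform's socket module exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the drop
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With these example values, a partitioned peer should be noticed after roughly 15 minutes of silence (10 minutes idle plus five probes a minute apart), instead of lingering for days. In a Twisted-based server, `ITCPTransport.setTcpKeepAlive(True)` sets `SO_KEEPALIVE` only; the timing options above would still have to be applied to the underlying socket or system-wide.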

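For the byte-counting option, the three cases could be told apart with something like the sketch below. Everything here is hypothetical: the `ByteCounts` record, the `classify` function, the state labels, and especially the `small_ack_limit` threshold (how many bytes still count as "just the ack") are assumptions that would need checking against real transfers.

```python
from dataclasses import dataclass


@dataclass
class ByteCounts:
    """Hypothetical per-connection-pair counters the relay would maintain."""
    a_to_b: int = 0  # bytes relayed from side A to side B
    b_to_a: int = 0  # bytes relayed from side B to side A


def classify(counts, small_ack_limit=1024):
    """Guess the transfer state of a lingering connection from its counters."""
    low, high = sorted((counts.a_to_b, counts.b_to_a))
    if high == 0:
        return "not-started"                  # zero bytes in both directions
    if low == 0:
        return "started-or-complete-unacked"  # data one way, nothing back
    if low <= small_ack_limit:
        return "complete-acked"               # data one way, a small ack back
    return "unclassified"                     # substantial traffic both ways


if __name__ == "__main__":
    print(classify(ByteCounts()))                   # not-started
    print(classify(ByteCounts(a_to_b=5_000_000)))   # started-or-complete-unacked
    print(classify(ByteCounts(5_000_000, 120)))     # complete-acked
```

Sorting the two counters keeps the check independent of which side happens to be the sender, since the relay doesn't necessarily know.
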
warner, Aug 29 '19 18:08