Micro-XRCE-DDS-Agent icon indicating copy to clipboard operation
Micro-XRCE-DDS-Agent copied to clipboard

Fixed various TCP-related deadlocks and improved disconnection mechanism

Open sp193 opened this issue 5 years ago • 12 comments

  1. TCPServer::listener_loop() not disconnecting failed/disconnected sockets, leaving the socket unusable.
  2. Fixed TCPServer::recv_locking() and TCPServer::send_locking() not indicating the non-active connection state as an error, which previously resulted in the caller entering an infinite loop.
  3. Synchronize the TCP stream with the client before closing the socket, to avoid possibly losing bytes in flight.

(3) revolves around the fact that send() does not guarantee that the receiver will receive all data. By waiting for the other side (the agent in this case) to also close the connection, we can be sure that all data that should have been received, was received. This also solves another problem: under such a condition, calling close() on the socket() will result in the remote end sending a TCP RST in response to the FIN-ACK.

sp193 avatar Jul 03 '19 11:07 sp193

(1) can happen if the client closes the connection before WSAPoll() or poll() is invoked. Since POLLERR or POLLHUP will be set as the socket status under such a condition and these status bits are not recognized by the agent, the socket descriptor becomes permanently lost.

(2) started happening after (1) was solved. If the sending or receiving operation closes the connection, then a deadlock could be caused by the other operation as the non-active connection state did not result in the error flag getting set.

(3) revolves around the fact that send() does not guarantee that the remote side receives the data. Calling close() from the remote end before all data is received, can result in a TCP RST getting sent in response to the FIN-ACK message by the side initiating the closure (typically the client). This could also prevent a loss of data by ensuring that both ends have entered the disconnection process, since the agent side must shut down the sending too, before the client will call close().

I hope this explanation would be better than my original attempt.

sp193 avatar Jul 03 '19 12:07 sp193

I found that I made a mistake in the Linux TCP server code, which rendered it uncompilable. I have fixed that mistake and also split apart the two set of changes into two commits.

sp193 avatar Aug 25 '19 04:08 sp193

I have also added one more patch for setting SO_REUSEADDR on the server listening socket, to allow the Agent to bind to its listening port after being restarted. I am not sure if your team prefers this behaviour, but it is something desired by mine (to have near zero downtime).

If this is not desired, please feel free to let me know and I shall drop the commit.

sp193 avatar Aug 25 '19 04:08 sp193

Build status:

  • Linux Build Status
  • Windows Build Status

richiware avatar Sep 30 '19 14:09 richiware

The build error was "ERROR: Couldn't find any revision to build. Verify the repository and branch configuration for this job."

sp193 avatar Oct 01 '19 00:10 sp193

@sp193 thanks, We will look at this.

BorjaOuterelo avatar Oct 01 '19 05:10 BorjaOuterelo

Build status:

  • Linux Build Status
  • Windows Build Status

richiware avatar Nov 12 '19 10:11 richiware

Hi @sp193. We want to merge this work in our develop branch. Did you check @julianbermudez comments?.

BorjaOuterelo avatar Nov 13 '19 15:11 BorjaOuterelo

Build status:

  • Linux Build Status
  • Windows Build Status

richiware avatar Nov 13 '19 16:11 richiware

Hi,

Thanks for reviewing my work. I have been very overwhelmed by work, which is why I have been postponing work on this. I have attempted to make the changes as requested.

Please review the similar patch to the Client as well, otherwise the disconnection procedure would not really be improved on: https://github.com/eProsima/Micro-XRCE-DDS-Client/pull/97

The logic here is that the client should attempt to wait for the agent to actively start on the shutdown process, before it closes the TCP connection. Otherwise, the connection may come down on bytes in flight, which may cause the OS to deem the connection as having an error and sends a TCP RST.

sp193 avatar Nov 13 '19 16:11 sp193

Hi @sp193,

Thanks so much to you for helping us to improve Micro XRCE. We need more people like you xD.

I've just reviewed the https://github.com/eProsima/Micro-XRCE-DDS-Client/pull/97.

julionce avatar Nov 18 '19 07:11 julionce

Build status:

  • Linux Build Status
  • Windows Build Status

richiware avatar Nov 19 '20 15:11 richiware