Micro-XRCE-DDS-Agent
Micro-XRCE-DDS-Agent copied to clipboard
Fixed various TCP-related deadlocks and improved disconnection mechanism
- TCPServer::listener_loop() not disconnecting failed/disconnected sockets, leaving the socket unusable.
- Fixed TCPServer::recv_locking() and TCPServer::send_locking() not indicating the non-active connection state as an error, which previously resulted in the caller entering an infinite loop.
- Synchronize the TCP stream with the client before closing the socket, to avoid possibly losing bytes in flight.
(3) revolves around the fact that send() does not guarantee that the receiver will receive all data. By waiting for the other side (the agent in this case) to also close the connection, we can be sure that all data that should have been received, was received. This also solves another problem: under such a condition, calling close() on the socket() will result in the remote end sending a TCP RST in response to the FIN-ACK.
(1) can happen if the client closes the connection before WSAPoll() or poll() is invoked. Since POLLERR or POLLHUP will be set as the socket status under such a condition and these status bits are not recognized by the agent, the socket descriptor becomes permanently lost.
(2) started happening after (1) was solved. If the sending or receiving operation closes the connection, then a deadlock could be caused by the other operation as the non-active connection state did not result in the error flag getting set.
(3) revolves around the fact that send() does not guarantee that the remote side receives the data. Calling close() from the remote end before all data is received, can result in a TCP RST getting sent in response to the FIN-ACK message by the side initiating the closure (typically the client). This could also prevent a loss of data by ensuring that both ends have entered the disconnection process, since the agent side must shut down the sending too, before the client will call close().
I hope this explanation would be better than my original attempt.
I found that I made a mistake in the Linux TCP server code, which rendered it uncompilable. I have fixed that mistake and also split apart the two set of changes into two commits.
I have also added one more patch for setting SO_REUSEADDR on the server listening socket, to allow the Agent to bind to its listening port after being restarted. I am not sure if your team prefers this behaviour, but it is something desired by mine (to have near zero downtime).
If this is not desired, please feel free to let me know and I shall drop the commit.
The build error was "ERROR: Couldn't find any revision to build. Verify the repository and branch configuration for this job."
@sp193 thanks, We will look at this.
Hi @sp193. We want to merge this work in our develop branch. Did you check @julianbermudez comments?.
Hi,
Thanks for reviewing my work. I have been very overwhelmed by work, which is why I have been postponing work on this. I have attempted to make the changes as requested.
Please review the similar patch to the Client as well, otherwise the disconnection procedure would not really be improved on: https://github.com/eProsima/Micro-XRCE-DDS-Client/pull/97
The logic here is that the client should attempt to wait for the agent to actively start on the shutdown process, before it closes the TCP connection. Otherwise, the connection may come down on bytes in flight, which may cause the OS to deem the connection as having an error and sends a TCP RST.
Hi @sp193,
Thanks so much to you for helping us to improve Micro XRCE. We need more people like you xD.
I've just reviewed the https://github.com/eProsima/Micro-XRCE-DDS-Client/pull/97.