
zeromq server has 1 min delay when client connects then disconnects and then connects back in quick succession

Open neilbhere opened this issue 3 years ago • 2 comments


Issue description

I am using the zeromq-4.0.1 source code. In production, there is 1 server and multiple clients. The connections between these switches use ROUTER sockets. Because of a link flap, a client disconnects from the server and, after recovery, connects back. However, on reconnect the client gets a new fd. The server-side code never receives ZMQ_EVENT_DISCONNECTED directly (possibly because, during the link flap, the client's FIN does not reach the server). Instead, the server receives a ZMQ_EVENT_ACCEPTED event for the client, but with a different fd. This is where the delay starts: at this point there are two fds for the same client, and the new fd is not operational. After some time, the client receives the message below for the old fd.

"ZmQ: int zmq::stream_engine_t::read(void*, size_t):923 Stream engine recv(): TCP socket (187) to unknown:0 was disconnected with error 107 [Transport endpoint is not connected]"


Environment

x86 Itanium, 32-bit

  • libzmq version (commit hash if unreleased): zeromq-4.0.1

  • OS: windriver linux

Minimal test code / Steps to reproduce the issue

  1. Server receives ACCEPTED event for clientY and gets FD1.
  2. Link-flap/network issue happens and clientY disconnects but server does not receive this disconnect.
  3. Network recovers and clientY connects back to server.
  4. Server receives ACCEPTED event for clientY and gets FD2.
  5. However, packets sent to this socket do not go out of the server.
  6. After 1 min or so, clientY receives a "Transport endpoint is not connected" error for FD1.
  7. After this, FD2 becomes active.
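
The event sequence in the steps above can be modeled with a small bookkeeping sketch in C. It assumes the libzmq 4.x monitor frame layout (a 16-bit event id followed by a 32-bit value, which for ACCEPTED and DISCONNECTED is the fd). `client_fds_t`, `decode_event`, and `apply_event` are hypothetical names, not part of libzmq, and associating an fd with a particular client is assumed to be done by the application.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same numeric values as ZMQ_EVENT_ACCEPTED / ZMQ_EVENT_DISCONNECTED in zmq.h. */
#define EV_ACCEPTED     0x0020
#define EV_DISCONNECTED 0x0200

#define MAX_FDS 4

/* Hypothetical record of the fds the server currently holds for one client. */
typedef struct {
    int fds[MAX_FDS];
    int count;
} client_fds_t;

/* Decode the first monitor frame: 16-bit event id, then 32-bit value,
 * both in host byte order. Returns 0 on success, -1 if the frame is short. */
int decode_event(const unsigned char *frame, size_t len,
                 uint16_t *event, int32_t *value)
{
    if (len < sizeof *event + sizeof *value)
        return -1;
    memcpy(event, frame, sizeof *event);
    memcpy(value, frame + sizeof *event, sizeof *value);
    return 0;
}

/* Apply one event to the record and return how many fds remain open.
 * A return value above 1 corresponds to steps 4-6: a new fd was accepted
 * while the old one has not yet been reported as disconnected. */
int apply_event(client_fds_t *c, uint16_t event, int32_t fd)
{
    assert(c != NULL);
    if (event == EV_ACCEPTED && c->count < MAX_FDS) {
        c->fds[c->count++] = fd;
    } else if (event == EV_DISCONNECTED) {
        for (int i = 0; i < c->count; i++) {
            if (c->fds[i] == fd) {
                /* Remove by moving the last entry into the freed slot. */
                c->fds[i] = c->fds[--c->count];
                break;
            }
        }
    }
    return c->count;
}
```

In a real monitor loop the frame would come from a `ZMQ_PAIR` socket connected to the endpoint passed to `zmq_socket_monitor()`. Logging whenever `apply_event` returns more than 1 would timestamp how long the old fd lingers.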

What's the actual result? (include assertion message & call stack if applicable)

"ZmQ: int zmq::stream_engine_t::read(void*, size_t):923 Stream engine recv(): TCP socket (187) to unknown:0 was disconnected with error 107 [Transport endpoint is not connected]"

"ZmQ: virtual void zmq::router_t::xpipe_terminated(zmq::pipe_t*):159 xterminating pipe 0x1aeea7a0 (0d010110)"

What's the expected result?

As the steps show, the new fd becomes operational only after step 6. Sometimes this disconnect event takes more than 1 min to reach the application code, which causes the delay. Is this a known issue in zeromq? Is there a way to handle this scenario? The expected result is for the new fd to become available immediately, or for there to be a way for the server to handle the old fd. There should not be this 1 min delay.
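
One setting that may be worth testing for the old fd is TCP keepalive, which libzmq exposes through the `ZMQ_TCP_KEEPALIVE`, `ZMQ_TCP_KEEPALIVE_IDLE`, `ZMQ_TCP_KEEPALIVE_INTVL`, and `ZMQ_TCP_KEEPALIVE_CNT` socket options. The sketch below is not libzmq code: it sets the corresponding Linux options on a plain TCP socket to show what those values mean. `keepalive_cfg_t` and both function names are hypothetical, and whether keepalive shortens the delay described here is an untested assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical bundle of keepalive settings, all in seconds except probes. */
typedef struct {
    int idle_s;      /* idle time before the first probe is sent */
    int interval_s;  /* time between unanswered probes */
    int probes;      /* unanswered probes before the connection is dropped */
} keepalive_cfg_t;

/* Enable keepalive on a TCP socket (Linux option names).
 * Returns 0 on success, -1 if any setsockopt call fails. */
int enable_keepalive(int fd, const keepalive_cfg_t *cfg)
{
    int on = 1;
    assert(cfg != NULL);
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &cfg->idle_s, sizeof cfg->idle_s) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &cfg->interval_s, sizeof cfg->interval_s) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &cfg->probes, sizeof cfg->probes) != 0)
        return -1;
    return 0;
}

/* Rough upper bound on how long a silently dead peer goes unnoticed:
 * the idle period plus one interval per unanswered probe. */
int approx_detection_s(const keepalive_cfg_t *cfg)
{
    assert(cfg != NULL);
    return cfg->idle_s + cfg->interval_s * cfg->probes;
}
```

With illustrative values of 10 s idle, 5 s interval, and 3 probes, the bound works out to about 25 s. With libzmq, the same numbers would be passed via `zmq_setsockopt()` on the ROUTER socket before `zmq_bind()`.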

neilbhere, Oct 31 '20 19:10

Do I need to provide any other details? If you have faced a similar issue and know a solution, please share.

neilbhere, Nov 03 '20 14:11

This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions.

stale[bot], Apr 16 '22 18:04