Fix reconnecting websockets on heartbeat timeout

Open laurglia opened this issue 1 year ago • 3 comments

In Google Chrome, when the network connection is interrupted, WebSocket connections break but do not close properly. The following happens:

  • The error and close events are not triggered
  • Sending data with WebSocket#send works without errors even though the data is not actually sent
  • When calling WebSocket#close, the websocket is not closed. That is expected, as closing a websocket includes sending a closing handshake which cannot be done when the underlying connection is broken. The closing handshake is documented in the WebSockets Standard.

Because of the behavior described above, Phoenix failed to reconnect sockets that use the WebSocket transport. When the heartbeat timed out, WebSocket#close was used to close the connection, but since the connection was broken, sending the closing handshake failed and the close event was not triggered.
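The failure mode can be modelled without a real network. The following is a minimal sketch, not Phoenix code: `HalfOpenSocket` is a hypothetical stand-in that imitates the behavior listed above (send succeeds silently, close never produces a close event), and `reconnectScheduled` is an illustrative flag standing in for the client's reconnect logic.

```javascript
// Hypothetical stand-in that models the half-open behavior described above.
class HalfOpenSocket {
  constructor() {
    this.readyState = 1; // OPEN
    this.onclose = null;
    this.sent = [];
  }

  send(data) {
    // Accepts the data without an error, although nothing reaches the server.
    this.sent.push(data);
  }

  close() {
    // Enters CLOSING, but the closing handshake cannot complete,
    // so the onclose handler is never invoked.
    this.readyState = 2; // CLOSING
  }
}

let reconnectScheduled = false;
const conn = new HalfOpenSocket();
conn.onclose = () => { reconnectScheduled = true; };

conn.send("heartbeat");
conn.close(); // what the client did on heartbeat timeout

console.log(conn.readyState, reconnectScheduled); // prints "2 false"
```

Any reconnect logic that only runs from the close handler stays idle here, which matches the stall seen in Chrome.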

After around 2-3 minutes, the WebSocket does finally close (the close event triggers), but that is too long a delay for our use case.

To demonstrate the issue I have created a toy Phoenix project which sets up a WebSocket connection and displays Phoenix Socket logs on the page. The project is here: https://github.com/laurglia/phoenix-reconnection-issue

To reproduce the problem:

  1. Clone that repository
  2. Run the Phoenix server with mix phx.server on a machine outside your local network
  3. Open the index route in Google Chrome
  4. Interrupt the network, for example, connect to another WiFi
  5. Observe that the Phoenix heartbeat fails and the log says it is going to reconnect, but in reality it does not reconnect immediately. You can also use the "Ping" button at the top of the page to push a "ping" packet and observe that it is not sent.
  6. The WebSocket connection is reestablished only after around 1-3 minutes

To make testing easier, I have set up a server which runs the toy project. You can access the server at http://34.219.19.64:4000/. Here is also a video of me reproducing the issue: https://youtu.be/BP3rK06p0Ww

This commit fixes the issue by "tearing down" the WebSocket connection after a heartbeat timeout instead of just attempting to close it. The teardown does not necessarily close the connection, but it removes all references to it, allowing a new connection to be created in its place.
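A rough sketch of that idea, under stated assumptions: `SketchSocket`, `createTransport`, `ackHeartbeat` and the other names are hypothetical and do not reflect Phoenix's actual implementation. For brevity the sketch reconnects immediately on timeout, where a real client would more likely go through a reconnect timer.

```javascript
// Hypothetical sketch; names are illustrative and not Phoenix's actual API.
class SketchSocket {
  constructor(createTransport) {
    this.createTransport = createTransport; // factory returning a WebSocket-like object
    this.conn = null;
    this.pendingHeartbeatRef = null;
    this.ref = 0;
  }

  connect() {
    if (this.conn) return;
    this.conn = this.createTransport();
    this.conn.onclose = () => { this.conn = null; };
  }

  // Called on every heartbeat interval tick.
  sendHeartbeat() {
    if (!this.conn) return;
    if (this.pendingHeartbeatRef !== null) {
      // The previous heartbeat was never acknowledged: treat it as a timeout.
      this.pendingHeartbeatRef = null;
      this.teardown();
      this.connect();
      return;
    }
    this.ref += 1;
    this.pendingHeartbeatRef = String(this.ref);
    this.conn.send(JSON.stringify({ event: "heartbeat", ref: this.pendingHeartbeatRef }));
  }

  // Called when the server's reply to a heartbeat arrives.
  ackHeartbeat(ref) {
    if (ref === this.pendingHeartbeatRef) this.pendingHeartbeatRef = null;
  }

  teardown() {
    const stale = this.conn;
    if (!stale) return;
    stale.onclose = () => {}; // a late close event must not clear the new connection
    try { stale.close(); } catch (_err) { /* best effort only */ }
    this.conn = null; // drop the reference even though the socket may still be open
  }
}

// Usage with a stand-in transport that never answers heartbeats.
let created = 0;
const socket = new SketchSocket(() => ({ id: ++created, send() {}, close() {}, onclose: null }));
socket.connect();
socket.sendHeartbeat(); // heartbeat sent, never acknowledged
socket.sendHeartbeat(); // timeout: tear down and replace the connection
console.log(socket.conn.id); // prints 2
```

The point of the design is that the replacement does not wait for the old connection's close event, which may arrive minutes later or not at all.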

I also needed to make some changes to the teardown functionality. Previously, it was not necessary to remove listeners from a connection, because once a connection was closed it could no longer emit any events. Now, however, the old connection may stay open for some time and emit a close event a few minutes later, which could cause problems if it has already been replaced with a new connection. That is why we now set this.conn.onopen, this.conn.onerror and the other listeners to no-ops when tearing down the connection.
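The effect of the no-op listeners can be illustrated with plain objects. This is a sketch under assumptions: `detachAndDrop`, `owner` and `makeConn` are hypothetical names, and the stand-in connections only carry the handler properties, not a real socket.

```javascript
const noop = () => {};

// Hypothetical helper: detach every listener, then drop the reference.
function detachAndDrop(owner) {
  const stale = owner.conn;
  if (!stale) return null;
  stale.onopen = noop;
  stale.onerror = noop;
  stale.onmessage = noop;
  stale.onclose = noop;
  owner.conn = null;
  return stale;
}

const owner = { conn: null };

// Stand-in connection whose close handler clears the owner's current connection.
const makeConn = (name) => {
  const conn = { name, onopen: noop, onerror: noop, onmessage: noop };
  conn.onclose = () => { owner.conn = null; };
  return conn;
};

owner.conn = makeConn("first");
const stale = detachAndDrop(owner); // heartbeat timed out, tear down
owner.conn = makeConn("second");    // replacement connection

stale.onclose(); // the browser finally reports the close, minutes later
console.log(owner.conn.name); // prints "second"
```

Without the detach step, the late close event from the first connection would have cleared the reference to the second one.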

On my test server, I am running the updated version of Phoenix on my toy project. You can access the updated version here: http://34.219.19.64:4001/

Here is a video of me showing that the issue has been fixed in the updated version: https://youtu.be/3t4nckli07k

laurglia, Aug 22 '22 12:08