anycable-go icon indicating copy to clipboard operation
anycable-go copied to clipboard

Client cannot reconnect after going to sleep

Open bradrees opened this issue 4 years ago • 8 comments

Tell us about your environment

**AnyCable-Go version: 1.0, 1.1

**AnyCable gem version: 1.0, 1.1

What did you do?

When a device goes sleep, after waking the client is not able to reconnect to the websocket. The connection is immediately terminated. Even calling window.reload does not fix the issue.

Note that this is most noticeable when using an iOS WKWebView or Electron, the web browsers seem to have some kind of connection resetting mechanism.

What did you expect to happen?

The client should be able to reconnect upon wake

What actually happened?

The websocket connection is terminated immediately.

We have been struggling to reliably reproduce this in a development environment, however we have seen the following message which may be related.

WARN 2021-11-10T02:48:16.917Z context=node sid=jP8qYdO0jG86C5TsuLD-G Failed to handle incoming message '{"command":"subscribe","identifier":"{"channel":"UserChannel"}"}' with error: Already subscribed to {"channel":"UserChannel"}

It is also worth mentioning this only happens when the client device sleeps for some specific window, if too short the connection will reestablish, and too long and the app seems to unload enough from memory that the when it wakes the connection gets recreated. It is like the connection is left in a half open state, where it can't communicate over it, but it still thinks it is active. Typical times would be from say 4-12 hours of sleep, maybe more.

bradrees avatar Nov 10 '21 03:11 bradrees

I've been trying to test this all day, and so far I've not had much in the way of a solution.

Firstly on the client every 20 seconds the network tab will try to connect to /cable and respond with the following after a while: WebSocket is closed before the connection is established.

I've had some scenarios where I can have one browser connect, while the other does not, then if I restart the server the same browser will reconnect. If I change the anycable-go binary between my fork and the release one, it seems to randomly swap the one that one will connect.

I have forked and added logging to the handler.go file, to see if it even gets that far to connect, but nothing is logged on a connection attempt. I can see in the metrics client_num=0 or 1, accurately reflecting the state of the browsers. The perhaps indicate a lower level network issue?

In addition after I left my laptop for some time, it would not reconnect on either browser window. I continued to do other things, and after about 24 minutes one of the browsers reconnected, then another ~90s later the second browser connected.

I'm not sure my attached PR fixes anything related to this issue, although it does prevent that 'Already subscribed' message from showing up.

bradrees avatar Nov 10 '21 09:11 bradrees

Adding to it, if you reload the page (Normal reload, hard reload, clear cache and hard reload), it does not fix it

bradrees avatar Nov 10 '21 09:11 bradrees

Thanks for the detailed report and the PR!

I will take a look tomorrow.

palkan avatar Nov 10 '21 17:11 palkan

is not able to reconnect to the websocket. The connection is immediately terminated.

Any logs at the anycable-go side? (with -debug on or ANYCABLE_DEBUG=1)?

Do you track failed_auths metrics?

Even calling window.reload does not fix the issue.

That's interesting. No sign of Connect/Disconnect in the logs?

Which client-side library do you use and how you handle reconnects?

palkan avatar Nov 11 '21 15:11 palkan

Thanks for taking a look!

There is no connect or disconnect messages with -debug, when I track the metrics it doesn't show any connections or auth failures with --metrics_log. It is basically silent. I added additional logging to the handler.go, right at the top of the method to see if it even gets as far as registering a connection attempt, but there is nothing logged.

It also takes a while for the connection to fail - it is not instant, if that helps. This is possibly indicating a timeout, rather than some explicit rejection.

We are using the redux-cablecar package as well as standard @rails/actioncable package. Both fail in the same way. Redux-cablecar does use @rails/actioncable for the connection handling though, so this could/should be expected. We can see it trying to reconnect on the client, so no issues there as far as I can see.

We've tried disconnecting, resetting and everything else we can think of on the client side. Come to think of it I might try set up a couple of subdomains to connect to, and the alternate between them on each retry.

I suspect that there is a connection that is being left half open, where one side has closed the connection and the other still retries it thinking it is open. However I'm not sure which side is being kept open, or even how we would go about forcefully closing it.

I can reset the connection on mobile by 'swiping away' the app, and likewise on Electron I can force reload, which I believe might close the renderer process and restart it, compared to Chrome/Edge which seem to reuse the same process on soft and hard reload.

bradrees avatar Nov 18 '21 03:11 bradrees

We can see it trying to reconnect on the client, so no issues there as far as I can see.

And it just fails to reconnect without any logs at the server side?

It also takes a while for the connection to fail

You mean for "Disconnect" logs to appear on the server?

Yeah, that looks like the default TCP timeout (for CLOSE_WAIT connections).

Btw, were you using locally running anycable-go for reproduction or this only happens in production environment?

Come to think of it I might try set up a couple of subdomains to connect to, and the alternate between them on each retry

Maybe, worth trying adding a random query param to the connection url (if this is some kind of browser connection keepalive mechanism, that could help).

palkan avatar Nov 19 '21 15:11 palkan

I have an update - I have tried using a query string and that did not help, so I then tried using multiple subdomains. At first appeared to help, however we've seen the issue again on Safari Desktop. It also means that sometimes it will take quite a lot longer to reconnect as the first connection needs to timeout first - we've set it to try the alternate subdomain after 5 seconds of not establishing a connection.

This happens in our local, staging and production environments, however as we are not fully launched with the app yet, our production and staging environments only had a single instance/service running. I've increased this to have multiple instances yesterday and will keep you updated.

bradrees avatar Dec 09 '21 08:12 bradrees

Hey @bradrees!

Any updates on the issue?

palkan avatar Jul 08 '22 02:07 palkan

Closed as stale

palkan avatar Nov 24 '22 17:11 palkan

I have resolved it, using a combination of things although I'm not sure which one actually fixed it, or it was a combo of several factors, however once we added multiple instances we stopped seeing the issue.

I now run multiple subdomains and trigger a reconnect to a different subdomain when reconnecting. I am also running multiple instances of the anycable-go server anycable rails server, so as to connect to a completely separate instance. This seems to work well and I haven't seen the issue re-appear. 🤞

bradrees avatar Dec 02 '22 07:12 bradrees