grist-core
Support HTTP long polling as an alternative to WebSockets
Motivation
This PR is a proposed fix for #840.
As mentioned in that issue, the motivation for supporting an alternative to WebSockets is that while all browsers supported by Grist offer native WebSocket support, some networking environments do not allow WebSocket traffic.
Plan
Engine.IO was selected as the underlying implementation of HTTP long polling, after careful consideration as described here. Engine.IO unifies WebSockets and long polling under a single API, and transparently upgrades to WebSockets when they are available and tested to work.
Progress so far
References to the WebSocket API and the `ws` API have been isolated into new classes called `GristClientSocketWS`, `GristSocketServerWS` and `GristServerSocketWS`, derived from abstract classes `GristClientSocket`, `GristSocketServer` and `GristServerSocket`.
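As a rough illustration of this layering, the abstraction might look like the sketch below. Only the class names come from this PR; the exact method set (`send`, `close`, and the event callbacks) is an assumption for illustration.

```typescript
// Hypothetical sketch of the client-side abstraction. The real classes in
// the PR wrap either the browser's native WebSocket / the `ws` package
// (GristClientSocketWS) or Engine.IO (GristClientSocketEIO).

abstract class GristClientSocket {
  abstract send(data: string, cb?: (err?: Error) => void): void;
  abstract close(): void;
  onmessage: ((data: string) => void) | null = null;
  onerror: ((err: Error) => void) | null = null;
  onclose: (() => void) | null = null;
}

// Minimal WebSocket-backed implementation; the underlying socket is
// typed structurally here so the sketch stays self-contained.
class GristClientSocketWS extends GristClientSocket {
  constructor(private ws: { send(d: string): void; close(): void }) {
    super();
  }
  send(data: string, cb?: (err?: Error) => void): void {
    this.ws.send(data);
    cb?.(); // report success; error handling omitted in this sketch
  }
  close(): void {
    this.ws.close();
  }
}
```

Callers only ever see `GristClientSocket`, so the Engine.IO variant can be swapped in without touching call sites.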
The Engine.IO-based alternative to WebSockets was built into classes named `GristClientSocketEIO`, `GristSocketServerEIO` and `GristServerSocketEIO`.
Because I expect the vast majority of Grist users to be perfectly able to use native WebSockets (all existing Grist users are of course in this case, since there has not been an alternative so far), I decided to override the default behaviour of Engine.IO, which is to start every connection in HTTP long-polling mode with a subsequent upgrade to WebSockets. Instead, `GristClientSocketEIO` will first try to connect using the `websocket` transport and only if this fails, fall back to the `polling` transport.
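The connection strategy above can be sketched as follows. This is a hypothetical illustration: the real code drives Engine.IO's transport options and is asynchronous, while the sketch uses a synchronous stand-in for the connection attempt.

```typescript
// Hypothetical sketch of "websocket first, polling only on failure",
// inverting Engine.IO's default "polling first, upgrade later" behavior.

type Transport = "websocket" | "polling";

function connectWithFallback(
  // Stand-in for a real connection attempt; returns true on success.
  tryConnect: (t: Transport) => boolean,
): Transport | null {
  // Most users can use native WebSockets, so try that first.
  if (tryConnect("websocket")) return "websocket";
  // Only if the WebSocket attempt fails, fall back to long polling.
  if (tryConnect("polling")) return "polling";
  return null;
}
```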
Factory methods `GristClientSocket.create` and `GristSocketServer.create` switch between implementations according to a new `GRIST_USE_ENGINE_IO` (boolean) environment variable. This variable has no impact on the build, so for example a built Grist Docker image can switch between transports by simply being restarted with a different value of `GRIST_USE_ENGINE_IO`.
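The factory switch is conceptually simple; a hypothetical sketch is below. The helper name `isAffirmative` and the exact set of accepted truthy values are assumptions for illustration, not necessarily what Grist uses.

```typescript
// Hypothetical sketch of the backend selection in the create() factories.

function isAffirmative(value: string | undefined): boolean {
  // Assumed set of truthy spellings; the real parser may differ.
  return ["1", "true", "yes", "on"].includes((value ?? "").toLowerCase());
}

function selectBackend(env: Record<string, string | undefined>): "eio" | "ws" {
  // Evaluated at run time only: nothing is baked into the build, so a
  // restarted container can change transports without rebuilding.
  return isAffirmative(env.GRIST_USE_ENGINE_IO) ? "eio" : "ws";
}
```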
An extra axis (`transport: ['ws', 'eio']`) was added to the build matrix in `.github/workflows/main.yml`, to run all the tests once with the WebSocket backend and once with the Engine.IO backend. Naturally, this is overkill and will likely have to be removed before shipping, but while building out the feature it should help attribute regressions to either the Engine.IO backend itself or to changes in adjacent code.
Currently, a Grist server started with `GRIST_USE_ENGINE_IO=yes` seems to function correctly, including in contexts where WebSockets are not available. I tested Firefox with the `network.websocket.max-connections` setting set to `0`, and the https://github.com/gristlabs/grist-omnibus image with its Traefik configuration modified to strip the `Upgrade` header from relayed requests, causing attempts to establish a WebSocket connection to fail. In both cases the client instantly switched to polling.
To get all of the test cases to pass (particularly the `Comm` tests) I had to add some emulation of the behavior of the `ws` library to the Engine.IO-based implementation. In particular, while the Engine.IO socket's `send` method accepts a callback, it is only called upon success and not if the transfer fails. Because a failed transfer immediately causes the socket to be closed, I keep track of messages whose callback has not yet been called and invoke those callbacks upon `close` if any are still pending.
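A hypothetical sketch of that bookkeeping is below. The class name and shape are assumptions; the idea from the PR is only that callbacks still pending when the socket closes get invoked with an error, matching what `ws` would do.

```typescript
// Hypothetical sketch: emulate `ws` send-callback semantics on top of a
// transport that only confirms successes.

class PendingCallbacks {
  private pending = new Set<(err?: Error) => void>();

  // Wrap a send callback; the returned function is what the transport
  // calls on confirmed delivery.
  track(cb: (err?: Error) => void): (err?: Error) => void {
    this.pending.add(cb);
    return (err?: Error) => {
      this.pending.delete(cb); // delivery confirmed, forget it
      cb(err);
    };
  }

  // On socket close, fail every message whose callback never fired.
  failAll(reason: Error): void {
    for (const cb of this.pending) cb(reason);
    this.pending.clear();
  }
}
```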
In addition to these API differences, this work has exposed several issues in Engine.IO, some of which I have already reported (https://github.com/socketio/engine.io/issues/695, https://github.com/socketio/engine.io/issues/696, https://github.com/socketio/engine.io/issues/698). One issue in particular (https://github.com/socketio/engine.io/issues/698) required me to monkey-patch the `websocket` transport provided by Engine.IO. The fix is relatively simple, however, so I hope the change I proposed to Engine.IO will be accepted.
Compatibility notes
Although Engine.IO will use WebSockets when possible, it adds some framing to the messages that makes it impossible to connect an Engine.IO-based client to a raw WebSocket server, so the client and server would have to switch backends in sync. This will naturally happen upon a page refresh since the configuration flows from the server to the client, but already-opened clients will need a manual refresh.
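To illustrate the framing mismatch: the Engine.IO protocol prepends a packet-type character to each payload ("4" marks a message packet), so a raw-WebSocket peer would receive `"4hello"` where it expects `"hello"`. This is a simplified sketch of one direction only; the real protocol also exchanges open/close/ping/pong packets.

```typescript
// Simplified Engine.IO message framing, to show why a raw WebSocket
// endpoint cannot interoperate with an Engine.IO peer.
function encodeEngineIOMessage(payload: string): string {
  return "4" + payload; // "4" = MESSAGE packet type in the Engine.IO protocol
}
```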
Open questions
Should the raw WebSocket backend remain available? In theory, the Engine.IO backend will use WebSockets whenever possible and therefore it would seem unnecessary to maintain a separate implementation. However, this represents a significant change in a critical component of Grist and a conservative approach might be called for. Perhaps a typical sequence of making the new backend available, then making it the default, and finally deprecating the legacy backend would make sense?
> Should the raw WebSocket backend remain available? In theory, Engine.IO will upgrade to the more efficient transport (i.e. WebSockets over long polling) and therefore it would seem unnecessary to maintain a separate implementation. However, this represents a significant change in a critical component of Grist and a conservative approach might be called for. Perhaps a typical sequence of making the new backend available, then making it the default, and finally deprecating the legacy backend would make sense?
There's a case for doing that, since CI style testing doesn't really capture the diversity of browsers and company firewalls and network latency out there. But I'd guess this would involve significant more work? An alternative would be, once we're happy with the PR, to tag it for release to a fly.io preview system we have before it lands. Then we could have volunteers from all over the world (recruited from our discord) hammer on it a bit. That could still miss a problem that only shows up after e.g. multiple days work with intermittent laptop suspends or whatever, but it would increase confidence that whatever problem remains it'll be something we can handle in follow-up.
> There's a case for doing that, since CI style testing doesn't really capture the diversity of browsers and company firewalls and network latency out there. But I'd guess this would involve significant more work? An alternative would be, once we're happy with the PR, to tag it for release to a fly.io preview system we have before it lands. Then we could have volunteers from all over the world (recruited from our discord) hammer on it a bit. That could still miss a problem that only shows up after e.g. multiple days work with intermittent laptop suspends or whatever, but it would increase confidence that whatever problem remains it'll be something we can handle in follow-up.
To make sure we're on the same page: if, when saying "this would involve significant more work", you're referring to the complexity of having both implementations coexist in the code, this is something I have already handled. I briefly tried a wholesale substitution at the start but realized I was not going to be able to ensure the absence of regressions if I could not keep running the previous (raw WebSockets) implementation side by side.
Basically, with the code in this PR, setting `GRIST_USE_ENGINE_IO` (at run time — it has no effect on the build) to a falsy value (or not setting it at all) causes the server and client to use the legacy WebSocket-based implementation — albeit with a slightly more convoluted code path due to the `GristSocketServer`/`GristClientSocket` abstractions I introduced. I am highly confident that this mode of execution provides 100% compatibility with the existing code.
However you may have been referring to extra work on the release management and support side of things — in which case I agree that there would be an added burden in supporting these two configurations.
In any case, having this new implementation put through its paces by volunteers sounds like a great idea. I have just managed to get all the tests to pass, so I'll mark this PR as ready for review and hopefully it can be deployed on your testing environment.
I have updated the PR description to reflect the current implementation status.
I have also pushed a commit that adds `GRIST_USE_ENGINE_IO=1` to the environment in `fly-template.toml`; hopefully this is how an environment variable is added to a deployment?
The Fly deployment failure seems caused by the fact that PRs based on branches from forks do not have access to repository secrets, for obvious security reasons. You may have to clone this branch to one local to this repository?
> The Fly deployment failure seems caused by the fact that PRs based on branches from forks do not have access to repository secrets, for obvious security reasons. You may have to clone this branch to one local to this repository?
Yes, you're right, done. And thanks for adding the environment variable, I was just kicking the preview machinery to make sure it still works. For anyone following along, a preview is over in https://grist-http-long-polling.fly.dev/
For the work: thanks for reminding me you had this feature under a flag already. If code review and preview all seem fine I think I'll be voting for making the new behavior the default, though it will be good that any installation that sees trouble has an easy way back to the old behavior.
Will get someone on reviewing this as soon as possible. Thanks a lot for your work! Will be exciting to have this extra level of robustness.
(just did a quick test in firefox with `network.websocket.max-connections` set to 0 and working perfectly so far with no websockets, very nice)
Testing with a more complex deployment including multiple doc workers and a load balancer revealed that the addition of the `/engine.io/` prefix to socket requests meant that some load balancer configurations might need to be updated.
I'll be trying to get rid of that prefix so that Grist admins don't need to rework their setup.
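Serving both protocols on the same path is feasible because the two kinds of traffic are distinguishable at the HTTP level: a raw WebSocket client arrives as an Upgrade request, while Engine.IO long polling arrives as plain GET/POST requests carrying a `transport=polling` query parameter. The sketch below is a hypothetical illustration with simplified request fields, not Grist's actual routing code.

```typescript
// Hypothetical dispatch of same-path socket traffic. Header names are
// assumed lowercased, as Node's http module does for incoming requests.

interface ReqLike {
  headers: Record<string, string | undefined>;
  url: string;
}

function routeSocketRequest(req: ReqLike): "websocket" | "polling" | "other" {
  if ((req.headers["upgrade"] ?? "").toLowerCase() === "websocket") {
    return "websocket"; // raw WebSocket handshake
  }
  if (req.url.includes("transport=polling")) {
    return "polling"; // Engine.IO long-polling request
  }
  return "other"; // ordinary HTTP traffic
}
```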
Hi! I added a few commits that:

- remove the `/engine.io/` prefix previously added to WebSocket and polling request paths; this should make the new protocol compatible with practically any existing deployment, as the URLs are identical. This fixed the deployment on our local staging instances, which use path-based routing;
- fix reconnection behavior when `GRIST_USE_ENGINE_IO=1` (the client wouldn't re-attempt a connection after an initial failure);
- implement a proper `terminate()` method for Engine.IO-based sockets, for quicker connection interruption when necessary;
- add the necessary CORS headers for polling to work even in a deployment where workers are on a different domain from the home server.

@paulfitz if you update the local branch we can see if this still works on the Fly deployment. @dsagal I'm looking forward to your feedback when you find the time to look at this.
Hi @dsagal , I finally found the time to work on this again, putting in what will hopefully be the last big change.
For origin verification, I added a `verifyClient` callback to `GristSocketServer` which lets the caller filter both WebSocket and polling connections. When instantiating `GristSocketServer`, `Comm` passes a `verifyClient` callback that calls `trustOrigin` — which I modified as discussed to use `IncomingMessage` instead of `Request`. Hopefully this is not too far from what you had in mind.
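The shape of such a hook could look like the following hypothetical sketch, checking the `Origin` header of an `IncomingMessage`-like request against a set of trusted origins. Grist's real `trustOrigin` logic is more involved; the accept-when-no-Origin policy here is an assumption for illustration.

```typescript
// Hypothetical verifyClient factory. The request is typed structurally so
// the sketch stays self-contained instead of importing Node's http types.

interface IncomingMessageLike {
  headers: Record<string, string | undefined>;
}

function makeVerifyClient(trusted: Set<string>) {
  return (req: IncomingMessageLike): boolean => {
    const origin = req.headers["origin"];
    // Non-browser clients may send no Origin header at all; whether to
    // accept them is a policy choice, assumed permissive here.
    if (origin === undefined) return true;
    return trusted.has(origin);
  };
}
```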
More significantly, I have changed my approach to integrating Engine.IO. It is no longer introduced as a unifying abstraction over WebSockets and long polling, as I initially hoped to do. Instead, the existing WebSocket-based protocol remains the default communication method, with an automatic fallback to the long-polling implementation provided by Engine.IO. Among other things, this solves a problem revealed by the `ManyFetches` test suite, which is that the memory usage of the server-side Engine.IO sockets was much higher than that of the basic `ws` sockets — I could not determine exactly why, but this did not inspire much confidence.
The most commonly taken path will therefore remain the native WebSocket connection without the overhead of the Engine.IO protocol, which makes regressions much less likely.
This in turn led me to remove the `GRIST_USE_ENGINE_IO` environment variable, which had been an annoyance for which I couldn't formulate a clear exit plan. There is now only one implementation of `GristSocketServer` and `GristClientSocket`, which includes the aforementioned fallback behavior.
I have squashed most of the commits into one that introduces the `GristSocketServer` and `GristClientSocket` classes, keeping apart the handful of commits that were not directly related to the main work.
Apologies for the review churn, and as always I'm looking forward to your feedback.
I just updated the preview at https://grist-http-long-polling.fly.dev/ and edited a doc using firefox with `network.websocket.max-connections=0` in `about:config`. Worked fine! Checked the fly.io backend logs and confirmed that I see `transport=polling` in the logs. When I enable websockets again and reload, I don't see `transport=polling`.
Thanks @jonathanperret and @dsagal ! This is really great to have.
One thing that came to my mind recently as possible follow-up. @jonathanperret is it possible to determine whether long-polling fallback is happening consistently, and log a warning if so? Not setting up websocket forwarding correctly is a fairly common configuration problem (one example of many: https://github.com/gristlabs/grist-core/issues/261). Until now people have been forced to notice and fix this problem since they can't use documents at all until they get it right. But now, they might never even notice. Which isn't terrible, but seems a bit sub-optimal?
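One way to implement that follow-up would be to count transports per connection and warn once when long polling dominates. This is a hypothetical sketch, not code from the PR; the class name and the threshold values are made up for illustration.

```typescript
// Hypothetical monitor: warn once if long-polling fallback is the norm,
// which usually indicates misconfigured WebSocket forwarding on a proxy.

class TransportMonitor {
  private ws = 0;
  private polling = 0;
  private warned = false;

  record(transport: "websocket" | "polling", warn: (msg: string) => void) {
    if (transport === "websocket") this.ws++;
    else this.polling++;
    const total = this.ws + this.polling;
    // Arbitrary illustrative threshold: at least 10 connections seen,
    // more than 90% of them falling back to polling.
    if (!this.warned && total >= 10 && this.polling / total > 0.9) {
      this.warned = true;
      warn(
        "Most connections are falling back to HTTP long polling; " +
        "check WebSocket forwarding on your reverse proxy.",
      );
    }
  }
}
```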
Thank you guys for helping bring this to fruition. Can't wait to deploy it and unlock the power of Grist for our restricted-network users.
@paulfitz I see what you mean. I'm not optimistic however that a Grist administrator would notice such a warning, particularly since it wouldn't occur right on server startup.
Maybe calling out WebSockets as worthy of attention in the self-managing docs would help more? It appears there are zero hits for "WebSocket" currently on https://support.getgrist.com.
> @paulfitz I see what you mean. I'm not optimistic however that a Grist administrator would notice such a warning, particularly since it wouldn't occur right on server startup.
I agree, that is true. There is an admin panel coming soon where installation problems can be highlighted. If the warning existed, then the developer working on that panel would just need to find it in the code and hook up a way to report it better.
> Maybe calling out WebSockets as worthy of attention in the self-managing docs would help more? It appears there are zero hits for "WebSocket" currently on https://support.getgrist.com.
Good point! We have a systems engineer joining the team next week whose mandate will include smoothing out the self-hosting experience. I've added that to the list of easy things to start with :-)