sprocket Reconnection handling: Only tries once, needs continuous retry for production use

Hey there! 👋 First of all, thanks for creating Sprocket - it's a really nice library and most things work great!

I've run into an issue that's currently blocking us from using it in production. When the connection drops (like when closing a laptop lid or backgrounding a mobile browser), Sprocket only tries to reconnect once. I've noticed this in the sprocket_starter demo - when I close my MacBook's lid for a short time, the close handler runs. I imagine the same would happen on mobile devices when the browser goes into the background.

After this single reconnection attempt fails, subsequent socket.send() calls fail without any recovery path. This makes it challenging to use in production where these kinds of interruptions are common.

Claude suggested this code for implementing continuous reconnection:

socket.addEventListener("close", function (_event) {
    topbar.show();
    console.log("Connection lost, will keep trying to reconnect...");
    
    // Create the reconnection function that will keep trying
    const attemptReconnect = () => {
        console.log("Attempting to reconnect...");
        // Create new socket and store its reference
        const newSocket = connect(path, opts);
        
        // Monitor this new connection attempt
        newSocket.addEventListener("open", () => {
            console.log("Successfully reconnected!");
            topbar.hide();
        });
        
        newSocket.addEventListener("close", () => {
            console.log("Reconnection failed, trying again in 5 seconds...");
            setTimeout(attemptReconnect, 5000);
        });
    };
    
    // Start the first reconnection attempt after 5 seconds
    setTimeout(attemptReconnect, 5000);
});

However, I haven't tried implementing this yet, and I'm wondering if we also need to handle queuing/freezing socket events until a new connection is established? I'm not very experienced with TypeScript/JavaScript (even though I authored a VSCode plugin that's about to hit 10 million installs... go figure 😅), so I'd really appreciate any guidance on the best way to implement this.
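One refinement that might be worth considering for the snippet above is to grow the delay between attempts instead of always waiting 5 seconds, so that a server that is down is not hit by every client at the same moment. This is a minimal sketch, assuming a hypothetical `reconnectDelay` helper that is not part of Sprocket, with illustrative constants:

```javascript
// Hypothetical helper, not part of Sprocket: computes the wait before
// reconnect attempt number `attempt` (0-based) using capped exponential
// backoff. The base and cap values are illustrative assumptions.
function reconnectDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Optional jitter so many clients do not retry in lockstep.
// `random` is injectable to keep the function testable.
function withJitter(delayMs, random = Math.random) {
  // Picks a value between half the delay and the full delay.
  return Math.floor(delayMs / 2 + random() * (delayMs / 2));
}
```

In the close handler this would replace the fixed timeout, for example `setTimeout(attemptReconnect, withJitter(reconnectDelay(attempt++)))`, with `attempt` reset to 0 in the "open" handler.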

Would love to hear your thoughts on:

Whether continuous reconnection attempts make sense
If we should add some user notification about connection issues
How to handle messages during disconnection periods

Thanks for considering this! Let me know if you need any additional information.

Jan 11 '25 09:01 oderwat

Hi, thanks for the suggestion! This is a great idea and something that has been on my todo list. I've previously explored an approach similar to this and also tried using an existing library such as reconnecting-websocket, but I wasn't able to get it working quite the way I wanted, so I set it aside.

But this is certainly something we will want to fix for any real production apps. I think we can take some inspiration from Phoenix LiveView here and do a similar approach.

Continuing to try to reconnect totally makes sense. I think there are 2 cases we need to consider here, but to keep things simple we can probably just focus on 1.1 to start and introduce 1.2 later as an optimization:
1.1. First, the connection has been terminated for a long period of time (computer/phone sleep, laptop closed, etc.), in which case the server process will terminate, and when the client returns and tries to reconnect, a new session will be started with a complete state refresh from the server. This could be done either with a DOM replacement or by simply forcing a complete browser refresh.
1.2. Second is the case where there is some flakiness in the network connection, resulting in a short disconnect after which the client reconnects soon. In this case it might be okay to leave the server state in place for a short period of time to allow the client to try to reconnect. We can also likely inform the server of an intentional disconnect (e.g. closing the browser tab) as opposed to an abrupt/unexpected disconnect, so the server can decide whether to wait a short period for a reconnection or immediately terminate the session.
I think a user notification would be great. However, Sprocket itself should not render anything in the browser but should provide some sort of facility for the app developer to handle this condition, and perhaps we could provide a default example in the starter app. For example, Sprocket could attach status classes (e.g. disconnected, reconnecting) to the top-level DOM node where the sprocket component is mounted. It could also provide a JavaScript event that an app could subscribe to.
Sprocket requires a constant network connection to function, as it is simply showing the rendered representation of some server state and relaying events back to the server. To keep state management simple and to prevent unintended delayed effects, I think we should stick to this assumption, and Sprocket should not try to queue any events or messages that occur during a disconnection period. When reconnected, Sprocket should get the client up to speed with the latest server state via a full state refresh and then return to sending client event messages.
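The status classes, the subscribable event and the "drop, don't queue" rule described above could look roughly like this on the client. This is a minimal sketch, not Sprocket's API: the `ConnectionStatus` class, the class names and the "sprocket:status" event name are all assumptions made for illustration.

```javascript
// Sketch only: `ConnectionStatus`, the status names and the
// "sprocket:status" event name are assumptions, not Sprocket's API.
const STATUSES = ["connected", "disconnected", "reconnecting"];

class ConnectionStatus {
  // `root` is the element the component is mounted on (anything with a
  // classList works); `events` is an EventTarget the app can subscribe to.
  constructor(root, events = new EventTarget()) {
    this.root = root;
    this.events = events;
    this.status = "disconnected";
  }

  // Swaps the status class on the root node and notifies subscribers.
  set(status) {
    if (!STATUSES.includes(status)) throw new Error(`unknown status: ${status}`);
    this.root.classList.remove(...STATUSES);
    this.root.classList.add(status);
    this.status = status;
    this.events.dispatchEvent(new Event("sprocket:status"));
  }

  // Sends only while connected; otherwise the message is dropped, not
  // queued. Returns whether the message was handed to the socket.
  send(socket, message) {
    if (this.status !== "connected") return false;
    socket.send(JSON.stringify(message));
    return true;
  }
}
```

An app could then style `.disconnected` and `.reconnecting` in CSS, or listen for the event and read `status` to show its own notification, while Sprocket itself renders nothing.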

Jan 11 '25 16:01 eliknebel

TL;DR: Thanks for the detailed explanation! Having the ability to retry failed messages after reconnection would be really valuable - I've seen this with laptop sleep/wake cycles where socket.send operations fail but could easily recover after reconnect. Also, having a configurable state retention time (from seconds to hours) would help both with quick app switches and longer breaks. For our typical user base (20-50 users), keeping state around should be manageable.

Let me explain why Sprocket caught my attention. It's the closest thing I've seen to a conventional webserver while allowing for real complexity. About 15 years ago, I worked with taconite, a library for DOM manipulation via ajax. We still use this in one of our major applications. Our business focuses on custom software for medium-sized companies, maintaining long-term client relationships.

In the last 2-3 years, we've been using Docker + NATS (for pub/sub, message queues, and storage) with custom CI/CD tooling. While it works, managing 30+ client apps (PWAs using Go-App) and 50+ services (Docker Compose) has become challenging. Our clients frequently request changes - often wanting new features demonstrated within days.

For our next projects, we need something that enables rapid CI/CD - potentially 10+ daily production updates behind feature flags. I'm excited about using BEAM with Gleam (plus some Luerl and possibly NIFs). We want to move away from SPAs where possible, though we'll still need PWAs for offline functionality. We'll probably use Go/Go-App with NATS for those cases, bridging to BEAM instead of relying on message queues in Clickhouse or PostgreSQL.

Discovering BEAM/OTP (and Gleam) has been eye-opening, and finding Sprocket was particularly exciting. While I'm new to this ecosystem, it seems perfect for our needs, and I can get my team started with it.

Regarding state retention - I'm thinking of it like a save system in a game. Sometimes you need full saves, sometimes just checkpoints. Having the server maintain state for various durations would really improve the user experience, whether it's brief mobile app switches or returning from a lunch break.

We're already sponsoring Go-App and Gleam, and we'd love to contribute to Sprocket as well. My colleague and I are currently evaluating how to align these technologies with our vision.

Jan 11 '25 23:01 oderwat

I see that you made quite a lot of changes. I used the newest sprocket_tester binding to 0.0.0.0 and then tested from a different computer and an iPhone. The automatic reconnect seems to work nicely now.

Switching WLAN on/off on the MacBook seems to work, even when "queuing" some events offline. In some crude tests it occasionally reloaded to the initial state, but I could not reproduce when that happens. All in all, this feels fine though.

On the iPhone it is another story. It resets the state when I switch applications or lock the phone, even just for a second. This causes problems with interactions like using a 2FA application or copy/pasting data to/from another app.

What can be done to avoid that?

Maybe also more generally: What could be done to have the server "save" state in between sessions?

Here is a part of the log output from when I was switching WLAN and interacting with the app. I wonder about the message that says that the runtime was not found:

info: GET /connect 200 sent in 322µs
Clock component mounted!
info: GET /connect 200 sent in 321µs
Clock component mounted!
warning: Actor discarding unexpected message: TcpError(//erl(#Port<0.12>), Etimedout)
info: GET /connect 200 sent in 381µs
Clock component mounted!
info: GET /connect 200 sent in 605µs
Clock component mounted!
info: GET /connect 200 sent in 298µs
Clock component mounted!
info: GET /connect 200 sent in 365µs
error: failed to handle websocket message: ["event",{"id":"cm61kbcuu000nx0shvo9yd4uj","kind":"click","payload":{"clientX":140,"clientY":331,"ctrlKey":false,"shiftKey":false,"altKey":false,"metaKey":false}}]
"Sprocket runtime not found. Runtime must be started before handling messages"
Clock component mounted!
info: GET / 200 sent in 1ms
info: GET /connect 200 sent in 275µs
Clock component mounted!
info: GET /favicon.ico 200 sent in 1ms

Jan 18 '25 02:01 oderwat

I made some updates regarding automatic reconnect, which I believe is working much better now. I'm glad that this seems to have resolved some of the issues you originally were seeing.

As for the tab switching on a mobile device, I agree this is problematic specifically for some of the use cases you outlined (e.g. 2FA application or copy/pasting data to/from another app, etc.).

I think the best approach here is to introduce a configurable TTL for server runtime sessions, perhaps defaulting to something reasonable like a few minutes. This will require adding a unique identifier that the client can use on reconnect to try to load an existing, previously disconnected session. We might also be able to use this same approach to keep the runtime session around after the initial render of the view on the first HTTP request, preventing any effects or data loads from having to run twice as they do now (once for the initial "first paint" response, and again when the websocket connects and starts the runtime session).
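To make the TTL idea concrete, here is a minimal sketch of the bookkeeping it implies: sessions are parked under their unique identifier on disconnect and can be resumed until they expire. It is written in plain JavaScript purely for illustration. `SessionRegistry` and its method names are hypothetical, and the real thing would live on the server in Gleam.

```javascript
// Sketch only: `SessionRegistry` is a hypothetical name and this is plain
// JavaScript to show the bookkeeping, not Sprocket's Gleam implementation.
class SessionRegistry {
  // `ttlMs` is how long a disconnected session is kept; `now` is injectable
  // so expiry can be tested without waiting.
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.parked = new Map(); // session id -> { runtime, expiresAt }
  }

  // Called when a client disconnects: keep its runtime around for a while.
  park(id, runtime) {
    this.parked.set(id, { runtime, expiresAt: this.now() + this.ttlMs });
  }

  // Called on reconnect with the id the client presents. Returns the parked
  // runtime, or undefined if it is unknown or expired (start a fresh session).
  resume(id) {
    const entry = this.parked.get(id);
    this.parked.delete(id);
    if (!entry || entry.expiresAt <= this.now()) return undefined;
    return entry.runtime;
  }

  // Drops expired sessions; meant to be called periodically.
  sweep() {
    const t = this.now();
    for (const [id, entry] of this.parked) {
      if (entry.expiresAt <= t) this.parked.delete(id);
    }
  }
}
```

A client that presents an unknown or expired identifier simply falls back to the existing behavior of starting a new session with a full state refresh.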

Jan 18 '25 20:01 eliknebel

After looking into this a bit more, I'm hesitant to add the approach that I outlined above directly to the sprocket core library, as it bakes some assumptions into the library, such as the overhead of keeping connections alive and in memory after a disconnect for some period of time. It also makes scaling out more complex, since there will have to be a single store/lookup table for all these runtimes.

I'm not really sure how to solve this in a flexible way, where the set of runtimes that are connected/disconnected can be stored, without using another actor in the layer between the mist websocket actor and the sprocket runtime actor. Sprocket previously had something similar to this, called the cassette, which held on to running processes for a short period of time after the initial HTML render and then reconnection, but it was more complex, sort of flaky, and didn't allow the web server to easily control the lifecycle of the runtime.

It looks like Phoenix LiveView takes an interesting approach specifically for forms, where after a crash or disconnect, it will call the phx-change handler for that form. Sprocket could take a similar approach where, when remounting a form, if the form has any changes on the client, we send a "change" event which would re-hydrate the state in the runtime. But like Phoenix, this would seem to only be helpful for forms. I would be more interested in adding this sort of approach to the core library.
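A rough sketch of what that form re-hydration could look like on the client is shown here. Everything in it is an assumption: the helper names are hypothetical, and the "change" message layout is only modeled on the click event message in the server log earlier in this thread, not on Sprocket's actual protocol.

```javascript
// Sketch only: the helper names and the "change" message layout are
// assumptions, not Sprocket's actual protocol.

// Returns the fields whose current value differs from the value the server
// last rendered. Each field is { name, value, defaultValue }, which matches
// the properties of text-like <input> and <textarea> elements.
function changedFields(fields) {
  const changed = {};
  for (const { name, value, defaultValue } of fields) {
    if (name && value !== defaultValue) changed[name] = value;
  }
  return changed;
}

// Builds the message to send after remounting, or null if nothing changed.
function rehydrateMessage(formId, fields) {
  const payload = changedFields(fields);
  if (Object.keys(payload).length === 0) return null;
  return ["event", { id: formId, kind: "change", payload }];
}
```

In the browser the fields could come from something like `form.querySelectorAll("input[type=text], textarea")`. Checkboxes and selects track their defaults differently (`defaultChecked`, `defaultSelected`) and would need their own handling.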

If that isn't sufficient, then I think the next best place for something like this is in the mist_sprocket module, which is where the web server currently creates and connects to a runtime session. This is actually not currently part of the core library, but is just checked in to the starter app and can be freely modified by anyone to add this actor and lookup table and keep the state around as long as desired.

I also found a discussion on the Elixir forums about the same issue, and it was suggested (by Jose) to persist the data in a way that can always be reloaded after a disconnect, such as session storage or a database. So that is another viable approach.
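For the session storage variant, a minimal client-side sketch might look like the following. `saveDraft`, `loadDraft` and the key prefix are hypothetical names. The `storage` argument is anything that implements the Web Storage `getItem`/`setItem` interface, such as `window.sessionStorage`.

```javascript
// Sketch only: `saveDraft`/`loadDraft` and the key prefix are hypothetical.
// `storage` is anything with the Web Storage getItem/setItem interface,
// e.g. window.sessionStorage in the browser.
const DRAFT_PREFIX = "sprocket-draft:";

function saveDraft(storage, key, state) {
  storage.setItem(DRAFT_PREFIX + key, JSON.stringify(state));
}

// Returns the saved state, or `fallback` if nothing usable is stored.
function loadDraft(storage, key, fallback = null) {
  const raw = storage.getItem(DRAFT_PREFIX + key);
  if (raw === null) return fallback;
  try {
    return JSON.parse(raw);
  } catch {
    return fallback; // corrupted entry: ignore it
  }
}
```

After a reconnect, the app would load the draft and send it to the freshly started runtime, so the server never has to keep the old session alive.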

Jan 19 '25 23:01 eliknebel

It would be so nice if anything were ever easy :)

To me there are two cases to handle:

The "some seconds" case of switching apps on mobile. I am not sure why that is a problem in the first place. It only seems to happen on the phone (or iPad), but not when I drop the network on the MacBook for a short time, so iOS may be actively dropping the connection. The browser also fires an event when the page goes into the background, so that may be related. While snooping around I found https://developer.mozilla.org/en-US/docs/Web/API/WakeLock, which is supposed to work in mobile Safari and maybe also in a PWA. I can test this with some of our Go-App based PWAs, which use a WebSocket for a NATS connection.
A general way to store the "session", like a save in a game.
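For the first case, the background event is presumably the Page Visibility API's `visibilitychange` event on `document`. One thing that could be tried is reconnecting as soon as the page becomes visible again instead of waiting for a retry timer. This is a minimal sketch, with `reconnectWhenVisible` and its callbacks as hypothetical names. Whether it avoids the state reset on iOS is untested.

```javascript
// Sketch only: `reconnectWhenVisible` and its callbacks are hypothetical.
// `doc` is the browser's `document` (or any EventTarget with a
// `visibilityState` property); `isOpen` reports whether the socket is
// usable; `reconnect` starts a new connection.
function reconnectWhenVisible(doc, isOpen, reconnect) {
  const onChange = () => {
    if (doc.visibilityState === "visible" && !isOpen()) reconnect();
  };
  doc.addEventListener("visibilitychange", onChange);
  // Returns a function that removes the listener again.
  return () => doc.removeEventListener("visibilitychange", onChange);
}
```

In the browser this could be wired up as `reconnectWhenVisible(document, () => socket.readyState === WebSocket.OPEN, attemptReconnect)`.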

I feel a bit useless at this stage, as I am too new to Gleam and most of the related stuff. Thank you for taking the time to think about these problems.

Jan 19 '25 23:01 oderwat