Preserve active websocket connections when unrelated routes are updated
Issue Details
We run multiple apps behind a single Caddy instance. Some of these are Blazor Server apps, which rely entirely on websockets (even for the UI).
The problem: every time we reload the Caddy config, all websocket connections drop, even if the routes those clients are on didn’t change at all.
Example:
- app-v1 running on port 5000
- later we add app-v2 on port 5001 via a new route
As soon as we push that new route, everyone connected to app-v1 gets kicked. The same thing happens if we update unrelated routes that don’t even use websockets — for example, if we add an unrelated REST API service to the config, connections to websocket apps still get dropped.
We do have reconnect logic, but because Blazor Server depends fully on websockets, the constant reconnects hurt the UX. We’d really prefer connections for unchanged routes to just stay alive.
Would it be possible for Caddy to preserve websocket connections on reload when their route/handler hasn’t changed?
We noticed a similar discussion in #6420, but the conversation there remained unfinished, so it wasn’t clear whether preserving connections across reloads is planned.
You can tune this behaviour with the stream_timeout and stream_close_delay options (see https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#streaming). The default closes websockets immediately so that the old config can be garbage collected to keep memory in check, but you can push that back with those options so that the old config sticks around long enough for connections to cycle out gracefully.
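For example, a minimal Caddyfile sketch (the site address, port, and durations are placeholders):

```
example.com {
	reverse_proxy localhost:5000 {
		# Keep proxied streams (including websockets) open for up to
		# 2 hours after a config reload instead of closing them right away.
		stream_close_delay 2h
		# Optionally force-close any proxied stream after 12 hours total.
		stream_timeout 12h
	}
}
```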
Thanks for the suggestion.
I’ve experimented a little with stream_close_delay. From my understanding, it just keeps the old handler objects in memory a bit longer, giving a short “grace period” for existing connections. But that’s not really what we need — we’re not just trying to avoid simultaneous disconnects.
Our main problem is that websocket connections for routes that haven’t changed are still closed on every config reload, even though the handler itself didn’t change. Ideally, these connections would remain alive until the client actually disconnects, without being forcibly dropped by Caddy.
Do you think this is something that could be addressed via a plugin or extension, or is it too tightly coupled with Caddy’s core config reload logic?
Thanks again for your help!
It's not possible at all. The config is one big thing; any handler inside it is part of the whole. The websocket connection handling is basically a blocking loop inside a goroutine spawned from the reverse_proxy handler, which means anything the reverse_proxy handler has as state, and any references it holds (e.g. the Context, which also holds the whole config), have to be kept around until no more connections are actively using it.
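To illustrate why (this is only a sketch, not Caddy's actual source): proxying a hijacked websocket is a copy loop that blocks until the connection ends, and while that loop is running, the handler value and everything it references stays reachable:

```go
package proxy

import (
	"io"
	"net"
	"sync"
)

// handler stands in for a reverse_proxy handler instance from the old config;
// ctx stands in for the Context that keeps the whole old config reachable.
type handler struct {
	ctx any
}

// proxyWebSocket copies bytes in both directions until either side closes.
// It blocks for the lifetime of the websocket, so h (and therefore ctx and
// the old config) cannot be garbage collected until the client disconnects.
func (h *handler) proxyWebSocket(client, backend net.Conn) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); _, _ = io.Copy(backend, client) }()
	go func() { defer wg.Done(); _, _ = io.Copy(client, backend) }()
	wg.Wait()
	client.Close()
	backend.Close()
}
```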
The only thing we can do is offer mitigations like stream_close_delay. You can set that to a pretty long time (a few hours, idk) and it would prevent websockets from being closed automatically on config reload, giving users a few hours to reconnect naturally at some point within that window.
Ideally, these connections would remain alive until the client actually disconnects, without being forcibly dropped by Caddy.
That's what the stream delay option is for. It gives time for the client to disconnect first.
But, WebSocket clients should reconnect on loss anyway. Not doing so seems fickle and unreliable?
(Ah sorry, Francis pointed out to me that you mentioned the client does reconnect.)
This is just a really tricky thing, so I'm not sure of a better way to handle it.
(jumping in as a colleague of @Claudiu2222)
You can set that to a pretty long time (a few hours, idk) and it would prevent websockets from being closed automatically on config reload (giving users a few hours to reconnect naturally at some point within those few hours).
This is what we will probably end up doing. Some of the applications are point-of-sale and self-ordering kiosk apps, so they don't naturally close/reopen since the Blazor SignalR connection is kept alive even on navigation and users don't close these websites at all during the day. I see how the use of a delay works well for most websocket usage scenarios; our use case is the exception here, where reconnects actually impact user experience and where we don't have naturally occurring closing of connections by our users (an unfortunate combination of constraints 😅), so we will have to find a way to work around it.
Regarding stream_close_delay: if we set it to a few hours for the routes where we use websockets, this will help us close the websockets more gracefully when other routes update (assuming we force the SignalR connection to close at safe spots in the app). However, if we do this, we will have the unintended behavior of keeping the connection open for the delay duration even when we do want it to reset right away, specifically when we update the route for the SignalR-using application itself (e.g. version updates).
The behavior of the websocket resetting when we update the route for the websocket-using apps is great; it works as intended for us. So by adding the delay to help with the resets caused by unrelated route changes, we would break the intended behavior for when we do make changes to the websocket-using routes.
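For illustration, the workaround being considered might look roughly like this in a Caddyfile (hostnames, ports, and durations are placeholders, not our real config):

```
# Websocket-heavy Blazor apps: keep existing streams alive for hours after
# a reload so that unrelated route changes don't drop them.
kiosk.example.com {
	reverse_proxy localhost:5000 {
		stream_close_delay 4h
	}
}

# Unrelated REST API: no delay, so nothing lingers when this route changes.
api.example.com {
	reverse_proxy localhost:6000
}
```

The tradeoff described above remains: the same delay applies whether or not the reload actually touched the websocket route.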
I'm guessing this isn't something that exists, since I didn't find it in the docs, but is there a way to differentiate between a websocket closing caused by an unrelated route change and one caused by a change to the route the websocket is actually using, maybe with different delay periods depending on whether the change came from a different route or not? Or, as another way to work around this issue, is there a way to make Caddy reset the websocket connections on a given route early (without waiting for the delay or for natural closing) using the Caddy API?
No, config reloads are all or nothing; there's no granularity. Even if there were, routes don't have anything to identify them uniquely, so there's logistically no way to compare old to new.
And no, there's no API to force-close existing connections, because that's just not something the general userbase needs; this is the first time we've heard this kind of request. I'm not sure it makes sense though; you should handle something like that at your application layer instead.
so they don't naturally close/reopen since the Blazor SignalR connection is kept alive even on navigation and users don't close these websites at all during the day.
Then it should be no problem that they do close eventually on a delay, since the user will probably be idle by then and the frontend will just automatically reconnect, and they won't notice a thing.
I also encountered this problem when I switched from Envoy to Caddy, and I am now considering whether I should switch back. When Envoy reloads its configuration, WebSocket connections are not disconnected, and I think HTTP/2 connections aren't either. If there are a lot of connections and a reload causes them all to disconnect and then reconnect to the server simultaneously, I'm concerned that this will negatively impact the server's responsiveness at that point.
Curious @zuiwuchang, maybe you would know: if you change your configuration, how does Envoy apply that new configuration to an existing WebSocket connection while it is still active? Would love to solve this. Maybe other projects have some magic solution.
I don't know how Envoy does this, and I haven't studied Caddy in detail, but here is how I might approach it myself:
- Add a reference count to the original service route, and only clean it up when the count reaches 0. After a reload, the original route will no longer accept new requests, but will continue to serve existing requests.
- Make the stream_close_delay effectively permanent, waiting for existing clients to close on their own. Eventually, once all old requests have closed, the old route's count will return to 0 and its resources can be cleaned up.
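A minimal Go sketch of that reference-counting idea (the cleanup hook and all names here are illustrative, not Caddy's actual API):

```go
package refcount

import (
	"sync"
	"sync/atomic"
)

// routeRefs keeps an old route's resources alive until every in-flight
// stream (e.g. a websocket) that still uses them has finished.
type routeRefs struct {
	count   atomic.Int64
	once    sync.Once
	cleanup func() // illustrative hook that would free the old route's resources
}

// newRouteRefs starts with one reference held by the route itself.
func newRouteRefs(cleanup func()) *routeRefs {
	r := &routeRefs{cleanup: cleanup}
	r.count.Store(1)
	return r
}

// acquire is called when a stream starts using the route.
func (r *routeRefs) acquire() { r.count.Add(1) }

// release is called when a stream ends, and once by the route itself when a
// reload retires it; the final release triggers cleanup exactly once.
func (r *routeRefs) release() {
	if r.count.Add(-1) == 0 {
		r.once.Do(r.cleanup)
	}
}
```

The catch, as noted earlier in the thread, is that each retained old handler also pins its Context and thus the whole old config, so never-expiring references could let memory grow across many reloads.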
@francislavoie What do you think of a reference count? I suppose we could look into that.