
Feature request: Need a public interface to stop the Electric Connection

Open · evanob opened this issue 3 months ago • 5 comments

During a rolling deploy, the old node holds its advisory lock (which ensures there is only one consumer of the replication slot) while the new node waits for that lock to be released. The new node starts serving traffic but will respond with 503s until the old node is killed, which could take minutes.

We would like to be able to detect when a new node is deploying (e.g. with libcluster), and then automatically stop the Electric Connection on the old node, thus releasing the lock.

For now, we are using GenServer.whereis to find the Connection.Manager and calling GenServer.stop on it, but this doesn't feel quite right.
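
For concreteness, the workaround looks roughly like this (a minimal sketch; the registered name of the manager process is an assumption and may vary by Electric version and configuration):

```elixir
# Minimal sketch of the workaround described above. Assumes the manager is
# registered under the module name Electric.Connection.Manager; the actual
# registered name may differ by Electric version/configuration.
case GenServer.whereis(Electric.Connection.Manager) do
  # No manager running on this node, so nothing to stop.
  nil -> :ok
  # Stopping the manager releases the advisory lock on the replication slot.
  pid -> GenServer.stop(pid)
end
```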

evanob · Oct 08 '25 08:10

Hey @evanob 👋

Could you elaborate a bit on your setup? I'm curious why this

The new node starts serving traffic but will respond with 503s until the old node is killed, which could take minutes.

is working that way. I would expect that your load balancer keeps routing requests to the old node, since the new one doesn't return a 200 OK from its health endpoint until it has acquired the lock and finished initializing.

Sounds like you're using Electric in library mode so you can build a rolling-deployment layer with libcluster on top?

alco · Oct 08 '25 09:10

We're running in embedded mode. Not sure if that is the same as library mode?

We don't require that the node has acquired the lock to start serving traffic, at least not yet. Since we roll out in ECS (with min 1 server, max 2), we would otherwise end up in a sort of deadlock: the old server won't be killed until the new server is healthy, and the new server won't be healthy until the old server is killed (or at least until it releases the lock, which is why I'm looking for such a mechanism).

We're also still gradually adopting Electric, and haven't yet worked out a reliable deployment/handover strategy, so we don't want to make lock acquisition a hard requirement for our backend health.

evanob · Oct 08 '25 09:10

@evanob Curious to learn more about your setup. I am doing something similar and have the same issue. Right now, our app just serves 503s during the deployment. Since the client just retries, it is not a major issue.

jay-babu · Oct 08 '25 15:10

@jay-babu Right now, we serve 503s during a deploy too, but we're trying to reduce the time spent serving them.

We're using libcluster to create a temporary cluster during deploy, and we add our own connect_node function, where we stop the running Electric.Connection.Manager. This frees the new node to take the lock. This works ok, at least on the happy path. We're also reducing the long_poll_timeout from 20s down to 10s, but we're unlikely to keep this in the long run.
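
Roughly, the hook looks like this (a hedged sketch; MyApp.ClusterHandover and the Gossip strategy are illustrative stand-ins, not our exact setup):

```elixir
# config/runtime.exs — illustrative libcluster topology with a custom
# connect function; libcluster appends the node to connect as the last arg.
config :libcluster,
  topologies: [
    deploy_handover: [
      strategy: Cluster.Strategy.Gossip,
      connect: {MyApp.ClusterHandover, :connect_node, []}
    ]
  ]
```

```elixir
defmodule MyApp.ClusterHandover do
  # Wraps the default connect: when the new node appears during a deploy,
  # this (old) node stops its Electric connection manager, releasing the
  # advisory lock so the new node can take it.
  def connect_node(node) do
    result = :net_kernel.connect_node(node)

    # Registered name is an assumption; see the sketch earlier in the thread.
    case GenServer.whereis(Electric.Connection.Manager) do
      nil -> :ok
      pid -> GenServer.stop(pid)
    end

    result
  end
end
```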

evanob · Oct 09 '25 10:10

Now that the database connection scaledown is in main, we have a new function, Electric.Connection.Restarter.stop_connection_subsystem(<stack_id>), that's stable enough. It will stop the connection manager process. Note that an incoming shape request will start the connection manager back up, so you really need to have the new instance already waiting on the lock so that it can grab it immediately.

<stack_id> should be "single_stack" unless you set it explicitly via the stack_id or provided_database_id config option.
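
A minimal usage sketch, based on the description above and assuming the default single-stack configuration:

```elixir
# Stop the connection subsystem on the old node so the instance waiting on
# the advisory lock can grab it immediately. "single_stack" is the default
# stack id unless stack_id / provided_database_id is set explicitly.
Electric.Connection.Restarter.stop_connection_subsystem("single_stack")
```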

alco · Oct 10 '25 21:10