Refactor and fix connection initialization/reconnection backoff
The highlight of this PR is the refactoring of the Connection.Manager module to make the state machine it implements more explicit and the transitions between its steps better documented.
Not too long ago we made the initialization of ReplicationClient asynchronous, which was necessitated by the possibility of its startup getting blocked on a pending DB transaction. This PR makes LockConnection's initialization async as well, for consistency. Both types of processes now return early from their respective init() callbacks and then cast messages to the connection manager as they reach each new milestone (connected, lock_acquired, ready_to_stream, streaming_started).
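A minimal sketch of that pattern, with hypothetical message names (the actual messages Connection.Manager expects differ):

```elixir
defmodule LockConnectionSketch do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Return early so that the supervisor starting this process is not
    # blocked on the database; the real work happens in handle_continue/2.
    {:ok, %{manager: Keyword.fetch!(opts, :connection_manager)}, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, state) do
    # ...connect to Postgres and acquire the advisory lock (elided)...
    # Report the milestone to the connection manager asynchronously.
    GenServer.cast(state.manager, :lock_acquired)
    {:noreply, state}
  end
end
```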
What prompted this refactoring was a new failure state we had discovered. If a pending transaction blocks Electric's initialization on first launch and Electric is then shut down, Postgres will still have a running backend blocked on the CREATE_REPLICATION_SLOT statement. When Electric starts again, it will be able to complete the full connection setup (because the replication slot has already been created in Postgres) but will fail when switching ReplicationClient into streaming mode, because the slot is already in active use by a different OS process (the walsender backend in Postgres left over from Electric's first launch).
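For context, the blocking precondition can be reproduced by holding a write transaction open in a separate session: logical slot creation waits for all in-progress transactions to finish, so CREATE_REPLICATION_SLOT blocks. A sketch using Postgrex, with placeholder connection parameters and table name:

```elixir
# Hold a write transaction open indefinitely so that a concurrent
# CREATE_REPLICATION_SLOT has no consistent snapshot point and blocks.
{:ok, conn} =
  Postgrex.start_link(
    hostname: "localhost",
    username: "postgres",
    password: "password",
    database: "electric"
  )

Postgrex.transaction(
  conn,
  fn tx ->
    # Any write assigns the transaction an xid, which is what blocks slot creation.
    Postgrex.query!(tx, "INSERT INTO items (value) VALUES ($1)", ["pending"])
    Process.sleep(:infinity)
  end,
  timeout: :infinity
)
```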
Electric's failure to handle this specific scenario resulted in it getting stuck in a reconnection loop without exponential backoff, retrying every 2 seconds. It also caused warning logs to snowball, with a new periodic logger call added on every reconnection attempt, because there was no way to keep track of the periodic connection status checks and the logs they were emitting.
The PR fixes both the runaway logging and the error handling of this new failure state, which is now simulated in the new integration test defined in replication-client-self-conflict.lux.
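For reference, exponential backoff between reconnection attempts typically looks like the following sketch (illustrative values and names, not the exact implementation in this PR):

```elixir
defmodule BackoffSketch do
  @base_ms 500
  @max_ms 30_000

  # The delay doubles with every consecutive failure, capped at @max_ms,
  # with up to 25% jitter added so retries don't happen in lockstep.
  def next_delay(consecutive_failures) do
    delay = min(@max_ms, @base_ms * Integer.pow(2, consecutive_failures))
    delay + :rand.uniform(div(delay, 4) + 1)
  end
end
```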
A new unit test is added that shows the minimum required configuration for starting a Connection.Supervisor in isolation. Trimming it down further to start only the Connection.Manager process would require some mocking, which I dislike.
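The shape of that test is roughly as follows; the option names here are placeholders, see the actual test for the real minimum configuration:

```elixir
defmodule Electric.Connection.SupervisorTest do
  use ExUnit.Case, async: false

  test "starts the connection supervision tree in isolation" do
    # Placeholder options; the real minimum config lives in the test added by this PR.
    opts = [
      stack_id: "test_stack",
      connection_opts: [
        hostname: "localhost",
        port: 5432,
        username: "postgres",
        password: "password",
        database: "electric_test",
        ssl: false
      ]
    ]

    assert {:ok, pid} = start_supervised({Electric.Connection.Supervisor, opts})
    assert Process.alive?(pid)
  end
end
```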
As a nice bonus, I discovered a bug in how the connection options and pool options were read from the application env: the app env wasn't actually being consulted. This is probably not a problem in Electric Cloud, but it did result in the ELECTRIC_QUERY_DATABASE_URL setting being ignored.
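The usual pattern here is to fall back to the application env when an option isn't passed explicitly; roughly like this, with illustrative opt and env key names:

```elixir
defmodule ConfigSketch do
  # Illustrative only: the actual opt and env keys used by Electric may differ.
  # Prefer explicitly passed options, but fall back to the application env so
  # that settings derived from ELECTRIC_QUERY_DATABASE_URL are honored.
  def connection_opts(opts) do
    Keyword.get(opts, :connection_opts) ||
      Application.fetch_env!(:electric, :connection_opts)
  end

  def pool_opts(opts) do
    Keyword.get(opts, :pool_opts) || Application.get_env(:electric, :pool_opts, [])
  end
end
```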
For code review, I suggest opening the updated code of Connection.Manager and reading it top down. This should be easier than trying to make sense of the diff view.
I wanted to note that with LockConnection's initialization now also async, the following error may get logged more frequently:
```
17:23:08.783 pid=<0.386.0> [debug] Starting replication client for stack single_stack
17:23:08.795 pid=<0.697.0> [error] :gen_statem {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Postgres.ReplicationClient, nil}} terminating
** (Postgrex.Error) ssl not available
(stdlib 6.2) gen_statem.erl:3864: :gen_statem.loop_state_callback_result/11
(stdlib 6.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Process Label: {:replication_client, "single_stack"}
Queue: [internal: {:connect, :init}]
Postponed: []
State: {:no_state, %Postgrex.ReplicationConnection{protocol: nil, state: {Electric.Postgres.ReplicationClient, %Electric.Postgres.ReplicationClient.State{stack_id: "single_stack", connection_manager: #PID<0.386.0>, transaction_received: {Electric.Replication.ShapeLogCollector, :store_transaction, [{:via, Registry, {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Replication.ShapeLogCollector, nil}}}]}, relation_received: {Electric.Replication.ShapeLogCollector, :handle_relation_msg, [{:via, Registry, {:"Elixir.Electric.ProcessRegistry:single_stack", {Electric.Replication.ShapeLogCollector, nil}}}]}, publication_name: "electric_publication_default", try_creating_publication?: true, start_streaming?: false, slot_name: "electric_slot_default", slot_temporary?: false, display_settings: ["SET bytea_output = 'hex'", "SET DateStyle = 'ISO, DMY'", "SET TimeZone = 'UTC'", "SET extra_float_digits = 1", "SET IntervalStyle = 'iso_8601'"], origin: "postgres", txn_collector: %Electric.Postgres.ReplicationClient.Collector{transaction: nil, tx_op_index: nil, relations: %{}}, step: :disconnected, applied_wal: 0}}, auto_reconnect: false, reconnect_backoff: 500, streaming: nil}}
Callback mode: :handle_event_function, state_enter: false
17:23:08.799 pid=<0.386.0> [warning] Failed to connect to the database using SSL. Trying again, using an unencrypted connection.
```
We need to make sure it's ignored in Sentry. This error is logged when DATABASE_URL or ELECTRIC_QUERY_DATABASE_URL is configured without sslmode, or with sslmode=prefer/require, while the Postgres cluster is not configured to serve connections over SSL. The LockConnection process exits, and the exit reason is logged by the OTP machinery separately from our own handling of that exit.
The reason you may not have seen these logs before is that LockConnection's initialization used to be synchronous, so this error was returned from its start_link() call instead of getting logged by OTP. Moreover, when only DATABASE_URL was configured, Electric would fall back to an unencrypted connection once, so by the time it got to asynchronously initializing the ReplicationClient process, its connection opts would already have ssl: false.