
Requests dropped with connections available

Open pepicrft opened this issue 10 months ago • 15 comments

Hi 👋🏼,

I'm not sure this is the right place to bring this up, so let me know if it's not.

I've been trying to debug this issue for quite some time:

** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2595ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:

  1. Ensuring your database is available and that you can connect to it
  2. Tracking down slow queries and making sure they are running fast enough
  3. Increasing the pool_size (although this increases resource consumption)
  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval

See DBConnection.start_link/2 for more information
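For reference, suggestions 3 and 4 map onto repo configuration along these lines — a sketch with illustrative names and values (`:my_app`/`MyApp.Repo` are placeholders; the documented defaults are `queue_target: 50` and `queue_interval: 1000`, both in milliseconds):

```elixir
# config/runtime.exs — illustrative values only; MyApp.Repo is a placeholder
config :my_app, MyApp.Repo,
  pool_size: 20,         # suggestion 3: more connections, at higher resource cost
  queue_target: 500,     # suggestion 4: target wait time in ms (default: 50)
  queue_interval: 5_000  # suggestion 4: window in ms over which the target is checked (default: 1_000)
```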

We followed the classic recommendations that you get there and that everyone mentions: make sure your queries are fast, make sure your connection is stable, and make sure you don't exhaust the pool with many queries. We optimized every optimizable query, ensured that the connection was stable, and checked that the pool was not exhausted, and despite all of that, we are still getting the error sporadically. We collected some telemetry around the status of the pool, and as you can see, the connection pool is generally available, with some bursts here and there that keep it busy; other than that, I don't see why queries would be dropped. In fact, the most recent drop happened after 17:00.

Image

So if the connection is stable, the pool is available, and queries are dropped from the queue, what else could get in between a query being in the queue, and a query being checked out? I feel there's something obvious that I'm not seeing.

pepicrft avatar Apr 17 '25 15:04 pepicrft

Those sudden drops are most likely reflecting a loss of connectivity to the database itself. We ping connections every second or so, and if the database does not respond, or does not respond in time, you will see such drops. And then you cannot check out anything (as there are no alive connections). Could that explain it?

josevalim avatar Apr 17 '25 16:04 josevalim

Those sudden drops are most likely reflecting a loss of connectivity to the database itself. We ping connections every second or so, and if the database does not respond, or does not respond in time, you will see such drops. And then you cannot check out anything (as there are no alive connections). Could that explain it?

That'd be the most plausible theory. Is there a way to peek into those attempts and send telemetry events when that happens? That'd help us validate whether that's the case.

pepicrft avatar Apr 17 '25 16:04 pepicrft

Fork postgrex and send a PR to instrument Postgrex.Protocol.ping. :) Run it in prod and if you get good data from it, we can include it by default most likely. It sounds like something worth instrumenting.

josevalim avatar Apr 17 '25 16:04 josevalim

Thanks @josevalim. I'll do that. I assume you'd instrument it with :telemetry? I had a look at the code, and I'm a bit confused about where the event should go and what it should include:

@impl true
@spec ping(state) ::
        {:ok, state}
        | {:disconnect, Postgrex.Error.t() | %DBConnection.ConnectionError{}, state}
def ping(%{postgres: :transaction, transactions: :strict} = s) do
  sync_error(s, :transaction)
end

def ping(%{buffer: buffer} = s) do
  status = new_status([], mode: :transaction)
  s = %{s | buffer: nil}

  case msg_send(s, msg_sync(), buffer) do
    :ok when buffer == :active_once ->
      ping_recv(s, status, :active_once, buffer)

    :ok when is_binary(buffer) ->
      ping_recv(s, status, nil, buffer)

    {:disconnect, _, _} = dis ->
      dis
  end
end
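A sketch of what that instrumentation could look like, wrapping the branches above with `:telemetry.execute` — the `[:postgrex, :ping]` event name and its metadata are invented here for illustration and are not part of Postgrex:

```elixir
# Hypothetical sketch only: the event name and metadata are made up;
# msg_send/3, msg_sync/0 and ping_recv/4 are the existing private
# helpers from Postgrex.Protocol shown above.
def ping(%{buffer: buffer} = s) do
  status = new_status([], mode: :transaction)
  s = %{s | buffer: nil}
  start = System.monotonic_time()

  result =
    case msg_send(s, msg_sync(), buffer) do
      :ok when buffer == :active_once -> ping_recv(s, status, :active_once, buffer)
      :ok when is_binary(buffer) -> ping_recv(s, status, nil, buffer)
      {:disconnect, _, _} = dis -> dis
    end

  :telemetry.execute(
    [:postgrex, :ping],
    %{duration: System.monotonic_time() - start},
    %{status: elem(result, 0)}
  )

  result
end
```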

Thanks in advance 🙏🏼

pepicrft avatar Apr 17 '25 16:04 pepicrft

Actually, looking at the code, if the ping fails, we already emit [:db_connection, :disconnected] if you are using DBConnection.TelemetryListener. Can you try that instead?
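For completeness, a minimal handler sketch for that event; the handler id and log message are arbitrary, and this assumes the `:telemetry` library that DBConnection already depends on:

```elixir
# Sketch: subscribe to the disconnect event emitted via
# DBConnection.TelemetryListener and log its metadata.
# The handler id "log-db-disconnects" is arbitrary.
:telemetry.attach(
  "log-db-disconnects",
  [:db_connection, :disconnected],
  fn _event, _measurements, metadata, _config ->
    require Logger
    Logger.warning("db_connection disconnected: #{inspect(metadata)}")
  end,
  nil
)
```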

josevalim avatar Apr 17 '25 16:04 josevalim

I have those plugged into :prom_ex, and while I see the [:db_connection, :connected], I don't see any [:db_connection, :disconnected] ones. I'll share some metrics that I'm using to spot what the issue could be:

Pool status and queue times (p99)

You can see that there are moments when we keep all the connections in the pool busy. The increase in the queue time in those outliers correlates with the moments when the queue is busy.

Image

Execution time and total time (p99)

You can see that there are some outliers in query times, some of which go even above 20s. I'd expect those to be slow queries, but none shows in Supabase.

Image

Further thoughts

We have API endpoints that are hit in bursts. We did some optimizations, but there are some queries that we can't get rid of, so I wonder if it could be the pool design that's unable to gracefully handle scenarios where the pool is filled quickly in a short period of time.

pepicrft avatar Apr 17 '25 17:04 pepicrft

Ok, so try changing this function instead (it would be the same as changing Postgrex): https://github.com/elixir-ecto/db_connection/blob/master/lib/db_connection/connection.ex#L172

Emit one telemetry event on each branch, saying whether the ping succeeded or failed.

josevalim avatar Apr 17 '25 18:04 josevalim

Ok, I just read the rest of your comment. Something else to consider is whether the machine is overloaded. If you are having bursts and those are causing a bunch of work everywhere else, there will be less time available to deal with connections. Make sure to also monitor your run queues. You can also set pool_count: 2 on the Ecto side to have two pools, so the load can be distributed across them.
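The `pool_count` suggestion is a one-line config change — a sketch, with placeholder app/repo names, assuming a reasonably recent ecto_sql that supports `:pool_count`:

```elixir
# Sketch: two independent pools behind one repo, 10 connections each,
# so bursts are distributed across two pool processes.
config :my_app, MyApp.Repo,
  pool_size: 10,
  pool_count: 2
```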

josevalim avatar Apr 17 '25 18:04 josevalim

Could you expand on the "overloading" part? CPU and memory-wise the machines are fine around those bursts.

For some reason, the queries/requests assigned to the connection don't run, or take a long time to reach the "running" state (maybe network?). The other thing I was thinking about is whether there's a deadlock between connections that causes them to wait on each other, but I couldn't find anything in the code, nor did I get any errors.

pepicrft avatar Apr 18 '25 06:04 pepicrft

Measure the run queues for the overloading bits. And make sure that the VM is getting as many cores as the machine provides (see erlang:system_info or your Phoenix dashboard). Also, if you are using shared CPUs, you may see low CPU usage while the machine is busy anyway.
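The run queues can be sampled with the Erlang statistics BIFs; persistently non-zero values under load point to scheduler contention. A minimal sketch:

```elixir
# Total number of runnable processes waiting across all normal schedulers.
:erlang.statistics(:run_queue)

# Per-scheduler breakdown of run queue lengths (includes dirty schedulers).
:erlang.statistics(:run_queue_lengths_all)

# How many schedulers are actually online.
:erlang.system_info(:schedulers_online)
```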

Connections are independent and the pool is concurrent, but setting pool_count: 2 would help distribute the load if you are worried about something bottlenecking it.

josevalim avatar Apr 18 '25 06:04 josevalim

Thanks! I'll do that. How can I measure the run queue?

We use a dedicated machine on Fly, so I wouldn't expect resources to be shared.

Connections are independent and the pool is concurrent, but setting pool_count: 2 would help distribute the load if you are worried about something bottlenecking it.

This would reduce the chances of timeouts, but whatever is causing connections not to go through would still remain, no?

pepicrft avatar Apr 18 '25 06:04 pepicrft

Schedulers

:erlang.system_info(:schedulers)
2

Schedulers online

:erlang.system_info(:schedulers_online)
2

Physical cores

flyctl scale show -c fly.toml
Groups
NAME    COUNT   KIND            CPUS    MEMORY  REGIONS
app     1       performance     2       4096 MB fra

pepicrft avatar Apr 18 '25 07:04 pepicrft

@josevalim I continued investigating this, and I noticed connections drop, so by the time we get a burst of requests, they have to be re-established. How can I subscribe to DBConnection telemetry events? I noticed that the docs recommend creating the DBConnection separately, but if I'm not mistaken, the Postgres adapter already creates one. Is it possible to plug the listener into that connection pool?

pepicrft avatar May 07 '25 08:05 pepicrft

Yes, all options passed to your repository are eventually passed down to DBConnection.
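Concretely, that means the listener can be wired up through the repo config — a sketch, where the `MyApp` names are placeholders and `DBConnection.TelemetryListener` assumes db_connection ~> 2.7:

```elixir
# In the application's supervision tree, start the listener before the repo:
children = [
  {DBConnection.TelemetryListener, name: MyApp.DBListener},
  MyApp.Repo
]

# Then point the pool at it via the repo configuration, which Ecto
# forwards down to DBConnection:
config :my_app, MyApp.Repo,
  connection_listeners: [MyApp.DBListener]
```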

josevalim avatar May 07 '25 09:05 josevalim

Thanks @josevalim, then I have it configured right. I checked the emitted events, and I noticed something interesting. There are connections happening sporadically, but no disconnections. Is it possible that the connection drops or becomes unusable without the pool being aware of it, and when it tries to use them it establishes the connections again?

I have the feeling the connections are being dropped due to some connection issue, but I'd have expected the events to show up. They do in the logs:

2025-05-05 18:03:39.979
16:03:39.977 [error] Postgrex.Protocol (#PID<0.4690.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv (idle): timeout
16:03:39.977 [error] Postgrex.Protocol (#PID<0.4698.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv (idle): timeout
16:03:39.977 [error] Postgrex.Protocol (#PID<0.4678.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv (idle): timeout
16:03:39.977 [error] Postgrex.Protocol (#PID<0.4684.0>) disconnected: ** (DBConnection.ConnectionError) ssl recv (idle): timeout
Image

pepicrft avatar May 07 '25 09:05 pepicrft