Usually instant query is occasionally extremely slow
Elixir version: 1.17.2
Database and version: PostgreSQL 15.7
Postgrex version: 0.18.0
Current behavior
We have a query doing an ilike search on a table with 44M rows; it's heavily indexed and normally returns in ~100ms.
Recently, we started seeing an issue where a search with fewer than 3 characters can occasionally take more than 15s, hitting our configured Repo timeout:
```
Postgrex.Protocol (#PID<0.5194.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.228259.0> timed out because it queued and checked out the connection for longer than 15000ms
```
In simplified terms, we search on the `clients_patients` table, which holds the names of clients & patients and has an index on the coalesce'd names, so we can do fairly quick text search. We then join on the client & patient tables to get the full results. We limit to 15 records, so the overall data returned from Postgres isn't huge.
The issue is seemingly gone if we remove the JOINs.
- We only observe it with this query
- The query has no dynamic params
- It occurs with the same results (i.e. with `ORDER BY id`)
- It seems to only happen with 1-2 characters being searched; we haven't hit the issue with 3 yet
- We observe it on the master & both read replicas
- We only observe it when running through Ecto:
  - running raw SQL in `psql` is always instant
  - running the raw SQL through `Ecto.Adapters.SQL.query!` is always instant
  - running the Ecto query through `Ecto.Adapters.SQL.explain` with `analyze` and `buffers` is always instant too, so I've been unable to see what the query plan for the "slow" version is
- When manually increasing the timeout (a sketch follows this list), the query does return eventually, taking at least 30s, but often much more
- The long-running query is not observed in the `pg_stat_activity` table
By adding lots of `IO.inspect` calls, we traced it to this line where the `msg_send` takes place, but it gets stuck waiting on `recv_bind`.
On a fresh Kubernetes pod (i.e. a pod that hasn't had this query run, though others will have), the issue doesn't start until the query has been executed ~200 times; after that it's fairly regular: 1 in 10-30 attempts will hit the timeout.
I also checked:
- `Process.info(self(), :message_queue_len)` while it's waiting to receive the data, and it's `0` every time, on both fast and slow attempts
- No blocking queries in the database while the issue is occurring
- Indexes are valid
- This issue happens on all pods, including ones that receive no/very little traffic and have little going on besides Phoenix and Oban, with not much happening in either
- Postgres stats seem healthy overall
Expected behavior
Would expect the query to be consistently fast 😅
Resolution
The issue went away after I added the following to all Repo configs:
```elixir
parameters: [plan_cache_mode: "force_custom_plan"]
```
Not sure what the issue is at the moment, but just wanted to give an anecdote.
I had a similar issue in the past and it turned out that the planner was choosing different indexes across different analyze runs. This is because my data was so big and the values so skewed in certain directions that different samples made it think different query paths were cheaper.
When you say your table is heavily indexed, it makes me think this may be worth looking into. The more indexes you have, the more chances the planner has of choosing the wrong one.
You might not see it in psql unless you run the query/analyze a lot of times in a row. Or this might not be the issue at all. But just wanted to give you something else to consider.
@tomasz-tomczyk can you try with these two configs and let us know if it changes things?
https://github.com/elixir-ecto/postgrex/issues/497#issuecomment-591661154
Btw, thank you for the well written report. :)
> You might not see it in psql unless you run the query/analyze a lot of times in a row. Or this might not be the issue at all. But just wanted to give you something else to consider.
I know what you mean, we hit it too in the past. However, I ran it several thousand times in a loop looking for any instance taking more than 5s and never hit it :(
> can you try with these two configs and let us know if it changes things?
Thanks for the suggestion - I assume I can pass these to the `Repo.all()` call? e.g.

```elixir
Repo.all(q,
  prepare: :named,
  parameters: [plan_cache_mode: "force_custom_plan"]
)
```
If that's the case, that unfortunately didn't seem to help.
It may have something to do with the fact that, due to the lack of ORDER BY, we get different results; I'm trying to see if there's a correlation between taking a long time and what's contained in the result set.
> Thanks for the suggestion - I assume I can pass these to the `Repo.all()` call? e.g.
@tomasz-tomczyk can you try setting it under your repo configuration in `config/*.exs`, just in case?
For the record, I added `ORDER BY id` just to get consistent results back and I was still hitting the issue.
@josevalim I've added it to the repo configs and haven't been able to reproduce it just yet - including running x1000 in a loop on 7 different pods at the same time, so I'm cautiously optimistic... 🤞
Not sure I fully understand what this does, but reading the linked thread and the references there, it might have side effects on other queries - I'll monitor the app today and over the weekend and report back next week if anything happens.
@tomasz-tomczyk I would try both options in isolation and see the impact of each one. But basically, PostgreSQL is choosing a poor query strategy, probably based on the connection cache, and the option forces it to not do that.
Also, how big is your pool size? That could help explain why it takes about 200 queries for the issue to appear:
- It seems PostgreSQL caches the query plan after about 5 queries
- If you have a pool size of 30-40, you will need about 150-200 queries for some connections to start using the cached plan
- If some of those connections have cached a poor plan, it will show up every x queries (as you go around the pool)
I was seeing the same issue and tried setting the above properties. It seems to work fine in dev, but in prod (v15 and later v16.4) I get `** (Postgrex.Error) ERROR 08P01 (protocol_violation) unsupported startup parameter: plan_cache_mode`. Any clue what I'm doing wrong?
What is your production database? Is it actual Postgres or something that talks Postgres? I can confirm the setting exists as far back as PG 12: https://www.postgresql.org/docs/12/runtime-config-query.html
It is managed by our cloud provider. Guess I'll need to ask them... Thanks @josevalim
> @tomasz-tomczyk I would try both options in isolation and see the impact of each one. But basically, PostgreSQL is choosing a poor query strategy, probably based on the connection cache, and the option forces it to not do that.
I've tried them on their own:
- `prepare: :named` on its own did not seem to affect my issue; I was able to replicate as expected in ~250 attempts
- `parameters: [plan_cache_mode: "force_custom_plan"]` on its own did seem to fix my issue; not able to replicate in 1000s of attempts
I was going to leave just the latter in, unless you'd recommend keeping the `prepare: :named` option for other reasons?
> Also, how big is your pool size? That could help explain why it takes about 200 queries for the issue to appear:
> - It seems PostgreSQL caches the query plan after about 5 queries
> - If you have a pool size of 30-40, you will need about 150-200 queries for some connections to start using the cached plan
> - If some of those connections have cached a poor plan, it will show up every x queries (as you go around the pool)
That assessment seems spot on. We started with a pool size of 15; the team initially assumed we were hitting pool capacity and, over the lifetime of this issue, expanded it to 50 - and now I was able to replicate the issue after ~250 attempts, so you're exactly right. It also explains why I was able to replicate it faster earlier on, before the team changed the pool size.
Not sure if this is the right place for feedback, but it'd be extremely helpful to know whether a query timed out waiting for a connection in the pool vs timing out in Postgres (if that's possible).
I'd be happy closing this issue. Thanks a lot for the input!
Do we think it'd be helpful/possible to pass the `plan_cache_mode` option on a per-query basis, since the linked references suggest there might be side effects? FWIW, I haven't spotted anything that stood out in our Sentry exceptions since this was initially implemented, so I'm OK keeping it global in our case - then I never have to think about this setting again!
> `prepare: :named` on its own did not seem to affect my issue; I was able to replicate as expected in ~250 attempts
To be clear, `prepare: :named` is the default. You should use `prepare: :unnamed`. :)
> Not sure if this is the right place for feedback, but it'd be extremely helpful to know whether a query timed out waiting for a connection in the pool vs timing out in Postgres (if that's possible).
Please open up an issue and thanks for the follow up!