Usually instant query is occasionally extremely slow
Elixir version: 1.17.2
Database and version: PostgreSQL 15.7
Postgrex version: 0.18.0
Current behavior
We have a query doing an ilike search on a table with 44M rows; it's heavily indexed and normally returns in ~100ms.
Recently, we started seeing an issue where a search with fewer than 3 characters can occasionally take more than 15s, hitting our configured Repo timeout:
```
Postgrex.Protocol (#PID<0.5194.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.228259.0> timed out because it queued and checked out the connection for longer than 15000ms
```
In simplified terms, we search on the `clients_patients` table, which holds the names of clients & patients and has an index on the coalesce'd names, so we can do fairly quick text search. We then join on the client & patient tables to get the full results. We limit to 15 records, so the overall data returned from Postgres isn't huge.
The issue is seemingly gone if we remove the JOINs.
- We only observe it with this query
- The query has no dynamic params
- It occurs with the same results (i.e. with `ORDER BY id`)
- It seems to only happen with 1-2 characters being searched; we haven't hit the issue with 3 yet
- We observe it on the master & both read replicas
- We only observe it when running through Ecto:
  - running raw SQL in `psql` is always instant
  - running the raw SQL through `Ecto.Adapters.SQL.query!` is always instant
  - running the Ecto query through `Ecto.Adapters.SQL.explain` with `analyze` and `buffers` is always instant too, so I've been unable to see what the query plan for the "slow" version is
- When manually increasing the timeout (a sketch follows this list), the query does return eventually, taking at least 30s, but often much more
- The long-running query is not observed in the `pg_stat_activity` table
By adding lots of `IO.inspect` calls, we traced it to this line where the `msg_send` takes place, but it gets stuck waiting on `recv_bind`.
On a fresh Kubernetes pod (i.e. a pod that hasn't had this query run, though others will have), the issue doesn't start until the query has been executed ~200 times; after that it's fairly regular: 1 in 10-30 attempts will hit the timeout.
I also checked:
- `Process.info(self(), :message_queue_len)` while it's waiting to receive the data, and it's `0` every time, on both fast and slow attempts
- No blocking queries in the database while the issue is occurring
- Indexes are valid
- This issue happens on all pods, including ones that receive no/very little traffic and have little going on besides Phoenix and Oban, with not much happening in either
- Postgres stats seem healthy overall
Expected behavior
Would expect the query to be consistently fast 😅
Resolution
The issue went away after I added the following to all Repo configs:
```elixir
parameters: [plan_cache_mode: "force_custom_plan"]
```
Not sure what the issue is at the moment, but just wanted to give an anecdote.
I had a similar issue in the past and it turned out that the planner was choosing different indexes across different analyze runs. This is because my data was so big and the values so skewed in certain directions that different samples made it think different query paths were cheaper.
When you say your table is heavily indexed, it makes me think this may be worth looking into. The more indexes you have, the more chances the planner has of choosing the wrong one.
You might not see it in psql unless you run the query/analyze a lot of times in a row. Or this might not be the issue at all. But just wanted to give you something else to consider.
@tomasz-tomczyk can you try with these two configs and let us know if it changes things?
https://github.com/elixir-ecto/postgrex/issues/497#issuecomment-591661154
Btw, thank you for the well written report. :)
> You might not see it in psql unless you run the query/analyze a lot of times in a row. Or this might not be the issue at all. But just wanted to give you something else to consider.
I know what you mean, we hit it too in the past. However, I ran it several thousand times in a loop looking for any instance taking more than 5s and never hit it :(
> can you try with these two configs and let us know if it changes things?
Thanks for the suggestion - I assume I can pass these to the `Repo.all()` call? e.g.

```elixir
Repo.all(q,
  prepare: :named,
  parameters: [plan_cache_mode: "force_custom_plan"]
)
```
If that's the case, that unfortunately didn't seem to help.
It may have something to do with the fact that, due to the lack of ORDER BY, we get different results; I'm trying to see if there's a correlation between taking a long time and what's contained in the result set.
> Thanks for the suggestion - I assume I can pass these to the `Repo.all()` call? e.g.
@tomasz-tomczyk can you try setting it under your repo configuration in `config/*.exs`, just in case?
For the record, I added `ORDER BY id` just to get consistent results back and I was still hitting the issue.
@josevalim I've added it to the repo configs and haven't been able to reproduce it just yet - including running x1000 in a loop on 7 different pods at the same time, so I'm cautiously optimistic... 🤞
Not sure I fully understand what this does, but reading the linked thread and the references there, it might have side effects on other queries - I'll monitor the app today and over the weekend and report back next week if anything happens.
@tomasz-tomczyk I would try both options in isolation and see the impact of each one. But basically, PostgreSQL is choosing a poor query strategy, probably based on the connection cache, and the option forces it to not do that.
Also, how big is your pool size? That could help explain why it takes about 200 queries for the issue to appear:
- It seems PostgreSQL caches the query plan after about 5 queries
- If you have a pool size of 30-40, you will need about 150-200 queries for some connections to start using the cached plan
- If some of those connections have cached a poor plan, it will show up every x queries (as you go around the pool)
I was seeing the same issue and tried setting the above properties. It seems to work fine in dev, but in prod (v15 and later v16.4) I get `** (Postgrex.Error) ERROR 08P01 (protocol_violation) unsupported startup parameter: plan_cache_mode`. Any clue what I'm doing wrong?
What is your production database? Is it actual Postgres or something that talks Postgres? I can confirm the setting exists as far back as PG 12: https://www.postgresql.org/docs/12/runtime-config-query.html
It is managed by our cloud provider. Guess I'll need to ask them... Thanks @josevalim
> @tomasz-tomczyk I would try both options in isolation and see the impact of each one. But basically, PostgreSQL is choosing a poor query strategy, probably based on the connection cache, and the option forces it to not do that.
I've tried them on their own:
- `prepare: :named` on its own did not seem to affect my issue; I was able to replicate as expected in ~250 attempts
- `parameters: [plan_cache_mode: "force_custom_plan"]` on its own did seem to fix my issue; not able to replicate in 1000s of attempts
I was going to leave just the latter in, unless you'd recommend keeping the `prepare: :named` option for other reasons?
> Also, how big is your pool size? That could help explain why it takes about 200 queries for the issue to appear:
> - It seems PostgreSQL caches the query plan after about 5 queries
> - If you have a pool size of 30-40, you will need about 150-200 queries for some connections to start using the cached plan
> - If some of those connections have cached a poor plan, it will show up every x queries (as you go around the pool)
That assessment seems spot on. We started with a pool size of 15; the team initially assumed we were hitting pool capacity and, over the lifetime of this issue, expanded it to 50 - and now I was able to replicate the issue after ~250 attempts, so you're exactly right. It also explains why I was able to replicate it faster earlier on, before the team changed the pool size.
Not sure if this is the right place for feedback, but it'd be extremely helpful to know whether a query timed out waiting for a connection in the pool vs timing out in Postgres (if that's possible).
I'd be happy closing this issue. Thanks a lot for the input!
Do we think it'd be helpful/possible to pass the `plan_cache_mode` option on a per-query basis, since the linked references suggest there might be side effects? FWIW, I haven't spotted anything that stood out in our Sentry exceptions since this was initially implemented, so I'm OK keeping it global in our case - then I never have to think about this setting again!
> `prepare: :named` on its own did not seem to affect my issue; I was able to replicate as expected in ~250 attempts
To be clear, `prepare: :named` is the default. You should use `prepare: :unnamed`. :)
> Not sure if this is the right place for feedback, but it'd be extremely helpful to know whether a query timed out waiting for a connection in the pool vs timing out in Postgres (if that's possible).
Please open up an issue and thanks for the follow up!