pgapp icon indicating copy to clipboard operation
pgapp copied to clipboard

Extremely slow queries with timeouts can incapacitate the pool

Open Licenser opened this issue 8 years ago • 4 comments

Running extremely slow querries, that time out can result in the entire pool becoming unusable. So far it looks like the following happens:

We have: A pool with the workers [A, B, C]

  1. Start very slow query, worker A get checked out and handed the query.
  2. Pool has now the workers [B, C].
  3. Worker A returns with a timeout (the slow query still running) and gets added back to the pool as the 1st element of the list.
  4. A new query checks out A again, but A is still busy with the slow (timed out) query resulting in it becoming inoperative.
  5. As A is the 1st worker in the pool now every 1st checkout will now always time out.

This has been verified by:

  1. getting the state of pool after 3 showing A being first element agian
  2. querying A's via get_state (no reply)
  3. querying process_info for A (showing the query of 4 in the message queue as well as the get_state request).
  4. spawning the query 4 in a off process, waiting 500ms and then executing 4 again (forcing A to be taken off the queue by the spawned process) and receiving correct results.

Licenser avatar Feb 17 '17 10:02 Licenser

Do you expect A to be killed in case of timeout? FYI, pooler's return_member/3 have 3rd argument indicating if worker should be returned to pool or killed and replaced. But poolboy have nothing like this

seriyps avatar Feb 17 '17 12:02 seriyps

Given the PR above, I think yes that is a decent approach, however I'm happy to be convinced otherwise.

Licenser avatar Feb 17 '17 12:02 Licenser

@seriyps this is nice in pooler, too:

You can configure Pooler to periodically check for and remove members that have not been used recently to reduce the member count back to its initial size.

I and others kept arguing for something like that to be added to poolboy, but I don't know if they ever did. It makes a lot of sense for an expensive resource to not be 'returned' right away when you have a burst of traffic.

davidw avatar Feb 17 '17 17:02 davidw

Agreed, I think the suggested 'kill on timeout' is sensible tradeoff, as long as the connection works within the requiested timeframe it can keep on living but connections that time out have to die, but at least the death is quick and merciful.

Licenser avatar Feb 17 '17 19:02 Licenser