
DB connection pool size

kersuzananthony opened this issue on Aug 01 '24

Hello,

In the company I work for, we are considering using the solid_queue gem to process our background jobs.

Long story short, in our database configuration file, we have set the pool size to 1.

It appears that the default configuration does not match solid_queue's usage of the pool, even when we set the workers to use 1 thread.

For reasons you can imagine, changing the pool size in the database configuration may not be the best approach in our case, as it would require us to audit our main app codebase and make sure everything regarding the DB is thread-safe, which is currently not the case.

Would it be possible to start the workers without relying on a separate thread pool for running the jobs? It could be something configurable, as an alternative for users of the gem whose DB pool size cannot be easily increased.

Thanks for reading!

kersuzananthony avatar Aug 01 '24 09:08 kersuzananthony

Hey @kersuzananthony, thanks for considering solid_queue for your app 🙏

I'm afraid there isn't a way right now to run Solid Queue workers with a pool smaller than 3. Each worker needs at least 3 connections: one for polling, one to run the jobs with a thread pool of size 1, and one for the heartbeat 😬

One idea, perhaps, could be to configure Solid Queue with its own pool, instead of using your app's pool, by setting a separate DB configuration 🤔 This can be done even if the DB you're using is the same. I think that should work fine in your case.
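For example, something like this in database.yml (just a sketch; the database name and pool sizes are placeholders), where both entries point at the same physical database:

production:
  primary:
    <<: *default
    database: my_app_production
    pool: 1
  queue:
    <<: *default
    database: my_app_production # same DB, separate pool
    pool: 5

and then pointing Solid Queue at it with config.solid_queue.connects_to = { database: { writing: :queue } } in your environment config, as described in the README.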

rosa avatar Aug 01 '24 09:08 rosa

Hello @rosa, thank you for your answer and for the idea you suggested. We will try a separate DB configuration as suggested.

kersuzananthony avatar Aug 01 '24 11:08 kersuzananthony

Going to close this one as there's nothing else to do on the Solid Queue side, but if you run into trouble with the suggestion let me know or feel free to reopen 🙏

rosa avatar Aug 21 '24 16:08 rosa

Hey @rosa! Sorry for not coming back to you. I think @kersuzananthony tried your suggestion and it worked perfectly! Thanks again 🙏

beauraF avatar Aug 23 '24 08:08 beauraF

Nice! No problem at all, thanks a lot for letting me know! 🙏

rosa avatar Aug 23 '24 09:08 rosa

@rosa We are actually running into a tangential issue: we are consistently running out of DB connections in our app on Heroku. The provisioned DB is Postgres, with a hard limit of 20 connections. We've had several enqueuing failures, and deploys have failed because the Solid Queue connections are gobbled up. We also only have a single-threaded worker process, so the worker itself should only account for 3 connections, but we consistently run out. The count constantly fluctuates between 7 and 17 connections.

It's hard to pinpoint the culprit; it could be a variety of things: Heroku opening idle connections for its own internal polling/backup logic, the web threads inappropriately holding onto connections, actual traffic, etc.

We're trying to figure out the math here. We also implemented Solid Cache, and due to the cost limitations of Heroku, the Solid Cache tables sit in the same DB as our Solid Queue tables. We just deployed this yesterday and it became an issue almost instantly.

Do you have any documentation on how the web processes acquire and release DB connections? We have Puma workers/threads, and it seems like there's an equation in here somewhere relating our pool values to the cache and queue combined.

dlinch avatar Nov 20 '24 17:11 dlinch

Do you have any documentation on how the web processes acquire and release DB connections? We have Puma workers/threads

Active Record is the one checking out and releasing DB connections, and it'll use as many as you have defined in pool in your database.yml. For Puma, it depends on whether you're running in clustered mode and how many threads you're running per worker.

Is your 20-connection limit per process or in total? If it's in total, it's going to be quite tight 🤔 You'll need as many as Puma processes x threads x 2 for Solid Cache (I believe; I'm not 100% sure on this, but I think you'll need one to read/write to the cache from your app processes + another for the process that does the key expiring) + 3 for the Solid Queue worker + 2 for the Solid Queue supervisor + 2 for the Solid Queue dispatcher + 1 if you're running recurring tasks 🤔
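As a rough worked example (the Puma numbers here are made up): say you ran 2 Puma processes with 5 threads each. That'd be:

2 processes x 5 threads x 2 (Solid Cache) = 20
Solid Queue worker                        =  3
Solid Queue supervisor                    =  2
Solid Queue dispatcher                    =  2
Recurring tasks (if enabled)              =  1
Total                                     = 28

which would already be past a 20-connection cap.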

rosa avatar Nov 20 '24 17:11 rosa

@rosa Total; it's a hard limit Heroku imposes for some of its Essential-tier Postgres DBs. We actually found that this blocked deploys, as the release worker tries to connect to the DB to run the db:setup command and fails because all connections are in use. So regardless of what we define in database.yml, we're bound by Heroku's limit, in this instance 20.

I know normally this wouldn't be a big issue, as I'm used to the limit being something like 50, and in most environments you can simply change the connection limit on Postgres with a config command, but for PaaS providers like Heroku, and anyone else who may manage your DB, this hard connection limit is something we immediately bumped into.

I imagine some brief guidance in the docs would be useful for folks like us who are stuck on something like Heroku for the foreseeable future. Even the smallest Redis instances on Heroku come with 200 connections, so this was always a non-issue before and is now definitely an issue on smaller instances. To be clear, our production instance had no problems; this is specifically for staging, where we're keeping the resources relatively small to keep costs down.

EDIT

For reference, here is the documentation for the different Postgres add-on tiers; the Essential 0-1 tiers provide only 20 connections.

dlinch avatar Nov 20 '24 19:11 dlinch

@rosa Sorry to bug you again. We just had a production incident directly related to running out of DB connections to our Solid database, and I'm realizing I made some incorrect assumptions last month. I failed to mention that we actually have a single DB set up for both SolidCache and SolidQueue; it was a cost-saving maneuver for us to set it up that way.

I made an incorrect assumption that SolidQueue was somehow tied to our Rails app's Puma configuration, but given that it boots from the SolidQueue::Cli.start command, I now think it has nothing to do with Puma, so I can safely ignore the DB connection count having anything to do with Puma threads/workers from our worker dynos, correct? You also mentioned that Active Record handles DB connections, but the SolidQueue process seemingly doesn't go through Active Record when it boots up a new thread, and some testing we did locally seemed to confirm this. If we scaled the threads, it would just keep eating up connections regardless of what our Rails DB configuration said should be available.

So I did some digging, and based on what you said, the CLI boots up however many processes we configure in queue.yml, but defaults to 1. It will then spin up 3 threads as a default, a supervisor, and a dispatcher.

This would mean we have 3 threads, each counting as a "worker" that uses 3 DB connections? Or does each thread represent a DB connection, and that's where the 3 came from? Then +2 for the single supervisor and +2 for the dispatcher process in total, independent of the configured threads/processes.

That would lead us to each worker dyno using 13-14 connections, assuming I didn't alter the processes or threads configuration.

For each web connection, you said Processes X Threads X 2 for the cache; can I assume it will require 1 connection for connecting to the queue DB to insert jobs? Or does every Puma thread also take up 3 DB connections?

What I'm on the hunt for is an equation to help us count the number of DB connections we will consume, so we can figure out our max dyno allotment given a specific threads and processes concurrency. I'm hoping you can help here, as we've bumped into this quite a few times already and have only had Solid implemented for less than 6 weeks. No one on my team has had to worry too much about DB connections before, and it's just not obvious at all to me how many we can expect to use when we scale our services.

dlinch avatar Dec 17 '24 21:12 dlinch

I made an incorrect assumption that SolidQueue was somehow tied to our Rails app's Puma configuration, but given that it boots from the SolidQueue::Cli.start command, I now think it has nothing to do with Puma, so I can safely ignore the DB connection count having anything to do with Puma threads/workers from our worker dynos, correct?

Correct!

the SolidQueue process seemingly doesn't go through Active Record when it boots up a new thread, and some testing we did locally seemed to confirm this. If we scaled the threads, it would just keep eating up connections regardless of what our Rails DB configuration said should be available.

Hmm... that doesn't sound possible. If the DB connection pool for Solid Queue is, say, 5, and Solid Queue tries to check out 6 connections, you'll get an error.
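For illustration, a sketch you could run in a Rails console (assuming a hypothetical app whose pool is configured to 1):

# One thread holds the pool's only connection for a while...
Thread.new do
  ActiveRecord::Base.connection_pool.with_connection { sleep 10 }
end
sleep 0.1 # give that thread time to check the connection out

# ...so trying to check out a second connection waits for
# `checkout_timeout` (5 seconds by default) and then raises
# ActiveRecord::ConnectionTimeoutError:
ActiveRecord::Base.connection_pool.checkout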

the CLI boots up however many processes we configure in queue.yml, but defaults to 1. It will then spin up 3 threads as a default, a supervisor, and a dispatcher.

Not exactly: these are processes, not threads, and it'll start 3 processes: one supervisor, one worker, and one dispatcher. If you have recurring tasks, it'll also start a scheduler for that.

The "3" I mentioned before was just for a single worker. Each worker uses 3 connections as a minimum: one to poll, one for the threads that run jobs and one for the heartbeat. Then 2 for the supervisor and 2 for the dispatcher like you said. That'd be 7 connections at a minimum to run the jobs.

For each web connection, you said Processes X Threads X 2 for the cache; can I assume it will require 1 connection for connecting to the queue DB to insert jobs? Or does every Puma thread also take up 3 DB connections?

It should use just 1 connection to enqueue jobs, not 3.

What I'm on the hunt for is an equation to help us count the number of DB connections we will consume, so we can figure out our max dyno allotment given a specific threads and processes concurrency.

I think, given this very tight restriction, you'd probably be better off using something else, such as Sidekiq or Resque, and Redis for the cache rather than the DB 🤔 20 connections is very low. PostgreSQL default is 100, and MySQL is 200, I believe.

rosa avatar Dec 17 '24 21:12 rosa

Ah, this was on a production application with a DB connection limit of 400, and we hit it. Some of our specific stats:

App: 8 dynos, Web Concurrency 8, threads 5. So each dyno is clustered with 40 Puma threads, 320 threads total across all the dynos.

Worker: 5 dynos, with SolidQueue processes set to 1 and threads unset, which defaults to 3. So each dyno spins up 3 polling workers in a thread, a supervisor, and a dispatcher, meaning each worker dyno should only take up 7 connections; 35 connections across all worker dynos in our case.

This is our beefiest instance, but we completely slammed into our max connections and ground our app to a halt, because we have a cache hit in an application controller, meaning every request was attempting to hit our Solid DB and failing because all the other DB connections were gobbled up. You were saying that each app thread would take up 2 connections for the Cache; does that mean the bottleneck was actually SolidCache, and we were trying to check out 640 connections, given each thread was trying to check out 2 connections and we had spun up 320 threads total?

dlinch avatar Dec 17 '24 22:12 dlinch

Worker: 5 dynos, with SolidQueue processes set to 1 and threads unset, which defaults to 3. So each dyno spins up 3 polling workers in a thread, a supervisor, and a dispatcher, meaning each worker dyno should only take up 7 connections; 35 connections across all worker dynos in our case.

No, if you're configuring 3 threads per worker, each worker will use 3 + 2 (the threads run the jobs; then you have 1 connection for polling and 1 for the heartbeat). This is what I said earlier: each worker needs at least 3 connections: one for polling, one to run the jobs with a thread pool of size 1, and one for the heartbeat. Then 2 for the supervisor and 2 for the dispatcher. That'd be 9 instead of 7. If you're running 5 dynos like this, that'd be 45. You could also run multiple worker processes and save connections: just one supervisor managing multiple worker processes, instead of just one worker with 3 threads. You shouldn't need that many dispatchers, for example.
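For example, something like this in queue.yml (a sketch; the numbers are just illustrative) would run 3 worker processes under a single supervisor on one dyno:

production:
  dispatchers:
    - polling_interval: 1
  workers:
    - queues: "*"
      threads: 3
      processes: 3

That's 2 (supervisor) + 2 (dispatcher) + 3 x (3 + 2) = 19 connections on one dyno, versus 27 (3 x 9) for three dynos each running the full supervisor + dispatcher + worker stack.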

You were saying that each app thread would take up 2 connections for the Cache

That's what I think, because you'd need one connection to read/write and another to manage expiration (I think Solid Cache uses a thread for that), but I think you should look at Solid Cache's code.

How have you configured your pool size in your database.yml file?

rosa avatar Dec 17 '24 22:12 rosa

@rosa Sorry, there's a lot of shared nomenclature across this infrastructure, and differentiating between a SolidQueue worker, a Puma thread, a SolidQueue thread, and a "dyno" worker makes this even trickier to talk through. I meant that each thread was 1 connection and polled to actually run the jobs, so we're in agreement there.

Our pool size is currently set to 5, a setting we used for some local testing and never bothered to change because it seemed like it wasn't being respected anyway. The database.yml specifies the cache and queue DBs with pools of 5, but they point to the same DB attachment in Heroku. This would normally lead me to believe it would only allow 10 total, but in our Heroku PG stats we're seeing ~30 connections in our staging environment, and as I said earlier, our incident today maxed connections out at 400. If, as you're saying, the yml configuration would ultimately only allow 10, I would have expected errors at deploy time, but we seem to be hitting the DB connection limit at runtime with increased app usage, not whatever limit Active Record says we should have.

dlinch avatar Dec 17 '24 22:12 dlinch

a SolidQueue worker, a Puma thread, a SolidQueue thread, and a "dyno" worker makes this even trickier to talk through. I meant that each thread was 1 connection and polled to actually run the jobs, so we're in agreement there.

I don't think we're in agreement, because I don't know what "polled to actually run the jobs" means. I'm using process and thread here with their general meanings: a Ruby thread, a process in your operating system. Solid Queue doesn't have a separate definition for these, nor does Puma.

Our pool size is currently set to 5, a setting we used for some local testing and never bothered to change because it seemed like it wasn't being respected anyway.

I don't see how this is not being respected. No process (again, I'm referring here to the general definition of process) in your setup above uses more than five connections to your cache/queue DB.

rosa avatar Dec 18 '24 09:12 rosa

I don't want to devolve into semantics, but you said:

No, if you're configuring 3 threads per worker, each worker will use 3 + 2 (the threads run the jobs; ...)

I was simply saying that each thread correlates to a single DB connection to our Solid database.

What information can I send to you to prove my hypothesis that SolidQueue seems to be ignoring our pool size? Here is an experiment I ran locally:

Configured queue.yml:

processes: 1
threads: 3

SQL command to check the number of active connections to our DB:

select numbackends from pg_stat_database where datname = '<OUR_SOLID_DB_NAME>';

In another terminal window, I ran the local SolidQueue::CLI daemon, and in a Rails console I queued up a bunch of jobs.

The SQL query returned 4, which already doesn't make sense, because there should be a connection per thread plus 2. But here are the inputs and outputs I tested for some more sample data. I did verify that the backends dropped to zero as I killed the daemon and re-ran it after updating the config file each time.

1 process
------------------
4 threads: 4 backends
5 threads: 4 backends
10 threads: 8 backends

2 processes
------------------
3 threads: 9 backends
4 threads: 11 backends
5 threads: 13 backends
10 threads: 13 backends

3 processes
------------------
3 threads: 12 backends
4 threads: 15 backends
5 threads: 18 backends
10 threads: 18 backends

We blow way past our 5 configured connections, and the formula in there seems to be Threads X Processes + 3, with 5 threads being our max. The single-process results don't seem to conform.

dlinch avatar Dec 18 '24 19:12 dlinch

Could you please read everything I wrote above again, but more carefully? In particular, the part about the different processes Solid Queue runs: supervisor, workers, dispatcher. The quoted phrase refers to the workers only.

Then this:

The SQL query returned 4, which already doesn't make sense

Again, Solid Queue would use at most (this is copying again from another comment):

each worker will use 3 + 2 (the threads run the jobs; then you have 1 connection for polling and 1 for the heartbeat). This is what I said earlier: each worker needs at least 3 connections: one for polling, one to run the jobs with a thread pool of size 1, and one for the heartbeat. Then 2 for the supervisor and 2 for the dispatcher. That'd be 9.

So, running Solid Queue as you are, you'll use at most 9 connections. If you increase the threads and the processes, that will obviously increase.

The concepts of workers, dispatchers, supervisors, etc. are also explained in the README. There's also a section on threads and processes.

I think now that your problem here comes from thinking the Active Record connection pool is global across all processes connecting to your database. It's not; it's per-process. This is also a good read. That's what I've been trying to convey in all my comments about connections per process, when talking about Puma processes in clustered mode, Solid Queue processes, etc. The pool can't be global: Active Record has no way of knowing which other processes you're running outside its own process, maybe even on other machines.
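To illustrate with made-up numbers: with pool: 5 in database.yml, each individual process is capped at 5 connections to that DB, but the total across processes is not:

supervisor process  -> up to 5 (typically 2)
dispatcher process  -> up to 5 (typically 2)
2 worker processes  -> up to 5 each
a Rails console     -> up to 5 (typically 1)

So a configured pool of 5 can still show up as 20+ backends on the database side, with no single process ever exceeding its own pool.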

Solid Queue has no way of not respecting Active Record connection pool size because that's all managed by Active Record, as I've said before as well.

rosa avatar Dec 18 '24 19:12 rosa

@rosa That was a fundamental misunderstanding on my part; thank you for clarifying, and thank you for your patience. This is the first time I've had to dive deep on connection pooling, and I'm still wrapping my head around how SolidQueue uses processes and threads.

Here is my updated understanding:

Upon initialization, SolidQueue starts up a supervisor and a dispatcher process, each of which takes 2 DB connections. It will also spin up a scheduler if you have recurring tasks, which takes 1 additional connection, and a worker process. I can control the number of workers by changing the processes configuration. Each worker takes up 2 + threads connections, where threads is also set in the queue.yml configuration. Each process is where the thread pool is respected, so my 2 + threads figure shouldn't be higher than the configured pool in my database.yml.
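Writing that out as a formula for my own sanity (assuming my understanding above is right):

connections per Solid Queue dyno =
    2 (supervisor)
  + 2 (dispatcher)
  + 1 (scheduler, only if recurring tasks are configured)
  + sum over worker processes of (threads + 2)

with each individual process additionally capped by pool in database.yml.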

You said something in an earlier comment:

You could also run multiple worker processes and save connections, just one supervisor managing multiple worker processes instead of just one with 3 threads. You shouldn't need that many dispatchers, for example.

This comment implies you can have multiple supervisors and dispatchers. Is SolidQueue spinning up a dispatcher and a supervisor per queue key, or is it always a single dispatcher and supervisor in total? I ask because the yml below, from the README, has two queue keys under workers.

production:
  dispatchers:
    - polling_interval: 1
      batch_size: 500
      concurrency_maintenance_interval: 300
  workers:
    - queues: "*"
      threads: 3
    - queues: [ real_time, background ]
      threads: 5
      processes: 3

dlinch avatar Dec 18 '24 21:12 dlinch