Are there any plans to make Solid Queue more performant on Postgres?
Hey there, after running Solid Queue for a few months, we're seeing occasional spikes in our CPU usage. From what I understand in the README, there are some limitations due to Postgres not implementing loose index scan and I've ensured that my configuration accounts for these limitations. For example, my queue is defined to be *, and we've avoided pausing queues.
But we still sometimes see a spike in the execution time of the SELECT query on solid_queue_ready_executions. It seems like this query is already using an index scan but somehow the query still takes almost half a second to run, and it was executed over 500k times within a span of around 15 minutes. This is coming from Solid Queue itself, not from me viewing the dashboard on Mission Control, but I also see similar queries that take this long when I view the Queues tab in Mission Control.
Here are some screenshots of the query in question:
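For context, the polling query is roughly of this shape (paraphrased; the exact columns and clauses Solid Queue generates may differ a bit):

SELECT job_id
FROM solid_queue_ready_executions
ORDER BY priority ASC, job_id ASC
LIMIT $1
FOR UPDATE SKIP LOCKED  -- lets concurrent workers claim different rows without blocking each other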
Are there any plans to make Solid Queue more performant on Postgres? Thank you!
Hey @salmonsteak1, I'm afraid that query is already completely optimized, but I've got another improvement in mind to reduce the polling frequency.
It seems like this query is already using an index scan but somehow the query still takes almost half a second to run
Is it possible your DB is under some additional load unrelated to this that makes it so slow?
Hey @rosa, we've actually increased our polling interval to 0.5 seconds as the default of 0.1s was consuming way too much CPU
We're running Solid Queue on a separate Cloud SQL instance that only has a single DB dedicated to Solid Queue, so I doubt the DB is under any additional load
I'd actually like to ask if a fix for the pausing of queues on a Postgres DB is on the roadmap anytime soon? Due to the nature of our traffic, we wanted to use that feature to pause specific queues when there is a surge in jobs. Given the current situation, pausing it under such surges would put even more load on the DB which really isn't ideal. Thanks!
I'd actually like to ask if a fix for the pausing of queues on a Postgres DB is on the roadmap anytime soon?
Not anytime soon, no. There's a way to improve it with a recursive CTE as proposed here, but it would need to be different from that proposal.
The improvement I have in mind to reduce the polling frequency would help in your case, I think (the idea is to configure and poll for more jobs than the worker has capacity for, and keep them in memory), but even with that, you'd still see bad performance. I don't know why your DB is so slow in these cases. How many rows do you normally have in the solid_queue_ready_executions table? It feels like something is off there.
We'd normally have around <100 jobs in our solid_queue_ready_executions table. Occasionally, there will be a spike in the number of jobs due to a surge in traffic, and I think that's when these long queries occur
Hey @rosa, I was also wondering if you guys are setting preserve_finished_jobs to false? I'm thinking that might be causing some additional overhead on the database (on top of what was mentioned in this thread). Thanks!
On top of that, I'd like to ask if you guys did any tuning to your MySQL DB when running Solid Queue on your end (e.g. setting specific flags, or using a connection pooler, etc.)? It seems like we have around the same amount of traffic based on what you've mentioned here, but I'm not sure if you guys experience any occasional spikes in traffic or it's more of a consistent number that can be predicted. I'm considering using MySQL for our Solid Queue DB if it's really kind of just a "set and forget" implementation.
Hey @rosa, sorry for pinging again, but just checking to see if you have any inputs on this!
Hey, so sorry for the delay! Missed the last couple of notifications!
if you guys are setting preserve_finished_jobs to false?
No, we keep that to true, and then we periodically delete finished jobs like this:
clear_solid_queue_finished_jobs:
  command: SolidQueue::Job.clear_finished_in_batches(sleep_between_batches: 0.3)
  schedule: every hour at minute 12 and 42
We also keep the default for
config.solid_queue.clear_finished_jobs_after = 1.day
Our solid_queue_jobs table is pretty big, usually around 35M jobs. The solid_queue_ready_executions table, however, is normally very small (< 100), but we do have spikes where we have more jobs, normally as a result of a cron job enqueuing many. We might get to 50K to 100K or so. We never experience any slow polling queries 😕
We didn't tune MySQL in any special way for Solid Queue. The DB follows the same configuration as our main app. Maybe it's a matter of hardware? We run the Solid Queue DB on a decently big machine, with an upper bound of 32 CPUs assigned (the machine is bigger, but it's shared with other DBs).
@rosa Thank you for the information!
A few more questions: are the Solid Queue workers directly hitting the MySQL DB? Would you mind sharing the average number of connections your DB usually has? Right now we're using pgbouncer as a connection pooler for our Solid Queue DB.
Yes, quite a lot! We have around 2,000 connections. These come from the workers, supervisor, dispatchers, and schedulers, but also from the app servers when we enqueue jobs. All of them connect directly, yes.
Can you monitor the table's dead tuples? Frequent deletes can leave statistics outdated, which, combined with the dead tuples, can slow down queries on the table.
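For example, just a monitoring sketch using the standard pg_stat_user_tables view:

-- Dead-tuple counts and last (auto)vacuum/analyze runs for the table
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'solid_queue_ready_executions';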
One quick and dirty idea is to truncate solid_queue_ready_executions when it's empty, to discard all dead tuples at once instead of leaving the work to vacuum.
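A rough sketch of that idea (keep in mind TRUNCATE takes an ACCESS EXCLUSIVE lock, so polling and enqueuing block while this runs):

DO $$
BEGIN
  -- Grab the exclusive lock first so no rows can be inserted between the
  -- emptiness check and the TRUNCATE; the lock is released at commit.
  LOCK TABLE solid_queue_ready_executions IN ACCESS EXCLUSIVE MODE;
  IF NOT EXISTS (SELECT 1 FROM solid_queue_ready_executions) THEN
    TRUNCATE solid_queue_ready_executions;
  END IF;
END;
$$;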
If it's consistent tuple bloat, another alternative would be to install the timescaledb extension in PostgreSQL and make solid_queue_ready_executions a hypertable. Then, with time-partitioned chunks, the table bloat would be much smaller, and old partitions could be truncated without locking the table that's receiving data.
Hey, thanks for the tips @jonatas! It looks like TRUNCATE requires an exclusive lock, which might not be ideal since I'm pretty sure there are frequent reads/writes to the solid_queue_ready_executions table. I'm also running the DB on Cloud SQL, so installing the timescaledb extension is not possible 😕
Right now I'm running a VACUUM ANALYZE on the table every few seconds or so, and it seems to help a bit, but sometimes there are still spikes in CPU consumption, presumably due to LWLock:buffer_content.
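In case it's useful to anyone else, the pg_cron schedule looks roughly like this (a sketch; the job name is arbitrary, and the 'N seconds' syntax needs pg_cron 1.5+, older versions only go down to one-minute granularity):

-- Vacuum and refresh statistics for the ready executions table very frequently
SELECT cron.schedule(
  'vacuum-solid-queue-ready-executions',
  '30 seconds',
  'VACUUM (ANALYZE) solid_queue_ready_executions'
);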
Good point @salmonsteak1. The exclusive lock would be a deal-breaker with that number of open connections. A few other tweaks to the vacuum config that may help:
ALTER TABLE solid_queue_ready_executions SET (
  autovacuum_vacuum_threshold = 10,        -- Much lower than the default 50
  autovacuum_vacuum_scale_factor = 0,      -- Ignore table size completely
  autovacuum_freeze_min_age = 10000000,    -- Default is 50 million
  autovacuum_freeze_table_age = 50000000,  -- Default is 150 million
  fillfactor = 70                          -- Leave space for HOT updates
);
Setting a very low threshold (10 rows) ensures the vacuum runs frequently, even when the table contains very few rows.
Zero scale factor: autovacuum triggers once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × (live tuples), so by setting the scale factor to 0 you're telling PostgreSQL to ignore the table size entirely and use the absolute threshold.
Lower freeze ages: Queue tables often see high transaction churn, so lowering the freeze age parameters helps prevent transaction ID wraparound issues.
Also, the fill factor can help here. It provides space for HOT updates, though this is less critical in a pure queue table with mostly inserts and deletes.
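If you want to double-check that the per-table overrides actually took effect, they show up in pg_class.reloptions:

SELECT relname, reloptions
FROM pg_class
WHERE relname = 'solid_queue_ready_executions';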
Hey @jonatas, would these values help if I'm already triggering a manual vacuum every few seconds or so? From what I've observed, I often see "cancelling autovacuum task" errors, presumably because the autovacuum operations conflict with the manual vacuum I'm executing via pg_cron. The reason for doing it this way is that I suspect the autovacuum operations somehow get cancelled and other table operations take precedence over them, which ultimately results in the spikes. Or at least that's what I observed when I tried to rely solely on autovacuum.
I've also previously tried to tune autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor similarly to what you've recommended, but to no avail
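As a side note, the standard pg_stat_progress_vacuum view is handy for checking whether a vacuum (manual or auto) is actually making progress on the table:

-- Shows any vacuum currently running against the ready executions table
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_total, heap_blks_scanned, heap_blks_vacuumed
FROM pg_stat_progress_vacuum
WHERE relid = 'solid_queue_ready_executions'::regclass;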
Thanks for your help!
Got it, thanks for sharing. It's a complex dance between scheduled tasks and autovacuum both trying to get some work done 🤓
If you get any other lessons, I'm continuously learning from your comments and observations too.
Going to close this one as the query is completely optimised and any other solution would most likely be PostgreSQL specific. Thanks @salmonsteak1 and @jonatas for all the info and debugging 🙏
Are you really not considering doing DB-specific optimizations? What speaks against adding some of these for each DB technology?
@jonatas mentioned pretty good optimizations! Other similar (DB-specific) projects do this even more extensively, like https://github.com/janbjorge/pgqueuer/blob/main/pgqueuer/qb.py#L547
ALTER TABLE queue_table SET (
  /* vacuum very early: after ~1 % or 1 000 dead rows */
  autovacuum_vacuum_scale_factor = 0.01,  -- maybe better than 0, which could lead to the cancelled statements!
  autovacuum_vacuum_threshold = 1000,     -- to be tested against the suggested 10!
  /* analyse even earlier so the planner keeps up to date */
  autovacuum_analyze_scale_factor = 0.02,
  autovacuum_analyze_threshold = 500,
  /* work fast: spend ≈10× the default IO budget */
  autovacuum_vacuum_cost_limit = 10000,
  autovacuum_vacuum_cost_delay = 0,       -- no 20 ms naps
  /* leave headroom for HOT updates so fewer dead tuples */
  fillfactor = 70,
  /* add the points suggested above? */
  autovacuum_freeze_min_age = 10000000,   -- Default is 50 million
  autovacuum_freeze_table_age = 50000000  -- Default is 150 million
);
@MRigal it may be worth just packaging this in a solid_queue-pg gem for those who could benefit from it.
Putting it in the official project may require the maintainers to get specialized in PostgreSQL.