solid_queue
After a while, the workers stop pulling jobs from the queues, and I need to restart them
Hey there! We've been using SolidQueue for a while now, and we're really happy with it. But there's one thing that's been bugging us a bit. Sometimes, the workers just stop pulling jobs from the Postgres queues, and there's no error message or anything that helps us figure out what's going on. The workers don't crash either.
The only thing I've found that seems to work is to restart the workers. After that, they start pulling jobs from the queues again and processing them just fine.
I've been looking through the existing issues on GitHub, and I found one thread describing a similar problem. But that one was related to development mode and was closed a while back. I remember having that issue back then too.
Do you have any suggestions of things I could try? Or is it just a bug?
Thanks a bunch!
Perhaps I should add a few more details in case they help:
- The workers are deployed in Kubernetes
- SolidQueue is using the main Postgres database
- The workers are connected to the database via a connection pooler (pgbouncer) in transaction mode.
Hey @vitobotta, with that information, I'm not quite sure what could be going on. I haven't experienced this or heard about anyone experiencing this 😕 I think what I'd do is try to figure out what the workers are doing via strace or similar, and check SQL logs if you can, to know whether the workers just stopped polling completely (setting ActiveRecord::Base.logger's level to 0, that is, debug, for all Solid Queue processes should log all the queries, I think).
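Something along these lines is roughly what I mean by turning the query logging on; just a sketch, assuming your processes log to stdout:

# config/initializers/verbose_sql_logging.rb (a sketch; adjust the log device to your setup)
ActiveRecord::Base.logger = ActiveSupport::Logger.new($stdout)
ActiveRecord::Base.logger.level = Logger::DEBUG # level 0, logs every SQL statement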
Hi @rosa! What should I be looking for with something like strace? For now, I was thinking of setting config.solid_queue.silence_polling = false so I can see the messages when polling happens. Is this a good idea, or would it cause too much logging?
What should I be looking for with something like strace?
I'm not sure! 😅 It'd be to get an idea of what the workers are doing when they're not polling.
I was thinking to set config.solid_queue.silence_polling = false
Ah, yes! I had forgotten about that setting. It'll cause a lot of logging while things are working fine, depending on your configuration (number of workers, polling intervals), but it could give some clues about what's going on when they get stuck.
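For reference, it goes in the environment config (or an initializer); a minimal sketch:

# config/environments/production.rb (sketch)
# Stops Solid Queue from silencing the Active Record logs emitted while polling.
config.solid_queue.silence_polling = false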
I will go ahead with that then and let you know how it goes. Thanks!
I think we're facing the same issue. We are running on Google Cloud Run using the puma plugin. We have a min/max instance count of 1 on Cloud Run, so there shouldn't be anything weird about the environment. I've noticed that jobs will stop running at random times. When I check the "Workers" section of mission control, there are none listed. When I check my application logs, I see an "Enqueued RetryAbandondedSolidQueueJobsJob" event and then nothing else but web requests.
Maybe this is an issue with the puma monitor or solid queue supervisor? I'll try running SolidQueue from a Procfile to see if it provides any other info.
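For context, we enable the plugin the standard way from Puma's config; a sketch of what that looks like:

# config/puma.rb (sketch)
# Runs Solid Queue's supervisor inside the Puma process, alongside the web server.
plugin :solid_queue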
Hey @film42, thanks for the details!
I see an "Enqueued RetryAbandondedSolidQueueJobsJob" event
What exactly is this job doing? 🤔
Hey @rosa, I forgot that was one of my own jobs. I don't think the job itself is the problem. But, it could be anything.
As a sidebar, the job looks for abandoned jobs (something we saw more of before we switched Cloud Run to have a minimum number of instances) and ensures they are retried. Most of our jobs have retries enabled, and pruned or exited jobs would sometimes get stuck before they had used up all their retries.
class RetryAbandonedSolidQueueJobsJob < ApplicationJob
  def perform
    abandonded_jobs.find_each do |job|
      job.retry
    end
  end

  def abandonded_jobs
    SolidQueue::FailedExecution.where(
      "error::jsonb->>'exception_class' in (?)",
      # Use class.name here to ensure errors don't drift.
      [
        SolidQueue::Processes::ProcessPrunedError.name,
        SolidQueue::Processes::ProcessExitError.name,
      ]
    )
  end
end
Oh, ok, thanks @film42, the name made me think you could be somehow breaking Solid Queue's internal state by deleting things in a strange way, but using FailedExecution#retry is ok.
Maybe you could try setting config.solid_queue.silence_polling = false and share the logs from when the issue happens, when you no longer see any workers and jobs aren't being executed. This will generate a lot of logs, though, so whether it's feasible depends on how long the issue takes to show up 🤔
Sounds good. I'll turn on polling logs and see what happens.
I was planning to come here and update the thread already, but then I saw the notification of new messages. We are still having this issue. I was on holiday, so I haven't tried enabling the polling logging yet; I didn't want to do it while I was away in case it generated too much logging. But the team told me they had to restart the workers again for the same issue. No crashes, no errors, no useful information in the logs. The workers just seem to freeze and stop processing jobs for some reason. Our Kubernetes clusters are very stable, so I don't think the problem is with the infrastructure. I will disable the silencing of the polling tomorrow in our sandbox cluster and see when it happens again.
@vitobotta are you using the puma plugin as well or running ./bin/jobs in a separate pod? When the workers freeze do you still have workers listed in the "Workers" tab of mission control? (In my case the list is empty).
Hi @film42, we aren't using the puma plugin. We just have dedicated worker pods. I can't remember if I checked the workers tab in mission control the last time I saw the issue. I'll keep an eye on it too.
Just a quick update here. I enabled debug logging and for two days now it's been working perfectly, haha. I'll report back when it eventually stops again.
Here is a screenshot from the logs focused on "SolidQueue::ReadyExecution Exists?". The entries drop off around 15:03. I deployed at 15:15, jobs started running again, and then the workers died again.
@rosa I don't think I'm running out of db connections here but does SQ have tests around an exhausted pool and what might happen? I can try increasing my pool size to see if this happens again.
There are some SQ db calls, and they're primarily:
- SolidQueue::ScheduledExecution Pluck
- SolidQueue::BlockedExecution Pluck
- SolidQueue::Semaphore Pluck
- SolidQueue::Process Load
- SolidQueue::Process Update
Even with these failures, I would expect one of the supervising threads to recreate the worker once it disappears. Is there a way to instrument this process to see if it stops ticking?
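One rough way to instrument this from the database side (a sketch, assuming the standard solid_queue_processes schema) would be to watch the process registry rows that Solid Queue keeps heartbeating:

# Each supervisor/worker/dispatcher registers itself and refreshes last_heartbeat_at;
# a heartbeat that stops advancing would mean a process stopped ticking before it was deregistered.
SolidQueue::Process.order(:last_heartbeat_at).each do |process|
  puts [process.kind, process.pid, process.hostname, process.last_heartbeat_at].join(" | ")
end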
Since upping my db pool size, I haven't had any issues. I wonder if there is an unhandled error in one of the manager threads that pops up when a pool timeout hits? I haven't seen anything in the logs for pool timeouts, though :/ Not really a scientific finding, but that's the only thing I changed.
Thanks @film42, I am gonna try this as well! How did you configure the pool size?
My app and solid_queue share a database pool (for better or worse), so I just increased the number of connections in the main pool (now set to 50). With pgbouncer you can probably go much higher; just keep an eye on your pgbouncer stats to track checkout latency.
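If it helps, pool pressure can also be spot-checked at runtime; a sketch using ActiveRecord's built-in pool stats:

# :waiting > 0, or :busy close to :size, would point at pool exhaustion.
stat = ActiveRecord::Base.connection_pool.stat
Rails.logger.info("AR connection pool: #{stat.inspect}")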
Thanks @film42 - I will increase the pool size and check if it helps.
Just an update. @rosa, I've had zero issues since increasing my Active Record db pool size. Same SQ version the whole time. Hopefully I can reproduce this locally at some point. I'd really like to know why it ended up in a broken state, or why the workers were unregistered (assumed dead) but the processes polling for pid events didn't recreate them.