solid_queue
After a while, the workers stop pulling jobs from the queues, and I need to restart them
Hey there! We've been using SolidQueue for a while now, and we're really happy with it. But there's one thing that's been bugging us a bit. Sometimes, the workers just stop pulling jobs from the Postgres queues, and there's no error message or anything that helps us figure out what's going on. The workers don't crash either.
The only thing I've found that seems to work is to restart the workers. After that, they start pulling jobs from the queues again and processing them just fine.
I've been looking through the existing issues on GitHub, and I found one thread describing a similar problem. But that one was related to development mode and was closed a while back. I remember having that issue back then too.
Do you have any suggestions of things I could try? Or is it just a bug?
Thanks a bunch!
Perhaps I should add a few more details in case they help:
- The workers are deployed in Kubernetes
- SolidQueue is using the main Postgres database
- The workers are connected to the database via a connection pooler (pgbouncer) in transaction mode.
Hey @vitobotta, with that information, I'm not quite sure what could be going on. I haven't experienced this or heard about anyone experiencing this 😕 I think what I'd do is try to figure out what the workers are doing via strace or similar, and check SQL logs if you can, to know whether the workers just stopped polling completely (setting ActiveRecord::Base.logger's level to 0, that is, debug, for all Solid Queue processes should log all the queries, I think).
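Something along these lines is roughly what I mean by turning the query logging on; just a sketch, assuming your processes log to stdout:

# config/initializers/verbose_sql_logging.rb (a sketch; adjust the log device to your setup)
ActiveRecord::Base.logger = ActiveSupport::Logger.new($stdout)
ActiveRecord::Base.logger.level = Logger::DEBUG # level 0, logs every SQL statement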
Hi @rosa! What should I be looking for with something like strace? For now, I was thinking of setting config.solid_queue.silence_polling = false so I can see the messages when polling happens. Is this a good idea, or would it cause too much logging?
What should I be looking for with something like strace?
I'm not sure! 😅 It'd be to get an idea of what the workers are doing when they're not polling.
I was thinking to set config.solid_queue.silence_polling = false
Ah, yes! I had forgotten about that setting. It'll cause a lot of logging while things are working fine, depending on your configuration (number of workers, polling intervals), but it could give some clues about what's going on when they get stuck.
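For reference, it goes in the environment config (or an initializer); a minimal sketch:

# config/environments/production.rb (sketch)
# Stops Solid Queue from silencing the Active Record logs emitted while polling.
config.solid_queue.silence_polling = false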
I will go ahead with that then and let you know how it goes. Thanks!
I think we're facing the same issue. We are running on Google Cloud Run using the puma plugin. We have a min/max instance count of 1 on Cloud Run, so there shouldn't be anything weird about the environment. I've noticed that jobs will stop running at random times. When I check the "Workers" section of mission control, there are none listed. When I check my application logs, I see an "Enqueued RetryAbandondedSolidQueueJobsJob" event and then nothing else but web requests.
Maybe this is an issue with the puma monitor or solid queue supervisor? I'll try running SolidQueue from a Procfile to see if it provides any other info.
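For context, we enable the plugin the standard way from Puma's config; a sketch of what that looks like:

# config/puma.rb (sketch)
# Runs Solid Queue's supervisor inside the Puma process, alongside the web server.
plugin :solid_queue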
Hey @film42, thanks for the details!
I see an "Enqueued RetryAbandondedSolidQueueJobsJob" event
What exactly is this job doing? 🤔
Hey @rosa, I forgot that was one of my own jobs. I don't think the job itself is the problem. But, it could be anything.
As a sidebar, the job looks for abandoned jobs (something we saw more of before we switched Cloud Run to have a minimum number of instances) and ensures they are retried. Most of our jobs have retries enabled, and pruned or exited jobs would sometimes get stuck before they had used up all their retries.
class RetryAbandonedSolidQueueJobsJob < ApplicationJob
  def perform
    abandonded_jobs.find_each do |job|
      job.retry
    end
  end

  def abandonded_jobs
    SolidQueue::FailedExecution.where(
      "error::jsonb->>'exception_class' in (?)",
      # Use class.name here to ensure errors don't drift.
      [
        SolidQueue::Processes::ProcessPrunedError.name,
        SolidQueue::Processes::ProcessExitError.name,
      ]
    )
  end
end
Oh, ok, thanks @film42, the name made me think you could be somehow breaking Solid Queue's internal state by deleting things in a strange way, but using FailedExecution#retry is ok.
Maybe you could try setting config.solid_queue.silence_polling = false and share the logs from when the issue happens, when you no longer see any workers and jobs aren't being executed. This will generate a lot of logs, though, so whether it's feasible depends on how long the issue takes to show up 🤔
Sounds good. I'll turn on polling logs and see what happens.
I was planning to come here and update the thread already, but then I saw the notification of new messages. We are still having this issue. I was on holiday, so I haven't tried enabling the polling logging yet; I didn't want to do it while I was away in case it generated too much logging. But the team told me they had to restart the workers again for the same issue. No crashes, no errors, no useful information in the logs. The workers just seem to freeze and stop processing jobs for some reason. Our Kubernetes clusters are very stable, so I don't think the problem is with the infrastructure. I will disable the silencing of the polling tomorrow in our sandbox cluster and see when it happens again.
@vitobotta are you using the puma plugin as well or running ./bin/jobs in a separate pod? When the workers freeze do you still have workers listed in the "Workers" tab of mission control? (In my case the list is empty).
Hi @film42, we aren't using the puma plugin. We just have dedicated worker pods. I can't remember if I checked the workers tab in mission control the last time I saw the issue. I'll keep an eye on it too.
Just a quick update here. I enabled debug logging and for two days now it's been working perfectly, haha. I'll report back when it eventually stops again.
Here is a screenshot from the logs focused on "SolidQueue::ReadyExecution Exists?". The entries drop off around 15:03. I deployed at 15:15, jobs started running again, and then the workers died again.
@rosa I don't think I'm running out of db connections here but does SQ have tests around an exhausted pool and what might happen? I can try increasing my pool size to see if this happens again.
There are some SQ db calls, and they're primarily:
- SolidQueue::ScheduledExecution Pluck
- SolidQueue::BlockedExecution Pluck
- SolidQueue::Semaphore Pluck
- SolidQueue::Process Load
- SolidQueue::Process Update
Even with these failures, I would expect one of the supervising threads to recreate the worker once it disappears. Is there a way to instrument this process to see if it stops ticking?
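One rough way to instrument this from the database side (a sketch, assuming the standard solid_queue_processes schema) would be to watch the process registry rows that Solid Queue keeps heartbeating:

# Each supervisor/worker/dispatcher registers itself and refreshes last_heartbeat_at;
# a heartbeat that stops advancing would mean a process stopped ticking before it was deregistered.
SolidQueue::Process.order(:last_heartbeat_at).each do |process|
  puts [process.kind, process.pid, process.hostname, process.last_heartbeat_at].join(" | ")
end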
Since upping my db pool size, I haven't had any issues. I wonder if there is an unhandled error in one of the manager threads that pops up when a pool timeout hits? I haven't seen anything in the logs for pool timeouts, though :/ Not really a scientific finding, but that's the only thing I changed.
Thanks @film42, I am gonna try this as well! How did you configure the pool size?
My app and solid_queue share a database pool (for better or worse), so I just increased the number of connections in the main pool (now set to 50). With pgbouncer you can probably go much higher; just keep an eye on your pgbouncer stats to track checkout latency.
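If it helps, pool pressure can also be spot-checked at runtime; a sketch using ActiveRecord's built-in pool stats:

# :waiting > 0, or :busy close to :size, would point at pool exhaustion.
stat = ActiveRecord::Base.connection_pool.stat
Rails.logger.info("AR connection pool: #{stat.inspect}")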
Thanks @film42 - I will increase the pool size and check if it helps.
Just an update. @rosa, I've had zero issues since increasing my Active Record db pool size. Same SQ version the whole time. Hopefully I can reproduce this locally at some point. I'd really like to know why it ended up in a broken state, or why the workers were unregistered (assumed dead) but the processes polling for pid events didn't recreate them.