
SolidQueue::Processes::ProcessExitError

Open gczh opened this issue 1 year ago • 9 comments

I set up my solid_queue.yml to run a recurring task every 30s that checks for records to sync:

default: &default
  dispatchers:
    - polling_interval: 1
      batch_size: 500
      recurring_tasks:
        sync_records:
          class: SyncRecordsJob
          schedule: "*/30 * * * * *"
  workers:
    - queues: "*"
      threads: 3
      processes: <%= ENV.fetch("JOB_CONCURRENCY", 1) %>
      polling_interval: 0.1

development:
  <<: *default

test:
  <<: *default

production:
  <<: *default

The SyncRecordsJob does a SyncBatchJob.perform_later(record_ids), and the batch jobs are expected to take some time to run as they have to process quite a number of records (approx. 30s–1min).
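For reference, a minimal sketch of the fan-out pattern described above (class names and batch size are taken from the comments; SyncBatchJob is stubbed here so the pattern runs without Rails):

```ruby
# Hypothetical sketch of the SyncRecordsJob -> SyncBatchJob fan-out.
# In the real app SyncBatchJob would be an ActiveJob class; here it
# just records what would have been enqueued.
class SyncBatchJob
  def self.enqueued
    @enqueued ||= []
  end

  def self.perform_later(record_ids)
    enqueued << record_ids
  end
end

# The recurring job slices pending record ids into batches of 500
# and enqueues one SyncBatchJob per slice.
def sync_records(pending_ids, batch_size: 500)
  pending_ids.each_slice(batch_size) do |record_ids|
    SyncBatchJob.perform_later(record_ids)
  end
end

sync_records((1..1200).to_a)
p SyncBatchJob.enqueued.map(&:size) # => [500, 500, 200]
```

Each SyncBatchJob then works through its own slice, which is why a single run can take 30s–1min.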

This has been deployed to Render on a $25/mo instance with 1 CPU and 2GB RAM. Initially, on deploy, some of the jobs execute successfully; after some time, SyncRecordsJob accumulates in the In Progress list and seems to never process. They don't seem to accumulate anymore either.

~Locally, these jobs run well and don't seem to have any issues.~ Locally, it seems that the recurring job is getting enqueued but never executed. Here's what I see in Mission Control in production for the jobs that failed:

[Screenshot: Mission Control showing the failed jobs (CleanShot, 2024-09-08)]

Questions

  1. Are there ways to debug this?
  2. Is it a matter of not having enough resources?

gczh avatar Sep 08 '24 10:09 gczh

Looks like the workers processing these jobs are crashing or being killed somehow. Can you access logs in that instance to see what it might be?

rosa avatar Sep 08 '24 10:09 rosa

Locally, it seems that the recurring job is getting enqueued but never executing.

Do you have local logs for what your Solid Queue worker is doing?

rosa avatar Sep 08 '24 10:09 rosa

Looks like the workers processing these jobs are crashing or being killed somehow. Can you access logs in that instance to see what it might be?

@rosa i'm trying that now. Looking at the docs to see how I could mute the solid_queue logs as they're drowning out my local logs 😆

gczh avatar Sep 08 '24 10:09 gczh

The config.solid_queue.silence_polling option is set to true by default, so you shouldn't see the polling queries (which are usually the noisiest). You can also set a separate logger for Solid Queue via config.solid_queue.logger, which defaults to Rails.logger. You can use a higher log level there, such as:

config.solid_queue.logger = ActiveSupport::Logger.new(STDOUT, level: :info)

rosa avatar Sep 08 '24 10:09 rosa

Or if you want to mute the Solid Queue logs completely, you can set:

config.solid_queue.logger = ActiveSupport::Logger.new(nil)

rosa avatar Sep 08 '24 10:09 rosa

@rosa thanks for those! I managed to use the logs to figure out what was happening.

What happened was that the job was trying to find records to sync, but there weren't any (due to a faulty where clause).

However, it's still odd that the job would throw a SolidQueue::Processes::ProcessExitError instead of being marked as completed.

gczh avatar Sep 08 '24 14:09 gczh

However, it's still odd that the job would throw a SolidQueue::Processes::ProcessExitError instead of being marked as completed.

Yes, that's what would happen. SolidQueue::Processes::ProcessExitError is not thrown by the job; it's raised by Solid Queue when the worker that's processing that job terminates ungracefully. In the case above, it terminated with an error exit code (1). Something external to Solid Queue is most likely killing that worker, and you might not see it locally because it may only happen on your Render instance. That's where you need to investigate to find out why the worker is crashing like that.

rosa avatar Sep 08 '24 14:09 rosa

@rosa When we're deploying a new version of our app, we kill off the queue by running sudo systemctl restart solid_queue, which effectively runs a /bin/kill -TSTP $MAINPID command that triggers a SolidQueue::Processes::ProcessExitError for any jobs that are currently executing.

Is there a more graceful way to restart the queue that we should be doing? We have all the processes (supervisor, scheduler, workers, etc) running on the one machine.

Appreciate your thoughts on graceful shutdown of executing jobs. Especially as we have some long running ones.

dglancy avatar Sep 13 '24 14:09 dglancy

Hey @dglancy! Could you send a TERM or INT signal to the process as described here? That, combined with SolidQueue.shutdown_timeout, will attempt a graceful termination first: workers won't pick up any more jobs and will try to finish what they're doing. If they can't finish in time, the jobs will be released back and shouldn't fail with a ProcessExitError.
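To make that concrete, a minimal sketch of the timeout side (assuming a standard Rails initializer; the 60-second value is an example, not a recommendation):

```ruby
# config/initializers/solid_queue.rb
# On TERM/INT the supervisor stops workers from picking up new
# jobs and waits up to this long for in-flight jobs to finish;
# anything still running after that is released back to the queue.
Rails.application.config.solid_queue.shutdown_timeout = 60.seconds
```

With systemd, that would mean relying on the default KillSignal=SIGTERM (or an explicit /bin/kill -TERM $MAINPID) rather than -TSTP for restarts, with TimeoutStopSec set comfortably above the Solid Queue timeout so systemd doesn't force-kill the workers first.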

rosa avatar Sep 13 '24 14:09 rosa

Hi @rosa I've also been experiencing this issue since I recently updated the solid_queue version. I don't have any logs either, so I'm currently investigating from various angles. If I come across any useful information, I'll definitely share it with you :)

hahwul avatar Oct 21 '24 01:10 hahwul

We got a similar error when ActiveRecord could not obtain a connection from the pool. We share a Postgres DB between the app and SQ, with 3 workers each using 20 threads and a DB pool size of 20 as well.

/usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:128:in `block in wait_poll': could not obtain a connection from the pool within 5.000 seconds (waited 5.010 seconds); all pooled connections were in use (ActiveRecord::ConnectionTimeoutError)

We're still investigating the issue, although we have a hunch that SQ might need a few more connections for its supporting processes/threads. AFAIK the pool should be per-process, but I also need to double-check that.

Anyway, my understanding is that if the worker has no access to the database, it will hardly be able to store the exception details, and the supervisor will just see the process being terminated. @rosa (👋❕) what do you think about suggesting checking the worker logs in the error message?

Process pid=8646 exited unexpectedly. Exited with status 1. Check the worker logs for further details.

elia avatar Oct 23 '24 10:10 elia

Oops, sorry for the delay here! I missed the last notification back in October 😳

About the connection pool errors: the configuration now has a validation step for the connection pool size and will show an error when you start Solid Queue if the pool is too small, so that should prevent that error, @elia. 20 threads per worker will need more than 20 connections in the pool, because the worker also needs to poll and send heartbeats, and it uses two separate connections for that.
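As a rough sketch of that sizing (a hypothetical database.yml fragment; the arithmetic — threads plus a couple of connections for polling and the heartbeat, per the comment above — is the point, not the exact numbers):

```yaml
production:
  # Per worker process: one connection per worker thread, plus
  # ~2 extra for the worker's own polling and heartbeat.
  # WORKER_THREADS is a hypothetical env var for illustration.
  pool: <%= ENV.fetch("WORKER_THREADS", 20).to_i + 2 %>
```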

Anyway my understanding is that if the worker has no access to the database it will hardly be able to store the exception details and the supervisor will just see the process being terminated.

It won't be able to store the exception details for a job, but this error will happen on the worker itself, outside the job (in your case above, it happened during polling), and the error will be reported via on_thread_error, which defaults to:

  -> (exception) { Rails.error.report(exception, handled: false) }
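If you want those worker-level errors to be more visible, the handler can be replaced in an initializer. A sketch, assuming the on_thread_error hook mentioned above is set via the standard Rails config namespace:

```ruby
# config/initializers/solid_queue.rb
# Hedged sketch: log worker-level errors (e.g. a
# ConnectionTimeoutError raised while polling) before reporting
# them, since they never reach a job's failed-execution record.
Rails.application.config.solid_queue.on_thread_error = ->(exception) do
  Rails.logger.error("[SolidQueue] #{exception.class}: #{exception.message}")
  Rails.error.report(exception, handled: false)
end
```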

rosa avatar Feb 05 '25 10:02 rosa

Going to close this one as there's nothing else to do on the solid queue side! 🙏

rosa avatar Feb 05 '25 10:02 rosa