SolidQueue crashes if database connection is lost, and takes Puma with it.

Open darinwilson opened this issue 10 months ago • 8 comments

We're running SolidQueue as a Puma plugin on a Rails 8 app, as our job processing load is currently quite small.

We recently had an incident where the server running Puma temporarily lost the connection to Postgres. This caused SolidQueue to crash with this message:

PQconsumeInput() FATAL:  terminating connection due to administrator command (PG::ConnectionBad)
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

and this in turn took down Puma:

Detected Solid Queue has gone away, stopping Puma...
- Gracefully stopping, waiting for requests to finish

I was able to reproduce this locally by shutting down Postgres after starting Rails.

When running Rails without the SolidQueue Puma plugin, if the database goes away, Rails throws an error when it tries to do something with the database, but Puma stays up and the connections recover when the database comes back online.

If I run SolidQueue separately, via bin/jobs, it also crashes if the database goes away.

Obviously SolidQueue can't be expected to do much without a database, but would it be reasonable for it to behave as Rails does when the db goes offline, i.e. pause its activity and reconnect when the db is available again?
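To make the "pause and reconnect" idea concrete, here is a minimal sketch in plain Ruby. It is not Solid Queue code: `with_reconnect` and `ConnectionLost` are hypothetical names made up for illustration, and `ConnectionLost` only stands in for an error such as `PG::ConnectionBad`. The assumption is that a long-running loop could wrap its database work in something like this, so that a lost connection means waiting with capped exponential backoff rather than exiting.

```ruby
# Hypothetical stand-in for a driver error such as PG::ConnectionBad.
class ConnectionLost < StandardError; end

# Hypothetical helper (not a Solid Queue API): runs the block and, if it
# raises one of `errors`, waits with capped exponential backoff and retries.
# Gives up and re-raises once `max_attempts` attempts have failed.
def with_reconnect(errors: [ConnectionLost], max_attempts: 5,
                   base_delay: 0.5, max_delay: 30,
                   sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *errors
    raise if attempts >= max_attempts

    # Delay doubles per failed attempt: 0.5s, 1s, 2s, ... capped at max_delay.
    delay = [base_delay * (2**(attempts - 1)), max_delay].min
    sleeper.call(delay)
    retry
  end
end

# Usage: simulate a database that is unreachable for the first two attempts.
# The sleeper is replaced with a recorder so the example runs instantly.
delays = []
result = with_reconnect(sleeper: ->(seconds) { delays << seconds }) do |attempt|
  raise ConnectionLost, "server closed the connection unexpectedly" if attempt < 3

  "recovered on attempt #{attempt}"
end
```

The retry count is bounded so that a database that never comes back still surfaces as an error; whether a supervisor should give up at some point or wait indefinitely is a design decision this sketch does not settle.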

Thanks for all your work on this - SolidQueue has been a fantastic addition to Rails!

darinwilson, Feb 07 '25 01:02

Oh, interesting. This happens for the supervisor only: if any of the supervised processes crashes, the supervisor makes sure a new one is started 🤔 I think the supervisor would need some kind of recovery mechanism if the DB fails, but it could also crash for other reasons. I think it makes sense to do this, but I won't have time in the next couple of months at least, so if someone wants to submit a PR doing this, I'll be happy to review.

rosa, Feb 10 '25 21:02

Thanks for the feedback - that's good to know that it must be something at the supervisor level.

I'll dig into the code a bit, and see if I can find a solution that might work.

darinwilson, Feb 10 '25 21:02

I had this same issue myself last week: the DB was briefly unreachable, causing SolidQueue to shut down Puma and Rails.

What are the chances of the PR getting merged soon?

asgeo1, Apr 24 '25 06:04

Oh, I completely forgot about this one, sorry! I'll take a look at the PR.

rosa, Apr 24 '25 08:04

I'm struggling to reproduce this locally so I can test the PR and an alternative approach. In all cases both Puma and Solid Queue remain running 😕 This is quite strange. @darinwilson, you said:

I was able to reproduce this locally by shutting down Postgres after starting Rails.

Are you running different PostgreSQL instances for your app and for Solid Queue? I haven't managed to reproduce with a single instance (and multiple DBs, basically the default you get in this repo). I've also tried different things with MySQL: dropping the DB, stopping the server...

rosa, Apr 24 '25 09:04

@rosa I created a minimal Rails app that demonstrates the issue (at least on my machine 😅)

It uses the same setup you mentioned (single instance, multiple DBs), with Solid Queue running inside Puma (although I also see the crash just running bin/jobs). This is on Apple silicon - not sure if that makes a difference.

I put some instructions in the README, but it's basically: start the app, see Solid Queue running in the log output, kill Postgres, watch Solid Queue and Puma crash.

Let me know if you're still not able to repro. Thanks!

darinwilson, Apr 29 '25 05:04

I just ran into this on Heroku. The database restarted, and when this happened the application generated a huge stack trace and then crashed. The stack trace ended with: Detected Solid Queue has gone away, stopping Puma....

rbclark, May 25 '25 02:05

Hello. I got an unattended upgrade this morning on my Ubuntu machine (development). PG was bumped from 16.9 to 16.10, and Solid Queue crashed the app. It restarted without problems. Running 1.2.1.

maxence33, Sep 09 '25 10:09