
`PG::ConnectionBad` error when restarting container running `solid_queue:start`

Open InteNs opened this issue 1 year ago • 15 comments

Hey there!

We are moving towards Solid Queue for our apps; we started with just one and have had it running in production for a while now.

One issue that we are seeing is a PG::ConnectionBad error being raised when the container is killed (either during a deploy or the nightly rebalancing of containers).

Is there anything we can do to stop this error from being raised? Ignoring it means we also ignore actual database connection issues when something goes wrong.

Any help with debugging this is greatly appreciated!


The error has this message:

PQconsumeInput() server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.

From what I can gather, it looks like the supervisor is trying to deregister, maybe after the connection was already closed?

Here is the full backtrace of the error:
lib/active_record/connection_adapters/postgresql_adapter.rb:894 exec_params
lib/active_record/connection_adapters/postgresql_adapter.rb:894 block (2 levels) in exec_no_cache
lib/active_record/connection_adapters/abstract_adapter.rb:1028 block in with_raw_connection
lib/active_support/concurrency/null_lock.rb:9 synchronize
lib/active_record/connection_adapters/abstract_adapter.rb:1000 with_raw_connection
lib/active_record/connection_adapters/postgresql_adapter.rb:893 block in exec_no_cache
lib/active_support/notifications/instrumenter.rb:58 instrument
lib/active_record/connection_adapters/abstract_adapter.rb:1143 log
lib/active_record/connection_adapters/postgresql_adapter.rb:892 exec_no_cache
lib/active_record/connection_adapters/postgresql_adapter.rb:872 execute_and_clear
lib/active_record/connection_adapters/postgresql/database_statements.rb:64 internal_exec_query
lib/active_record/connection_adapters/abstract/database_statements.rb:630 select
lib/active_record/connection_adapters/abstract/database_statements.rb:71 select_all
lib/active_record/connection_adapters/abstract/query_cache.rb:115 select_all
lib/active_record/relation/calculations.rb:477 block in execute_simple_calculation
lib/active_record/relation.rb:1018 skip_query_cache_if_necessary
lib/active_record/relation/calculations.rb:476 execute_simple_calculation
lib/active_record/relation/calculations.rb:433 perform_calculation
lib/active_record/relation/calculations.rb:245 calculate
lib/active_record/relation/calculations.rb:102 count
lib/active_record/associations/has_many_association.rb:84 count_records
lib/active_record/associations/collection_association.rb:218 size
lib/active_record/associations/collection_proxy.rb:783 size
app/models/solid_queue/process.rb:28 deregister
lib/solid_queue/processes/registrable.rb:27 deregister
lib/active_support/callbacks.rb:403 block in make_lambda
lib/active_support/callbacks.rb:239 block in halting_and_conditional
lib/active_support/callbacks.rb:602 block in invoke_after
lib/active_support/callbacks.rb:602 each
lib/active_support/callbacks.rb:602 invoke_after
lib/active_support/callbacks.rb:111 run_callbacks
lib/solid_queue/supervisor.rb:33 start
lib/solid_queue/supervisor.rb:12 start
lib/solid_queue/tasks.rb:4 block (2 levels) in <top (required)>
lib/rake/task.rb:281 block in execute
lib/rake/task.rb:281 each
lib/rake/task.rb:281 execute
lib/appsignal/integrations/rake.rb:15 block in execute
lib/appsignal/transaction.rb:913 instrument
lib/appsignal/helpers/instrumentation.rb:995 instrument
lib/appsignal/integrations/rake.rb:14 execute
lib/rake/task.rb:219 block in invoke_with_call_chain
lib/rake/task.rb:199 synchronize
lib/rake/task.rb:199 invoke_with_call_chain
lib/rake/task.rb:188 invoke
lib/rake/application.rb:188 invoke_task
lib/rake/application.rb:138 block (2 levels) in top_level
lib/rake/application.rb:138 each
lib/rake/application.rb:138 block in top_level
lib/rake/application.rb:147 run_with_threads
lib/rake/application.rb:132 top_level
lib/rails/commands/rake/rake_command.rb:27 block (2 levels) in perform
lib/rake/application.rb:214 standard_exception_handling
lib/rails/commands/rake/rake_command.rb:27 block in perform
lib/rails/commands/rake/rake_command.rb:44 block in with_rake
lib/rake/rake_module.rb:59 with_application
lib/rails/commands/rake/rake_command.rb:41 with_rake
lib/rails/commands/rake/rake_command.rb:20 perform
lib/rails/command.rb:156 invoke_rake
lib/rails/command.rb:73 block in invoke
lib/rails/command.rb:149 with_argv
lib/rails/command.rb:69 invoke
lib/rails/commands.rb:18 <top (required)>
/usr/local/lib/ruby/3.3.0/bundled_gems.rb:74 require
/usr/local/lib/ruby/3.3.0/bundled_gems.rb:74 block (2 levels) in replace_require
bin/rails:4 <main>
Here is the application log from boot to the eventual restart:
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | [2024-07-24T03:00:42 (process) #1][INFO] appsignal: Starting AppSignal 3.11.0 (bin/rails, Ruby 3.3.4, x86_64-linux-musl)
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | 2024-07-24 03:00:43.283382 I [T5600] Rails -- Loaded .env
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | 2024-07-24 03:00:51.482689 I [T5600] SolidQueue -- SolidQueue-0.3.4 Register Supervisor (210.7ms)  pid: 1, hostname: "staging-docker-worker-3:gate_worker_1"
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | 2024-07-24 03:00:51.571082 W [T5600] SolidQueue -- SolidQueue-0.3.4 Tried to replace forked process but it had already died (0.0ms)  pid: 11, status: 0, pid_from_status: 11, signaled: false, stopsig: nil, termsig: nil
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | 2024-07-24 03:05:53.177324 I [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Deregister Supervisor (87.6ms)  process_id: 245, pid: 1, hostname: "staging-docker-worker-3:gate_worker_1", last_heartbeat_at: Wed, 24 Jul 2024 02:59:30.760765000 CEST +02:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 245, kind: "Supervisor", last_heartbeat_at: "2024-07-24 02:59:30.760765000 +0200", supervisor_id: nil, pid: 1, hostname: "staging-docker-worker-3:gate_worker_1", metadata: {}, created_at: "2024-07-23 15:58:24.050132000 +0200">, :pruned=>true, :claimed_size=>0}

gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | 2024-07-24 12:48:42.267770 I [T5600] SolidQueue -- SolidQueue-0.3.4 Supervisor terminated gracefully (625.4ms)  supervisor_pid: 1, supervised_pids: [28, 31]
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | bin/rails aborted!
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | ActiveRecord::ConnectionFailed: PQconsumeInput() server closed the connection unexpectedly (ActiveRecord::ConnectionFailed)
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | 	This probably means the server terminated abnormally
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | 	before or while processing the request.
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    |
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    |
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | Caused by:
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly (PG::ConnectionBad)
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | 	This probably means the server terminated abnormally
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | 	before or while processing the request.
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    |
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | Tasks: TOP => solid_queue:start
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | (See full trace by running task with --trace)

Here is our configuration:

  • We tried to rescue the error with on_thread_error, but that doesn't catch it.
  • We tried the pidfile because we are seeing a warning on start: W [T6860] SolidQueue -- SolidQueue-0.3.4 Tried to replace forked process but it had already died (0.0ms) pid: 13, status: 0, pid_from_status: 13, signaled: false, stopsig: nil, termsig: nil. Now that I think about it, this warning may be the result of the error in the deregister call.
  SolidQueue.supervisor_pidfile = "/tmp/solid_queue_supervisor.pid"
  SolidQueue.on_thread_error = proc do |error|
    case error
    when ActiveRecord::ConnectionFailed
      Rails.logger.error "SolidQueue: #{error.message}"
    else
      Rails.error.report(error, handled: false)
    end
  end

InteNs avatar Jul 24 '24 10:07 InteNs

Hey @InteNs, thanks for the detailed report. I'll look into this!

rosa avatar Jul 24 '24 11:07 rosa

@InteNs, would you happen to have some more logs right before the restart? That is, a few more lines before this:

gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | 2024-07-24 12:48:42.267770 I [T5600] SolidQueue -- SolidQueue-0.3.4 Supervisor terminated gracefully (625.4ms)  supervisor_pid: 1, supervised_pids: [28, 31]
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | bin/rails aborted!

rosa avatar Jul 24 '24 19:07 rosa

Hmm, those are all the (info-level) logs there are. Above the blank line was the container start at 03:00 AM, and below it the restart; nothing was logged in between. I will set the log level to debug and trigger a restart, and I'll add those logs here in a minute!

Is it possible the forked processes are not logging to the same logger? I know that for Puma we have an on_worker_boot callback to reopen the logs:

on_worker_boot do
  # Re-open appenders after forking the process
  SemanticLogger.reopen
end

I've already set SemanticLogger as the logger for Solid Queue: SolidQueue.logger = SemanticLogger[SolidQueue]
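
If the missing debug lines were caused by appenders not being reopened in the forked processes, one generic way to mirror that Puma hook would be to reopen SemanticLogger after any fork via Rails' fork tracking. This is only a sketch, assuming ActiveSupport::ForkTracker (an internal Rails API) behaves as expected; it's not something Solid Queue documents:

  # Hypothetical initializer, e.g. config/initializers/semantic_logger_forks.rb
  # Re-open SemanticLogger appenders in every process forked from this one,
  # similar to Puma's on_worker_boot hook above.
  ActiveSupport::ForkTracker.after_fork do
    SemanticLogger.reopen
  end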

InteNs avatar Jul 24 '24 20:07 InteNs

@rosa Okay, so here are the debug logs. I couldn't get the exception to trigger this time, weird...

Makes it seem like it's maybe something that only happens after it's been running a while? Or something to do with forks?

What is weird about these logs is that after running for 5 minutes the supervisor gets killed and replaced by another process or something? This might be related to an idle database connection timeout, maybe? (If I leave a Rails console open for 5 minutes, I need to reload! in order to get a new DB connection.)

debug logs
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:47.278559 I [T5600] Rails -- Loaded .env
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:54.882022 D [T5600] (0.824ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>7, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:54.883978 D [T5600] (2.831ms) ActiveRecord -- SolidQueue::Process Create -- {:sql=>"INSERT INTO \"solid_queue_processes\" (\"kind\", \"last_heartbeat_at\", \"supervisor_id\", \"pid\", \"hostname\", \"metadata\", \"created_at\") VALUES ($1, $2, $3, $4, $5, $6, $7) RETURNING \"id\"", :binds=>{:kind=>"Supervisor", :last_heartbeat_at=>"2024-07-24 20:45:54.686520", :supervisor_id=>nil, :pid=>1, :hostname=>"staging-docker-worker-3:gate_worker_1", :metadata=>nil, :created_at=>"2024-07-24 20:45:54.879627"}, :allocations=>140, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:54.968576 D [T5600] (76.7ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>8, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:54.968846 I [T5600] SolidQueue -- SolidQueue-0.3.4 Register Supervisor (282.3ms)  pid: 1, hostname: "staging-docker-worker-3:gate_worker_1"
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:54.977131 W [T5600] SolidQueue -- SolidQueue-0.3.4 Tried to replace forked process but it had already died (0.0ms)  pid: 12, status: 0, pid_from_status: 12, signaled: false, stopsig: nil, termsig: nil
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:55.768863 D [Tio-worker-1] (199.6ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"last_heartbeat_at\" <= $1 ORDER BY \"solid_queue_processes\".\"id\" ASC LIMIT $2 FOR UPDATE SKIP LOCKED", :binds=>{:last_heartbeat_at=>"2024-07-24 20:40:55.471181", :limit=>50}, :allocations=>2601, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:55.769637 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Prune dead processes (593.5ms)  size: 0
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:56.275987 D [Tio-worker-1] (2.610ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>8, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:56.278357 D [Tio-worker-1] (1.557ms) ActiveRecord -- SolidQueue::ClaimedExecution Load -- {:sql=>"SELECT \"solid_queue_claimed_executions\".* FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>42, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:56.380089 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Release claimed jobs (203.2ms)  size: 0
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:46:54.973618 D [Tio-worker-1] (0.950ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:46:54.974810 D [Tio-worker-1] (2.207ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:46:54.971098", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:46:54.976455 D [Tio-worker-1] (1.387ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:47:54.979258 D [Tio-worker-1] (0.968ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:47:54.980239 D [Tio-worker-1] (2.028ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:47:54.977515", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:47:54.982088 D [Tio-worker-1] (1.617ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:48:54.989100 D [Tio-worker-1] (1.302ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:48:54.991324 D [Tio-worker-1] (3.472ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:48:54.985011", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:48:54.993745 D [Tio-worker-1] (2.011ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:49:55.003006 D [Tio-worker-1] (1.238ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:49:55.005151 D [Tio-worker-1] (3.673ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:49:54.998115", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:49:55.007899 D [Tio-worker-1] (2.102ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}


gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:55.015550 D [Tio-worker-1] (1.146ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:55.017647 D [Tio-worker-1] (3.607ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:50:55.011764", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:55.020281 D [Tio-worker-1] (2.034ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.387335 D [Tio-worker-1] (2.713ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"last_heartbeat_at\" <= $1 ORDER BY \"solid_queue_processes\".\"id\" ASC LIMIT $2 FOR UPDATE SKIP LOCKED", :binds=>{:last_heartbeat_at=>"2024-07-24 20:45:56.381719", :limit=>50}, :allocations=>9, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.473909 D [Tio-worker-1] (1.148ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>251}, :allocations=>8, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.477218 D [Tio-worker-1] (1.161ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>44, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.479746 D [Tio-worker-1] (0.609ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"supervisor_id\" = $1", :binds=>{:supervisor_id=>251}, :allocations=>7, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.481572 D [Tio-worker-1] (1.059ms) ActiveRecord -- SolidQueue::Process Destroy -- {:sql=>"DELETE FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"id\" = $1", :binds=>{:id=>251}, :allocations=>46, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.483263 D [Tio-worker-1] (0.831ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>251}, :allocations=>46, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.484836 D [Tio-worker-1] (0.616ms) ActiveRecord -- SolidQueue::ClaimedExecution Load -- {:sql=>"SELECT \"solid_queue_claimed_executions\".* FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>251}, :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.485096 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Release claimed jobs (3.0ms)  size: 0
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.486440 D [Tio-worker-1] (1.221ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>67, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.489142 I [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Deregister Supervisor (12.3ms)  process_id: 251, pid: 1, hostname: "staging-docker-worker-3:gate_worker_1", last_heartbeat_at: Wed, 24 Jul 2024 22:41:33.118215000 CEST +02:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 251, kind: "Supervisor", last_heartbeat_at: "2024-07-24 22:41:33.118215000 +0200", supervisor_id: nil, pid: 1, hostname: "staging-docker-worker-3:gate_worker_1", metadata: {}, created_at: "2024-07-24 12:49:27.575597000 +0200">, :pruned=>true, :claimed_size=>0}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.489220 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Prune dead processes (107.8ms)  size: 1
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.491929 D [Tio-worker-1] (1.414ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>85, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.493298 D [Tio-worker-1] (0.686ms) ActiveRecord -- SolidQueue::ClaimedExecution Load -- {:sql=>"SELECT \"solid_queue_claimed_executions\".* FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>42, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.493751 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Release claimed jobs (3.9ms)  size: 0

I'm gonna keep restarting to see if I can get the exception to trigger again.

Edit:

Managed to trigger it again by killing the container:

gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:39:23.094078 D [Tio-worker-1] (1.658ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:23.103831 D [Tio-worker-1] (1.196ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:23.108470 D [Tio-worker-1] (5.944ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 21:40:23.101274", :id=>269}, :allocations=>83, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:23.110399 D [Tio-worker-1] (1.582ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:50.970872 I [T5600] SolidQueue -- SolidQueue-0.3.4 Supervisor terminated gracefully (412.3ms)  supervisor_pid: 1, supervised_pids: [124, 136]
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | bin/rails aborted!
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:51.171604 D [T5600] (81.7ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>269}, :allocations=>123, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | ActiveRecord::ConnectionFailed: PQconsumeInput() server closed the connection unexpectedly (ActiveRecord::ConnectionFailed)
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 	This probably means the server terminated abnormally
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 	before or while processing the request.
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | Caused by:
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly (PG::ConnectionBad)
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 	This probably means the server terminated abnormally
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 	before or while processing the request.
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | Tasks: TOP => solid_queue:start
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | (See full trace by running task with --trace)

Weird how it's fetching claimed executions after saying that the supervisor was terminated gracefully, right?

Also, that debug entry with the query happening while the exception is being logged suggests it maybe comes from another process?

Might this be because the supervisor is actually PID 1? I will try to run this container with --init and see if that does anything.

Edit: it seems adding init: true to our compose file got rid of the warning we were getting on startup every time: W [T5600] SolidQueue -- SolidQueue-0.3.4 Tried to replace forked process but it had already died (0.0ms).

Will need to wait and see if it also gets rid of the exception.

InteNs avatar Jul 24 '24 21:07 InteNs

Thanks a lot, @InteNs, that's very helpful 🙏

Weird how it's fetching claimed executions after saying that the supervisor was terminated gracefully, right?

I think that's the query to calculate how many claimed executions processes still have when they're terminated, for logging purposes, after the shutdown has completed 🤔

What is weird about these logs is that after running for 5 minutes the supervisor gets killed and replaced by another process or something?

I think I know what's going on. Looks like the supervisor that's apparently getting killed there is not the current supervisor. It's the old one (only its record in the DB) that's staying around from a previous run, and it gets cleaned up when its heartbeat expires 🤔 Looking at the IDs in the queries:

gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:49:55.005151 D [Tio-worker-1] (3.673ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>
{:last_heartbeat_at=>"2024-07-24 20:49:54.998115", :id=>260}, :allocations=>83, :cached=>nil}

This is the heartbeat for process with ID 260, which is the current supervisor when these logs are happening. Then, we see the following two queries:

gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.479746 D [Tio-worker-1] (0.609ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"supervisor_id\" = $1", 
:binds=>{:supervisor_id=>251}, :allocations=>7, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.481572 D [Tio-worker-1] (1.059ms) ActiveRecord -- SolidQueue::Process Destroy -- {:sql=>"DELETE FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"id\" = $1", 
:binds=>{:id=>251}, :allocations=>46, :cached=>nil}

There we have ID 251 for the process that's being deleted, and supervisor ID 251 for the SELECT query that tries to fetch its supervised processes to deregister them as well (looks like there aren't any, otherwise we'd see the DELETE queries for them as well).

This tells us that the supervisor from the previous run failed to be correctly deregistered. I suspect this must be because of the error you're getting precisely while logging around deregistering 🤔

gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:50.970872 I [T5600] SolidQueue -- SolidQueue-0.3.4 Supervisor terminated gracefully (412.3ms)  supervisor_pid: 1, supervised_pids: [124, 136]
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | bin/rails aborted!
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:51.171604 D [T5600] (81.7ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>269}, :allocations=>123, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | ActiveRecord::ConnectionFailed: PQconsumeInput() server closed the connection unexpectedly (ActiveRecord::ConnectionFailed)
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |  This probably means the server terminated abnormally
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |  before or while processing the request.
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |

So it looks like the connection is closed before the deregistering takes place, hence the error and the supervisor record hanging around... I'll see how to improve this; it's actually quite tight to do the deregistration after the shutdown if the shutdown uses up all of the shutdown timeout.

May I ask how the restart of your containers happens? Do the running processes get sent a TERM signal and then some time to wait before being killed? That might be playing a role here too, if that time is shorter than SolidQueue.shutdown_timeout.

rosa avatar Jul 25 '24 07:07 rosa

May I ask how the restart of your containers happens?

Sure! We use the default Docker Swarm settings, so that would be SIGTERM followed by a SIGKILL after 10 seconds. Solid Queue has a shutdown_timeout of 5 seconds, so I would assume there is plenty of time?

I will increase both solid queue and docker timeout to see if it helps.
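
For illustration, the Solid Queue side of that could be a one-liner in an initializer. This is just a sketch, assuming shutdown_timeout is writable the same way as supervisor_pidfile above, and keeping it below Docker's stop grace period:

  # Hypothetical config/initializers/solid_queue.rb
  # Give the supervisor more time to shut down, while staying under the
  # container's SIGKILL deadline (e.g. 15 seconds against a 20-second grace period).
  SolidQueue.shutdown_timeout = 15.seconds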

I can also confirm that adding init: true (thus not running the supervisor as PID 1) did not solve the exception, only the warning at startup.

InteNs avatar Jul 25 '24 08:07 InteNs

@rosa Could the cause possibly be that when the supervisor kills its forked processes, those processes terminate the database socket connection that is shared between the different processes? Or does each fork get a separate database connection?

https://stackoverflow.com/questions/13089875/fork-ruby-activerecord-and-file-descriptors-on-fork https://stackoverflow.com/questions/27892975/what-causes-activerecord-breaking-postgres-connection-after-forking

I'm getting the same error when my console is idle for too long and Postgres kills the connection.

InteNs avatar Jul 25 '24 12:07 InteNs

Sure! We use the default Docker Swarm settings, so that would be SIGTERM followed by a SIGKILL after 10 seconds. Solid Queue has a shutdown_timeout of 5 seconds, so I would assume there is plenty of time?

Yeah, definitely, that should work fine. In my tests of this I used 2 seconds as shutdown_timeout and that was fine, so it seems this should work. Perhaps something takes longer here...

those processes terminate the database socket connection that is shared between the different processes? Or does each fork get a separate database connection?

https://stackoverflow.com/questions/13089875/fork-ruby-activerecord-and-file-descriptors-on-fork https://stackoverflow.com/questions/27892975/what-causes-activerecord-breaking-postgres-connection-after-forking

Ohhh, good question! I think that from Rails 5.2 onwards, connections across forks are handled seamlessly by Rails, so that shouldn't be the case. In any case, I'm going to add some tests for this, where forked processes get terminated independently from the supervisor, to confirm it works fine.
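
For context, the classic manual fork-safety pattern that Rails now takes care of automatically looks roughly like this. A generic, standalone sketch only, not Solid Queue's actual implementation:

  # Generic illustration: drop pooled connections before forking so the parent's
  # PG socket is never shared, and let the child open its own connection.
  ActiveRecord::Base.connection_pool.disconnect!

  pid = fork do
    ActiveRecord::Base.connection_pool.with_connection do |connection|
      connection.execute("SELECT 1") # the child lazily opens a fresh connection here
    end
  end

  Process.wait(pid)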

I imagine this is not the case, but is there anything else external you might be running that could be closing all PostgreSQL connections from your job containers when there's a restart?

I'll keep digging, and in any case I'm going to see if I can improve the shutdown + deregistering process.

rosa avatar Jul 25 '24 13:07 rosa

OK, I've got some changes ready around shutdown + deregistering that will go out as part of a larger change. However, I realised something... in this comment you mentioned:

Managed to trigger it again by killing the container:

How is that "killing" done? From the logs, it seems a SIGTERM was sent, but was a SIGKILL also sent right after, without waiting? Or did this follow the "SIGTERM followed by a SIGKILL after 10 seconds" setup?

rosa avatar Jul 26 '24 12:07 rosa

Oh, I've been testing a combination of SIGKILL and SIGTERM; they both sometimes trigger the exception and sometimes not...

Manually killing worker and dispatcher processes works fine, and the supervisor starts new ones like it's supposed to. Manually killing the supervisor (both TERM and KILL) sometimes triggers the error and sometimes not.

The manual killing was done in a shell inside the container.

However, in our production setup the restarting is managed by Docker, so there is the 10-second (now 20-second) wait before the final SIGKILL.

Looking forward to trying those changes!

InteNs avatar Jul 26 '24 14:07 InteNs

@InteNs thanks so much for doing these tests! I should have them ready early next week.

rosa avatar Jul 26 '24 16:07 rosa

Hey @InteNs, I opened https://github.com/rails/solid_queue/pull/269, which includes a bunch of changes to how the supervisor runs, if you want to try them. The main change is unrelated to this, but I refactored the supervisor's cycle quite a bit. There's a breaking change that might not affect you, depending on how you're running Solid Queue. It's described here.

This also reminded me that I wanted to ask how you were running Solid Queue: is it via Docker's CMD? And if so, what's the command? Is it just bin/rails solid_queue:start? Wondering if there's something there I'm missing.

rosa avatar Jul 29 '24 19:07 rosa

Hey! I will try this out soon!

is it via Docker's CMD?

yup!

Is it just bin/rails solid_queue:start?

yup!

InteNs avatar Jul 30 '24 16:07 InteNs

@rosa We've been running #269 on one of our apps for the entire weekend and haven't seen the error yet! It has definitely been rebooted a few times, and just now manually too. Very smooth now!

InteNs avatar Aug 05 '24 09:08 InteNs

Ohh, great news! We've been running that branch also for quite a few days in production for HEY and everything has worked well, so I'm going to merge that now and ship a new version. I'll keep this one open in case you run into this error again.

Thanks a lot again! 🙏

rosa avatar Aug 05 '24 10:08 rosa

Hey @InteNs, did you encounter this error again?

rosa avatar Aug 22 '24 19:08 rosa

Hey @rosa, I have not; we can consider this fixed, thanks so much!

I'm curious whether this error returns with version 0.6, but if it does, it's probably easy to catch and retry with Active Job's retry_on mechanism!
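
For example, a minimal sketch of that fallback, assuming the error still surfaces from job code as ActiveRecord::ConnectionFailed:

  class ApplicationJob < ActiveJob::Base
    # If the DB connection drops mid-job, let Active Job retry the job
    # instead of failing it outright.
    retry_on ActiveRecord::ConnectionFailed, wait: 5.seconds, attempts: 3
  end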

InteNs avatar Aug 28 '24 07:08 InteNs