MySQL Connection Lost
Not sure how this isn't an issue for a larger group of people, but here is our experience:
- Jobs are running successfully
- MySQL connection is temporarily lost, or unavailable to the worker
- Worker does not attempt to re-establish connection, and is thus an orphaned process not working off any jobs
- Devs have to manually restart delayed_job workers any time this happens
We've tried adding some ensure_db_connection logic throughout this gem and the delayed_job_active_record gem (a slight modification of https://github.com/bracken/delayed_job/commit/b73a7e561b89fc09ae77bb9c1cfc680062867e19), but to no avail.
Anyone else suffering from this insidious bug?
Are you using reconnect: true in the database.yml?
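For reference, a minimal database.yml sketch with that flag (adapter and database names here are placeholders):

production:
  adapter: mysql2
  database: myapp_production   # placeholder
  reconnect: true              # ask mysql2 to reconnect after a dropped connection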
Also, Delayed::Job.recover_from(error) provides you with a hook to try and reset the connection.
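For example, a minimal sketch of overriding that hook (untested; verify! is the standard ActiveRecord adapter method that reconnects if the connection is no longer active):

class Delayed::Job
  # Try to bring the connection back before the next poll cycle.
  def self.recover_from(_error)
    ActiveRecord::Base.connection.verify!
  rescue StandardError
    nil # server still down; the next poll will try again
  end
end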
@betamatt we tried with reconnect: true in database.yml, but to no avail. Our test process mimics the bullets in the initial comment.
@albus522, we'll try that next and update with our findings. Still pretty surprised this isn't a more commonly complained-about issue.
Hey, I'm actually @jwg2s's colleague and we've been working on this together, but still have no resolution.
@albus522, regarding the use of recover_from: where it's placed (in the rescue block of reserve_job) isn't where the lost-connection exception was being thrown. It was being thrown in the run method. As such, here's what we had for run:
def run(job)
  job_say job, 'RUNNING'
  runtime = Benchmark.realtime do
    say "Beginning to run job"
    Timeout.timeout(self.class.max_run_time.to_i, WorkerTimeout) { job.invoke_job }
    job.destroy
  end
  job_say job, format('COMPLETED after %.4f', runtime)
  return true # did work
rescue DeserializationError => error
  job.last_error = "#{error.message}\n#{error.backtrace.join("\n")}"
  failed(job)
rescue => error
  # If connection is dead, re-establish connection
  if !ActiveRecord::Base.connection.active?
    ensure_db_connection
    retry
  end
  self.class.lifecycle.run_callbacks(:error, self, job) { handle_failed_job(job, error) }
  return false # work failed
end
ensure_db_connection is similar to https://github.com/bracken/delayed_job/commit/b73a7e561b89fc09ae77bb9c1cfc680062867e19, which @jwg2s mentioned above:
def ensure_db_connection
  say "ABOUT TO TEST DB CONNECTION"
  begin
    say "TESTING CONNECTION"
    ActiveRecord::Base.connection.execute("select 'I am alive'")
  rescue ActiveRecord::StatementInvalid
    say "CONNECTION IS DEAD"
    while !ActiveRecord::Base.connection.active? do
      ActiveRecord::Base.connection.reconnect!
      say "TRYING TO CONNECT..."
      sleep(5)
    end
  end
end
ensure_db_connection enters its rescue block, but never appears to enter the while loop, since TRYING TO CONNECT... never gets logged.
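One plausible explanation (not confirmed here): reconnect! is the first statement inside the while loop and will itself raise while the server is still unreachable, so the exception escapes ensure_db_connection before the say line ever runs, since the surrounding rescue only catches ActiveRecord::StatementInvalid. A reordered sketch that logs first and retries the reconnect (hypothetical, untested):

def ensure_db_connection
  ActiveRecord::Base.connection.execute("select 1")
rescue ActiveRecord::StatementInvalid
  say "CONNECTION IS DEAD"
  begin
    say "TRYING TO CONNECT..."                # log before attempting
    ActiveRecord::Base.connection.reconnect!
  rescue StandardError => e                   # reconnect! raises while the server is down
    say "RECONNECT FAILED: #{e.class}"
    sleep(5)
    retry
  end
end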
Based on a few samples, I've found a 100% correlation between the occasions on which delayed_job needed to be restarted and the dates on which MySQL was restarted. By way of confirmation, I manually restarted MySQL, and sure enough, the next morning the cron job reported that delayed_job needed to be restarted.
I'm experiencing this too. Using MySQL with Delayed Job seems really wonky. The handler value going into the database is incredibly long in my case.
I'm experiencing this as well, and it's killing us. We have very long jobs, and if the MySQL connection dies, the record stays locked and in limbo. Right now we're writing a sweeper to go through the DB and delete jobs that have been running longer than a certain time (a sketch follows below).
Could do with a fix here!
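A hypothetical sweeper along those lines (untested; column names follow delayed_job_active_record's standard schema, and unlocking rather than deleting lets another worker retry the job):

# Unlock jobs whose lock is older than max_run_time; a worker that lost
# its connection can never complete or release them itself.
stale = Delayed::Job.where("locked_at IS NOT NULL AND locked_at < ?",
                           Time.now - Delayed::Worker.max_run_time)
stale.update_all(locked_by: nil, locked_at: nil)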
+1
FWIW, we are facing a similar issue with Postgres when apt auto-upgrades and thus stops/starts Postgres on our production server.