delayed_job

MySQL Connection Lost

Open jwg2s opened this issue 10 years ago • 9 comments

Not sure how this isn't an issue for a larger group of people, but here is our experience:

  • Jobs are running successfully
  • MySQL connection is temporarily lost, or unavailable to the worker
  • Worker does not attempt to re-establish connection, and is thus an orphaned process not working off any jobs
  • Devs have to manually restart delayed job workers any time this happens

We've tried adding some ensure_db_connection logic throughout this gem and the delayed_job_active_record gem, but to no avail. It was a slight modification of this commit: https://github.com/bracken/delayed_job/commit/b73a7e561b89fc09ae77bb9c1cfc680062867e19

Anyone else suffering from this insidious bug?

jwg2s avatar Aug 05 '14 15:08 jwg2s

Are you using reconnect: true in the database.yml?

betamatt avatar Aug 05 '14 16:08 betamatt

Also Delayed::Job.recover_from(error) provides you with a hook to try and reset the connection

albus522 avatar Aug 05 '14 16:08 albus522

@betamatt we tried with reconnect: true in database.yml, but to no avail. Our test process follows the steps listed in the initial comment.

@albus522, we'll try that next and update with our findings. Still pretty surprised this isn't a more commonly reported issue.

jwg2s avatar Aug 05 '14 17:08 jwg2s

Hey, I'm actually @jwg2s's colleague and we've been working on this together, but still have no resolution.

@albus522, regarding recover_from: it is called from the rescue block of reserve_job, which isn't where the lost connection exception was being thrown. It was being thrown in the run method. As such, here's what we had for run:

    def run(job)
      job_say job, 'RUNNING'
      runtime = Benchmark.realtime do
        say "Beginning to run job"
        Timeout.timeout(self.class.max_run_time.to_i, WorkerTimeout) { job.invoke_job }
        job.destroy
      end
      job_say job, format('COMPLETED after %.4f', runtime)
      return true  # did work
    rescue DeserializationError => error
      job.last_error = "#{error.message}\n#{error.backtrace.join("\n")}"
      failed(job)
    rescue => error
      # If connection is dead, re-establish connection
      if !ActiveRecord::Base.connection.active?
        ensure_db_connection
        retry
      end
      self.class.lifecycle.run_callbacks(:error, self, job) { handle_failed_job(job, error) }
      return false  # work failed
    end

ensure_db_connection is similar to https://github.com/bracken/delayed_job/commit/b73a7e561b89fc09ae77bb9c1cfc680062867e19, which @jwg2s mentioned above:

    def ensure_db_connection
      say "ABOUT TO TEST DB CONNECTION"
      begin
        say "TESTING CONNECTION"
        ActiveRecord::Base.connection.execute("select 'I am alive'")
      rescue ActiveRecord::StatementInvalid
        say "CONNECTION IS DEAD"
        while !ActiveRecord::Base.connection.active? do
          ActiveRecord::Base.connection.reconnect!
          say "TRYING TO CONNECT..."
          sleep(5)
        end
      end
    end

ensure_db_connection enters its rescue block, but never appears to enter the while block since TRYING TO CONNECT... never gets logged.
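
For comparison, here is a minimal sketch of a bounded reconnect loop that calls reconnect! unconditionally instead of gating each attempt on active?. The thread does not confirm that this fixes the problem. FakeConnection and reconnect_with_retries are hypothetical names used only for illustration; the only methods assumed on the connection are reconnect! and active?, which the code above already calls on ActiveRecord::Base.connection.

```ruby
# Stand-in for a database connection; not part of ActiveRecord.
# It fails a fixed number of reconnect attempts before succeeding.
class FakeConnection
  attr_reader :attempts

  def initialize(failures_before_success)
    @failures_left = failures_before_success
    @attempts = 0
    @alive = false
  end

  def reconnect!
    @attempts += 1
    if @failures_left > 0
      @failures_left -= 1
      raise IOError, 'server unavailable'
    end
    @alive = true
  end

  def active?
    @alive
  end
end

# Hypothetical helper: try to reconnect up to max_attempts times,
# pausing between attempts. Returns true once the connection reports
# active, false if every attempt failed.
def reconnect_with_retries(connection, max_attempts: 5, wait: 5, sleeper: method(:sleep))
  max_attempts.times do |i|
    begin
      connection.reconnect!
      return true if connection.active?
    rescue StandardError => e
      # Log the failure and fall through to the next attempt.
      warn "reconnect attempt #{i + 1} failed: #{e.message}"
    end
    # No need to pause after the final attempt.
    sleeper.call(wait) unless i == max_attempts - 1
  end
  false
end
```

In a worker, the connection argument would presumably be ActiveRecord::Base.connection, and a false return could be treated as a reason to exit so a process supervisor restarts the worker, rather than retrying forever.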

jhaber1 avatar Aug 05 '14 19:08 jhaber1

Based on a few samples, I've found 100% correlation between the occasions on which delayed_job needed to be restarted and dates on which MySQL was restarted. And by way of confirmation, I manually restarted MySQL and sure enough the next morning the cron job reported that the delayed_job needed to be restarted.

wbreeze avatar Sep 11 '14 19:09 wbreeze

I'm experiencing this too. Seems like using MySQL with Delayed Job is really wonky. The handler value that is going into the database is incredibly long in my case.

gregblass avatar Dec 18 '16 19:12 gregblass

I'm experiencing this as well and it's killing us. We have very long jobs and if the MySQL connection dies the record stays locked and in limbo. Right now we're writing a sweeper to go through the db and delete jobs that have been running longer than a certain time.
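
A rough sketch of the selection logic such a sweeper might use, working on in-memory rows rather than the real table. It assumes the locked_at and locked_by columns that delayed_job uses to mark a job as reserved. JobRow, stale_locked_jobs and release_lock are hypothetical names, and this version releases the lock instead of deleting the job, which is a different choice from the one described above.

```ruby
# Hypothetical in-memory stand-in for a row in the delayed_jobs table.
JobRow = Struct.new(:id, :locked_at, :locked_by)

# Returns the rows whose lock is older than max_lock_age seconds as of `now`.
# Rows that are not locked (locked_at is nil) are never considered stale.
def stale_locked_jobs(jobs, max_lock_age:, now: Time.now)
  cutoff = now - max_lock_age
  jobs.select { |job| job.locked_at && job.locked_at < cutoff }
end

# Clears the lock fields so another worker could reserve the job again.
def release_lock(job)
  job.locked_at = nil
  job.locked_by = nil
  job
end
```

Against the real table the same cutoff comparison would be expressed as a query on locked_at, and the threshold would need to be longer than the longest legitimate job so that healthy long-running jobs are not swept.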

Could do with a fix here!

philsmy avatar Mar 28 '17 22:03 philsmy

+1

master-of-null avatar Jun 27 '18 22:06 master-of-null

FWIW, we are facing a similar issue with Postgres when apt auto-upgrades and thus stops/starts Postgres on our production server.

smoyte avatar Feb 19 '20 18:02 smoyte