stringer

Improve robustness of rake work_jobs

zlogic opened this issue 9 years ago • 3 comments

First of all, many thanks on creating this awesome RSS reader! Just what I was looking for and was about to create myself, but Stringer is so much better than anything I wanted :)

I installed it on Heroku and noticed a reliability problem with the Heroku-specific rake lazy_fetch approach:

While bundle exec rake work_jobs is launched from config/unicorn.rb, it is never checked to be running, and its delayed_job_pid is never used. Heroku seems to kill long-running database connections, either on purpose or just randomly, and the resulting unhandled exception kills the bundle exec rake work_jobs process (example log). rake lazy_fetch keeps submitting jobs, but the worker process is dead and is never restarted.
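Just to illustrate what "never checked" means here: a liveness check in config/unicorn.rb could look roughly like the sketch below. This is purely illustrative, not a proposed fix -- it reuses the existing @delayed_job_pid, and since before_fork only runs when unicorn forks a worker, it still wouldn't notice a crash that happens between forks:

before_fork do |_server, _worker|
  # as there's no need for the master process to hold a connection
  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)

  # Signal 0 delivers nothing; it only raises Errno::ESRCH once the
  # process no longer exists, so a dead worker clears the saved pid.
  if @delayed_job_pid
    begin
      Process.kill(0, @delayed_job_pid)
    rescue Errno::ESRCH
      @delayed_job_pid = nil # worker is gone; allow the respawn below
    end
  end

  @delayed_job_pid ||= spawn("bundle exec rake work_jobs")

  sleep 1
end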

It seems the best way would be to run rake work_jobs from a Procfile (a rough sketch of that is included after the workarounds below). I've tried two other workarounds and can confirm they work:

  1. Restart the rake process in bash:
diff --git config/unicorn.rb config/unicorn.rb
--- config/unicorn.rb
+++ config/unicorn.rb
@@ -9,7 +9,7 @@ before_fork do |_server, _worker|
   # as there's no need for the master process to hold a connection
   ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)

-  @delayed_job_pid ||= spawn("bundle exec rake work_jobs")
+  @delayed_job_pid ||= spawn("while true; do bundle exec rake work_jobs; sleep 60; done")

   sleep 1
 end
  2. Catch exceptions in the work_jobs task and try to reconnect:
diff --git Rakefile Rakefile
--- Rakefile
+++ Rakefile
@@ -46,11 +46,25 @@ desc "Work the delayed_job queue."
 task :work_jobs do
   Delayed::Job.delete_all

-  3.times do
-    Delayed::Worker.new(
-      min_priority: ENV["MIN_PRIORITY"],
-      max_priority: ENV["MAX_PRIORITY"]
-    ).start
+  logger = Logger.new(STDOUT)
+  logger.level = Logger::DEBUG
+
+  loop do
+    begin
+      logger.info "(Re-)starting worker"
+      if !ActiveRecord::Base.connection.active?
+        logger.info "Restarting DB connection"
+        ActiveRecord::Base.connection.reconnect!
+      end
+      Delayed::Worker.new(
+        min_priority: ENV["MIN_PRIORITY"],
+        max_priority: ENV["MAX_PRIORITY"]
+      ).start
+    rescue => e
+      logger.error e.message
+      e.backtrace.each { |line| logger.error line }
+      sleep 60
+    end
   end
 end
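
For comparison, the Procfile approach mentioned at the top would boil down to something like this (a rough sketch -- the web line is a guess based on the existing config/unicorn.rb, only the work_jobs task itself comes from the Rakefile):

web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb
worker: bundle exec rake work_jobs

With a worker entry in the Procfile, Heroku supervises and restarts the process itself, so a crashed rake work_jobs doesn't stay dead until the next web dyno restart.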

zlogic, Feb 09 '16 20:02

Yeah lazy_fetch is a big hack. The correct way, as you mentioned, is to run a separate dyno to work jobs -- but this means we can't run on the free plan (only 1 dyno).

Were the exceptions stopping your instance from updating the feeds? I agree that the robustness is not ideal, but I've not experienced issues where my Heroku instance stopped updating.

swanson, Feb 09 '16 21:02

Actually, it seems that Heroku counts a pair of worker + web as one "dyno":

  • https://devcenter.heroku.com/articles/dyno-sleeping
  • https://www.heroku.com/pricing

I run some other apps on the free tier with a worker + web combo and they work without major problems.

zlogic, Feb 09 '16 21:02

In my case, the worker process stopped after 10-40 minutes, which also stopped feed updates. The only way to restart it was to wait for the web dyno to go to sleep, so that waking it up would restart both the web process and the worker. It looked something like this:

  1. Open Stringer, starting the web and worker processes
  2. Worker process crashes
  3. Timer submits job, but worker is no longer active
  4. Open Stringer again, nothing happens (worker is still crashed)
  5. Wait until the dyno sleeps
  6. Open Stringer again, which clears the job queue and starts the worker
  7. With luck, the timer submits the jobs to the worker before it crashes again

Maybe it was just a bad day and Heroku was installing security patches or something...

I also had the timer running hourly; maybe it was submitting jobs while the dynos were being put to sleep. Setting it to run every 10 minutes seems to have improved things a bit.

zlogic, Feb 09 '16 21:02