resque-retry icon indicating copy to clipboard operation
resque-retry copied to clipboard

Dirty exits during `on_failure` hooks do not consider retry limit properly

Open ajvondrak opened this issue 8 months ago • 0 comments

Discovered a hairy issue! It involved a couple moving parts:

  1. We had a job that was always raising a particular exception in perform due to some bad data.
  2. The exception would trigger the on_failure_retry hook to run within the child process (as you'd expect).
  3. This particular exception would not be retried, thus going through give_up and cleaning the retry key.
  4. Then, as the on_failure_retry hook continued to delegate to our other Resque::Failure backends, we had one that would cause a rapid memory leak (not resque-retry's fault).
  5. Thus, before the child's on_failure_retry could finish, the process would be SIGKILLed for running out of memory (cf. https://stackoverflow.com/a/46284715).
  6. This leaves the parent Resque worker process to invoke job.fail with a Resque::DirtyExit (cf. https://github.com/resque/resque/blob/fa72b7b30adbbe9cf15d48784df78367b0294e82/lib/resque/worker.rb#L940).
  7. So then the on_failure_retry hook would run again in the parent process with the Resque::DirtyExit, where we'd actually decide to retry it (unlike the initial exception).
  8. However, since the retry key was deleted, the @retry_attempt comes in as 0 here: https://github.com/lantins/resque-retry/blob/db5dfd4f598e2566d54b5b96c8934b56ed5f3be0/lib/resque/plugins/retry.rb#L504-L507
  9. Due to our particular failure pattern, the re-enqueued job would trigger the same exception as before in perform.
  10. Thus, an infinite loop was formed where we'd keep trying the job, failing, killing the process, then retrying based on the Resque::DirtyExit, which would never hit the retry_limit because the key is always being deleted.

I tried coming up with a fix, but couldn't figure out how to juggle the subtleties. The best I could do is drum up a regression test that reproduces the failure for me locally: https://github.com/ajvondrak/resque-retry/commit/7e795bfc6d8704dca1ae36aacc111dba491b0ed9

diff --git a/test/retry_test.rb b/test/retry_test.rb
index f305134..f2fde92 100644
--- a/test/retry_test.rb
+++ b/test/retry_test.rb
@@ -325,5 +325,22 @@ def test_retry_on_dirty_exit
 
       assert_nil @worker.reserve
     end
+
+    def test_dirty_exits_during_failure_hooks
+      @worker.fork_per_job = true
+      Resque.enqueue(KilledDuringFailureHooksJob)
+
+      @worker.work_one_job
+      assert_equal 1, Resque.info[:processed], 'Did not process job'
+      assert_equal 1, Resque.info[:failed], 'Job should have failed'
+      assert_equal 1, Resque.info[:pending], 'Retry should be pending'
+      refute_nil Resque.peek(:testing), 'Retry should have been enqueued'
+
+      @worker.work_one_job
+      assert_equal 2, Resque.info[:processed], 'Did not process retry'
+      assert_equal 2, Resque.info[:failed], 'Retry should have failed'
+      assert_equal 0, Resque.info[:pending], 'Another retry should not be pending'
+      assert_nil Resque.peek(:testing), 'Should have only retried once'
+    end
   end
 end
diff --git a/test/test_jobs.rb b/test/test_jobs.rb
index 0e17b62..0ff4a16 100644
--- a/test/test_jobs.rb
+++ b/test/test_jobs.rb
@@ -613,6 +613,26 @@ def self.perform(*args)
   end
 end
 
+class KilledDuringFailureHooksJob
+  extend Resque::Plugins::Retry
+  @queue = :testing
+  @fatal_exceptions = [CustomException]
+
+  def self.perform
+    raise CustomException, "failed but won't retry"
+  end
+
+  # After clearing the retry key in the on_failure_retry supermethod, kill the
+  # child job's process in a way that triggers a Resque::DirtyExit in the
+  # parent worker process. The parent process will then run the
+  # on_failure_retry hook against the Resque::DirtyExit, which should retry by
+  # default.
+  def self.on_failure_retry(exception, *)
+    super
+    Process.kill("KILL", Process.pid) if exception.is_a?(CustomException)
+  end
+end
+
 class RetryCallbacksJob
   extend Resque::Plugins::Retry
   @queue = :testing

ajvondrak avatar May 30 '24 00:05 ajvondrak