resque-retry
resque-retry copied to clipboard
Dirty exits during `on_failure` hooks do not consider retry limit properly
Discovered a hairy issue! It involved a couple moving parts:
- We had a job that was always raising a particular exception in
perform
due to some bad data. - The exception would trigger the
on_failure_retry
hook to run within the child process (as you'd expect). - This particular exception would not be retried, thus going through
give_up
and cleaning the retry key. - Then, as the
on_failure_retry
hook continued to delegate to our otherResque::Failure
backends, we had one that would cause a rapid memory leak (not resque-retry's fault). - Thus, before the child's
on_failure_retry
could finish, the process would be SIGKILLed for running out of memory (cf. https://stackoverflow.com/a/46284715). - This leaves the parent Resque worker process to invoke
job.fail
with aResque::DirtyExit
(cf. https://github.com/resque/resque/blob/fa72b7b30adbbe9cf15d48784df78367b0294e82/lib/resque/worker.rb#L940). - So then the
on_failure_retry
hook would run again in the parent process with theResque::DirtyExit
, where we'd actually decide to retry it (unlike the initial exception). - However, since the retry key was deleted, the
@retry_attempt
comes in as 0 here: https://github.com/lantins/resque-retry/blob/db5dfd4f598e2566d54b5b96c8934b56ed5f3be0/lib/resque/plugins/retry.rb#L504-L507 - Due to our particular failure pattern, the re-enqueued job would trigger the same exception as before in
perform
. - Thus, an infinite loop was formed where we'd keep trying the job, failing, killing the process, then retrying based on the
Resque::DirtyExit
, which would never hit theretry_limit
because the key is always being deleted.
I tried coming up with a fix, but couldn't figure out how to juggle the subtleties. The best I could do is drum up a regression test that reproduces the failure for me locally: https://github.com/ajvondrak/resque-retry/commit/7e795bfc6d8704dca1ae36aacc111dba491b0ed9
diff --git a/test/retry_test.rb b/test/retry_test.rb
index f305134..f2fde92 100644
--- a/test/retry_test.rb
+++ b/test/retry_test.rb
@@ -325,5 +325,22 @@ def test_retry_on_dirty_exit
assert_nil @worker.reserve
end
+
+ def test_dirty_exits_during_failure_hooks
+ @worker.fork_per_job = true
+ Resque.enqueue(KilledDuringFailureHooksJob)
+
+ @worker.work_one_job
+ assert_equal 1, Resque.info[:processed], 'Did not process job'
+ assert_equal 1, Resque.info[:failed], 'Job should have failed'
+ assert_equal 1, Resque.info[:pending], 'Retry should be pending'
+ refute_nil Resque.peek(:testing), 'Retry should have been enqueued'
+
+ @worker.work_one_job
+ assert_equal 2, Resque.info[:processed], 'Did not process retry'
+ assert_equal 2, Resque.info[:failed], 'Retry should have failed'
+ assert_equal 0, Resque.info[:pending], 'Another retry should not be pending'
+ assert_nil Resque.peek(:testing), 'Should have only retried once'
+ end
end
end
diff --git a/test/test_jobs.rb b/test/test_jobs.rb
index 0e17b62..0ff4a16 100644
--- a/test/test_jobs.rb
+++ b/test/test_jobs.rb
@@ -613,6 +613,26 @@ def self.perform(*args)
end
end
+class KilledDuringFailureHooksJob
+ extend Resque::Plugins::Retry
+ @queue = :testing
+ @fatal_exceptions = [CustomException]
+
+ def self.perform
+ raise CustomException, "failed but won't retry"
+ end
+
+ # After clearing the retry key in the on_failure_retry supermethod, kill the
+ # child job's process in a way that triggers a Resque::DirtyExit in the
+ # parent worker process. The parent process will then run the
+ # on_failure_retry hook against the Resque::DirtyExit, which should retry by
+ # default.
+ def self.on_failure_retry(exception, *)
+ super
+ Process.kill("KILL", Process.pid) if exception.is_a?(CustomException)
+ end
+end
+
class RetryCallbacksJob
extend Resque::Plugins::Retry
@queue = :testing