
Inconsistent executions count with ActiveJob.retry_on

Status: Open · eLod opened this issue 3 years ago · 7 comments

Rails 5.2.2, gem version 0.11.

I understand that retry_on and friends are not supported yet (though discard_on simply works), so this is not strictly a bug.

executions (along with provider_job_id and priority, though those two are not problematic) is filtered out of serialization (see ActiveJob::QueueAdapters#build_worker and ActiveJob::QueueAdapters::SERIALIZATION_FILTERED_KEYS) and supplied by Google Cloud Tasks instead (see also #6). However, because retry_on reschedules the job as a brand-new task (new task id, retry count back at 0), the value keeps resetting, which in turn leads to never-ending retries.
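For context, roughly how I understand the mechanism (an illustrative sketch only, not the actual cloudtasker code; the method names are made up and the header name is the one documented for Cloud Tasks HTTP targets):

```ruby
# Illustrative sketch only, not the cloudtasker source.
FILTERED_KEYS = %i[executions provider_job_id priority].freeze

# Keys like `executions` are dropped before the job is serialized into the task payload...
def filtered_job_data(job_serialization)
  job_serialization.reject { |key, _| FILTERED_KEYS.include?(key.to_sym) }
end

# ...and re-derived from the task's retry header on each run. A job rescheduled
# by retry_on is a brand-new task, so this count starts again from 0.
def current_executions(request_headers)
  request_headers.fetch('X-CloudTasks-TaskRetryCount', 0).to_i + 1
end
```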

If retry_on functionality is desirable, I was thinking it could be retained by putting this information in job_meta (or even in the root worker_payload). If you are not opposed to this change, I would gladly try to make a PR for it.

eLod · Aug 17 '21 12:08

I checked: commenting out executions from SERIALIZATION_FILTERED_KEYS, together with the job_executions override in JobWrapper#perform, makes it work as expected (obviously this simply ignores the value sent in the headers).

Upon further thought, I am not sure what the expected behaviour should be. retry_on already overlaps with the backend's retries: retry_on simply reschedules the task without raising the error (completing the currently running one), and when it runs out of attempts it re-raises the error (by default, i.e. without a block), which kicks in the backend's retry mechanism.
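To make the overlap concrete, a minimal example (standard ActiveJob API, nothing cloudtasker-specific; MyError, do_work and the values are placeholders):

```ruby
class ExampleJob < ApplicationJob
  # retry_on catches MyError, re-enqueues the job (incrementing `executions`),
  # and only re-raises once the 3 attempts are exhausted, at which point the
  # backend's own retry mechanism takes over.
  retry_on MyError, wait: 5.seconds, attempts: 3

  def perform
    do_work # raises MyError on failure
  end
end
```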

Sidekiq works this way too, but I haven't checked the execution count specifically, e.g. for a failing task declared with retry_on Exception, attempts: 2: after the 2 attempts, when it is retried by Sidekiq's own mechanism, does it raise the execution count or not?

eLod · Aug 17 '21 15:08

I don't have the full background on why ActiveJob decided to go with the retry_on option. To me it seems (1) redundant with the retry capabilities of most backends and (2) confusing, because error handling logic ends up split between retry_on and the backend retry logic.

I'm going to close this issue until there is a strong use case for it, in which case we can reopen.

alachaum · Aug 18 '21 07:08

Well, for me retry_on is the better/preferred way. First, it's consistent and doesn't lock you into the backend's mechanism, so I am free to switch between Sidekiq and Cloudtasker and be confident the API behaves the same; that's a big win. Second, it has far more features than the backends, for example I can specify a retry policy per exception class (though in Rails 5 exceptions are counted with a single counter, which is corrected in Rails 6) without additional glue code (which is, again, backend specific and has to be refactored every time you switch backends).
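For example, something like this (RateLimitError is a placeholder; the rest is standard ActiveJob):

```ruby
class ExampleJob < ApplicationJob
  # Per-exception retry policies, declared once, independent of the backend.
  retry_on Net::OpenTimeout, wait: :exponentially_longer, attempts: 5
  retry_on RateLimitError,   wait: 1.minute,              attempts: 10
  discard_on ActiveJob::DeserializationError
end
```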

I agree it's a bit confusing, but providing an empty block (or one that just does error tracking) simply prevents the backend mechanism from kicking in. I feel the overlap and confusion comes from the backends all wanting to be usable without Rails or ActiveJob, so they "duplicate" its functionality (like Cloudtasker's on_error and on_dead, which are already baked into ActiveJob).
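Something along these lines (ErrorTracker is just a placeholder for whatever error reporting is in use):

```ruby
class ExampleJob < ApplicationJob
  # When a block is given, the error is handed to the block after the attempts
  # are exhausted instead of being re-raised, so the backend never retries.
  retry_on StandardError, wait: 10.seconds, attempts: 5 do |job, error|
    ErrorTracker.notify(error, job_id: job.job_id)
  end
end
```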

I understand if you are opposed to this change, so I am not pushing it, but I thought I would share my two cents.

eLod · Aug 18 '21 09:08

Just checked with Sidekiq, and I can confirm it works as described: with retry_on Exception, wait: 5.seconds, attempts: 3 I see the 3 runs with executions 1, 2 and 3, then (the error is re-raised and) Sidekiq's own retry mechanism kicks in, but executions is never incremented again, so it stays at 3 for any further performs.

Edit: and just to be explicit, without retry_on (i.e. with only Sidekiq's own retry mechanism) executions is simply never incremented and stays at 1 no matter how many retries. So executions is essentially part of ActiveJob's own retry mechanism (i.e. retry_on).

eLod · Aug 18 '21 11:08

Fair enough. I've reopened the issue. If you feel like opening a PR for it, feel free to do so. Otherwise I'll check how to address this some time in the future.

alachaum · Aug 19 '21 13:08

I can cook up a PR for this next week. Do you have any suggestions? Should I just ignore the values from the headers (when using the JobWrapper only, so only for ActiveJob)? Should I add helper methods to ActiveJob so users can actually get those values if they want to?

eLod · Aug 19 '21 13:08

It has been a while, but I've made a proposal for an intermediate solution based on some of the suggestions in this issue. It comes with a few sharp edges, but it would already provide an escape hatch via the override and could be further extended as needed. If you can spare some time, it would be great to get some feedback 🙏.

lagr · Dec 20 '23 09:12