cloudtasker
cloudtasker copied to clipboard
Inconsistent executions count with ActiveJob.retry_on
Rails 5.2.2, gem version 0.11.
I understand retry_on
and company not supported yet (though discard_on
simply works), so this is not strictly a bug.
executions
(with provider_id
and priority
though these 2 are not problematic) is filtered from serialization (ActiveJob::QueueAdapters#build_worker
and ActiveJob::QueueAdapters::SERIALIZATION_FILTERED_KEYS
) and supplied by google cloud tasks (also see #6), however because of retrying/rescheduling (new task id, retry count is 0) this keeps resetting, which in turn leads to never ending retrial.
If retry_on
functionality is desirable i was thinking maybe putting this information in job_meta
(or even in root worker_payload
) it could be retained. If you are not opposing this change i would gladly try to make a PR for it.
checked commenting out the executions
from SERIALIZATION_FILTERED_KEYS
and the override with job_executions
in JobWrapper#perform
makes it work as expected (obviously this simply ignores the value sent in headers).
upon further thinking about this i am not sure what should be the expected way of working. retry_on
already overlaps with the backend's retrial, e.g. retry_on
simply reschedules the task without raising the error, completing the actual running one, and when it runs out of attempts it simply reraises the error (by default, eg without a block), that will kick in the backend's retry mechanism.
sidekiq works this way also, but i haven't checked the execution count specificallly, eg for a failing task that is retry_on Exception, attempts: 2
, after the 2 trials when it is retried with sidekiq's mechanism, does it raise the execution count or not.
I don't have full background as to why ActiveJob decided to go with the retry_on
option. To me it seems (1) redundant with the retry capabilities of most backends and (2) confusing because errors handling logic is therefore split between retry_on
and the backend retry logic.
I'm going to close this issue until there is a strong use case for it, in which case we can reopen.
well for me retry_on
is the better/preferred way. first it's consistent, not locking into the backend's mechanism, so i am free to switch between sidekiq and cloudtasker and be confident the api is/behaves the same. that's a big win. second it has way more features than the backend, for example i can specify retry mechanism per exception class (though in rails 5 the exceptions are counted with a single counter, but that is corrected for rails6) without additional glue code (that is again backend specific and has to be refactored all the time switching backends).
i agree it's a bit confusing, but providing an empty block or error tracking simply solves the backend mechanism kicking in. i feel the overlap and confusion comes because the backends all want to provide a way to be usable without rails or activejob so they "duplicate" its functionality (like cloudtasker's on_error and on_dead, which are already baked in activejob).
i understand if you are opposing this change, so i am not pushing it, but i thought i will share my 2cents.
just checked with sidekiq, and can confirm, it works as described, e.g. with retry_on Exception, wait: 5.seconds, attempts: 3
i see the 3 runs with executions
1, 2 and 3, then (error is reraised,) sidekiq's own retry mechanism kicks in, however executions
is never increased anymore, so it stays at 3 for any further performs.
edit: and just to be explicit, without retry_on
(eg with only sidekiq's own retry mechanism) executions
is simply never incremented and stays 1 no matter how many retries. so executions
is mostly part of ActiveJob's own retry mechanism (eg retry_on
).
Fair enough. I've reopened the issue. If you feel like opening a PR for it, feel free to do so. Otherwise I'll check how to address this some time in the future.
i can cook up a PR for this the next week, do you have any suggestions? should i just ignore the values from headers (when using the JobWrapper only, so only for ActiveJob)? should i add helper methods to activejob (so users can actually get those values if they want to)?
It has been a while, but I've made a proposal for an intermediate solution based on some of the suggestions in the issue. It comes with a few sharp edges but would already provide an escape hatch to the override and could be further extended as needed. If you can spare some time, it would be great to get some feedback 🙏 .