proletarian
proletarian copied to clipboard
Job runs infinitely if `failed-job-fn` also fails
Hey!
process-next-job! seems to run in a transaction, if failed-job-fn also throws an exception (which runs in that transaction), the database changes to clean up the job will be rolled back, meaning the job won't be cleaned up and will be picked up again. If both the job handler and failed-job-fn keep failing, this would lead to an infinite retry loop.
I had two thoughts on how to resolve this:
- wrapping
failed-job-fninvocation in atryexpression, and catching and logging the exception thrown byfailed-job-fn. - just moving
failed-job-fninvocation out of the transaction.
I think the latter will make the failure more visible and hence makes more sense.
Please let me know if I'm missing anything, otherwise, I'd be happy to create a PR for this.
Thanks for the bug report and analysis!
Regarding approaches for solving this:
just moving failed-job-fn invocation out of the transaction
I agree that this makes it more visible, because then the exception would be caught in process-next-job! (https://github.com/msolli/proletarian/blob/main/src/proletarian/worker.clj#L97), causing the worker to stop after the error was logged (at least with the default configuration).
I think this is probably what we want when a failed-job-fn throws - it would mean the implementation has a serious error that must be corrected before proceeding with the work.
But this approach has some drawbacks, too. The current code is structured such that a lot of code would have to change. Maybe we can find a third approach?
wrapping failed-job-fn invocation in a try expression, and catching and logging the exception thrown by failed-job-fn.
If, instead of just logging the exception, we also set the current thread's interrupt status, then the polling would stop (because of the check at https://github.com/msolli/proletarian/blob/main/src/proletarian/worker.clj#L80).
What do you think?
I think logically it makes sense to set thread's interrupt status.
Practically on the other hand, if we didn't set the interrupt status, since we are not throwing the caught exception, the job would be successfully deleted, and on the next invocation process-next-job! would return nil due to the when-let here, and hence, only difference from setting the interrupt status would be one extra call as far as I can understand what's happening here.
1 call is 1 call though, and more importantly I think/agree setting the interrupt status makes what's happening clearer and makes the logic more understandable.
Is this what you have in mind?
To be clear, I think we want the worker to fail hard with errors like this. The exception should be logged, and the worker should stop polling for jobs. This is what happens with the change you've proposed, so that's good.
Why would we want the worker to stop? Because important stuff (for the application) might be happening in the failed-job-fn, stuff that uphold domain data invariants, for example. We want that code to run. So by failing hard and stopping the worker, the current job will be retried after new code has been deployed that fixes the issue. Since jobs should be idempotent, it shouldn't matter that it runs one more time. Now the failed-job-fn will run once more, and hopefully not fail. The transaction will then complete and the job moved to the archive.
As for the change: The maybe-retry! function in its current form is not very well tested. I'm working on improving the testing story to be able to test the logic in that function, including the proposed change.
Awesome. Thank you for the detailed reply!
I've sat down and written some tests and run through the code, and I think I've made an error in my thinking. Actually I now think the current behavior is correct, although with a couple of caveats.
When the failed-job-fn function throws, the current behavior is that this exception is caught at https://github.com/msolli/proletarian/blob/main/src/proletarian/worker.clj#L97:
(try
(let [log (log/wrap log {:worker-thread-id (::worker-thread-id config)})]
(log ::polling-for-jobs)
(loop []
(when (process-next-job! data-source queue handler-fn log config)
(recur))))
(catch SQLTransientException e
(log ::sql-transient-exception {:throwable e}))
(catch InterruptedException _
(log ::worker-interrupted)
(stop-queue-worker!))
(catch Throwable e
(log ::job-worker-error {:throwable e})
;; Stop polling if error handler returns true
(when ((::on-polling-error config) e)
(stop-queue-worker!))))
As long as ::on-polling-error returns true, the queue workers will stop, which is what we want.
If both the job handler and
failed-job-fnkeep failing, this would lead to an infinite retry loop.
There would only be an infinite loop if ::on-polling-error returns false, which is configurable (the default is for it to return true). So the first caveat is: ::on-polling-error must return true when there is an error in the failed-job-fn.
Which leads me to the next caveat: There is no proper reporting of an error in failed-job-fn, or a way to tell if it came from the failed-job-fn. It will be caught and logged with ::job-worker-error, but the log message is not clear, and the ::on-polling-error function won't be able to tell from the exception that it came from failed-job-fn. I will make it so that an error in failed-job-fn is caught and re-thrown with an exception that wraps the original exception. The logging will look like this then:
:proletarian.worker/job-worker-error {:throwable #error {
:cause Some exception in failed-job-fn
:via
[{:type java.lang.Exception
:message An error occurred when executing the :proletarian.retry/failed-job-fn function
:at [proletarian.retry$maybe_retry_BANG_ invokeStatic retry.clj 68]}
{:type java.lang.Exception
:message Some exception in failed-job-fn
:at [example_b.enqueue_jobs$handle_failed_job_BANG_ invokeStatic enqueue_jobs.clj 62]}]
:trace
[...]}}
The exception will have data to show its provenance:
(ex-data e) ;; -> {:type :proletarian.retry/failed-job-fn-error}
The changes are now pushed to main - please give it a try!
It was my bad to not to explain the situation clearly from the start, indeed it only happens if ::on-polling-error is configured to return false.
I have checked the changes and ran it with my example for infinite error loop. Indeed, error message is clearer now, but it's still running infinitely - which as I understand from your comment upper you're already aware.
Just to be sure I get you right, are you saying if the user has specifically overwrote ::on-polling-error to return false, they should be handling this case? (or at least made aware by the logging changes you introduced)
In my opinion as a user of this great library, this could still hurt the users in the sense that it would take time to react, and the consequences could be great, imagine this infinitely run job one to write something to DynamoDB, it would scale the DB up and up to meet increased Write Capacity Units and could technically make quite a surprise on the bill day. :)
Yeah, I can see that it might get you into not-so-great situations. But the alternative, to happily let clearly incorrect code continue to run, is not great either.
First, I would question the need to customize ::on-polling-error in the first place. The default is to stop the worker, since something obviously is wrong and the current production code is clearly not able to handle it. So I would ask why you've provided a false-returning ::on-polling-error?
If you really need that function, then a workaround would be to check if (:type (ex-data e)) is :proletarian.retry/failed-job-fn-error. The you should return true so that the worker can stop, and the consequences you sketch out wouldn't happen.
Out of curiosity, what kind of work are you doing in failed-job-fn that is throwing exceptions?
Sorry for the very late reply!
So I would ask why you've provided a false-returning ::on-polling-error?
The worker executes jobs with side effects, so its possible for them to fail, and I think the intention setting ::on-polling-error to false is to keep worker running in such cases.
Out of curiosity, what kind of work are you doing in failed-job-fn that is throwing exceptions?
We're tracking the status of the runing job in an entity that it's related to. We're setting that status in the entity to "failed" to render it in frontend in failed-job-fn. So it's a db call, hence why there's a risk for it to fail.
So I would ask why you've provided a false-returning ::on-polling-error?
The worker executes jobs with side effects, so its possible for them to fail, and I think the intention setting ::on-polling-error to false is to keep worker running in such cases. […] We're tracking the status of the runing job in an entity that it's related to. We're setting that status in the entity to "failed" to render it in frontend in failed-job-fn. So it's a db call, hence why there's a risk for it to fail.
Any side-effecting job worker might fail, this is normal and expected. Since your domain cares about the failures, you should catch any exceptions you care about in the job worker and update the entity with the failed state right there, in the job worker (not in the more generic and infrastructure-y failed-job-fn). The job worker would then finish sucessfully, even if it "failed". It was successful on the infrastructure level, while it failed on the domain level.
Does this make sense?
Let me know if you have any follow-up questions regarding this. I'm closing the issue in the meantime.