yii2-queue

Infinite retry loop in RetryableJob because canRetry/attempt is not obeyed when the job/worker segfaults

Open ldkafka opened this issue 6 years ago • 6 comments

What steps will reproduce the problem?

I am working on getting this info. It happens on a live system with a few thousand jobs per day where a few hundred segfault and get re-queued indefinitely.

The job implements \yii\queue\RetryableJobInterface and has: public function canRetry($attempt, $error) { return ($attempt < 3 ) && ($error instanceof TemporaryException); }

What's expected?

Not sure if the segfault is a queue issue, but at least the "attempts" mechanism should work so we do not end up in an infinite loop. A job should really not be retried more than twice, but I see the attempt counter (in the logs) go up to 400+, at which point I have to flush the queue to stop it.

What do you get instead?

Infinite re-queuing. The segfault must happen in a very awkward place, between the attempt counter being incremented and the canRetry call...

Additional info

Using Redis queue.

| Q | A |
| --- | --- |
| Yii version | 2.0.27 |
| PHP version | v7.0.33-0+deb9u5 |
| Operating system | Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3 (2019-09-02) x86_64 GNU/Linux |

ldkafka • Sep 23 '19 04:09

A lot of jobs are left in the reserved state, which is also the point where the attempt counter is incremented via hincrby in the Redis driver. I believe these are all the jobs that segfaulted and were then re-run.
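To make the suspected cycle concrete, here is a minimal in-memory sketch of it. This is a toy model, not the yii2-queue source: `ToyQueue`, `simulateCrashes`, and the fixed 60-second TTR are all hypothetical names and values chosen for illustration. It assumes only what is described above, namely that the attempt counter is bumped at reservation time and that an expired reservation goes back to the waiting list.

```php
<?php
// Toy in-memory model of a reserve/expire cycle. NOT the yii2-queue source;
// every name below is hypothetical and chosen for illustration only.

class ToyQueue
{
    public $waiting = [];   // job ids ready to run
    public $reserved = [];  // job id => reservation expiry time
    public $attempts = [];  // job id => attempt counter
    public $ttr = 60;       // assumed time-to-reserve in seconds

    public function push($id)
    {
        $this->waiting[] = $id;
    }

    // Return expired reservations to the waiting list.
    public function moveExpired($now)
    {
        foreach ($this->reserved as $id => $expiresAt) {
            if ($expiresAt <= $now) {
                unset($this->reserved[$id]);
                $this->waiting[] = $id;
            }
        }
    }

    // Reserve the next job and bump its attempt counter *before* it runs.
    public function reserve($now)
    {
        $this->moveExpired($now);
        $id = array_shift($this->waiting);
        if ($id === null) {
            return null;
        }
        $this->reserved[$id] = $now + $this->ttr;
        $this->attempts[$id] = isset($this->attempts[$id]) ? $this->attempts[$id] + 1 : 1;
        return [$id, $this->attempts[$id]];
    }
}

// Simulate a worker that dies right after reserve(), before any error
// handler (and therefore before any canRetry-style check) can run.
function simulateCrashes(ToyQueue $queue, $cycles)
{
    $now = 0;
    $lastAttempt = 0;
    for ($i = 0; $i < $cycles; $i++) {
        $reservation = $queue->reserve($now);
        if ($reservation === null) {
            break;
        }
        $lastAttempt = $reservation[1];
        // The process "crashes" here: the job is neither deleted nor checked.
        $now += $queue->ttr + 1; // wait until the reservation expires
    }
    return $lastAttempt;
}

$queue = new ToyQueue();
$queue->push('job-1');
echo simulateCrashes($queue, 400), "\n"; // prints 400
```

In this model nothing ever compares the counter against a limit, so it grows for as long as the worker keeps dying, which matches the 400+ attempts seen in the logs.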

ldkafka • Sep 23 '19 08:09

It seems that the segfault occurs after the job finishes, at the garbage-collection stage in the Zend memory manager, similar to documented bugs such as https://bugs.php.net/bug.php?id=71662

Switching off the Zend memory manager by setting the environment variable USE_ZEND_ALLOC=0 stops the segfaults.

The remaining question is whether the queue manager can cope with a segfault in the job and still behave as expected in terms of queue/attempt management.

ldkafka • Sep 25 '19 05:09

No, it can't. A segfault can't be caught.

samdark • Sep 30 '19 07:09

I don't think the segfault needs to be caught. My thoughts are more along the lines of adjusting the attempt increment/retry logic, so that there is a safeguard before the job runs, not after.
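As a rough illustration of what "a safeguard before the job runs" could look like, here is a self-contained sketch. It is not yii2-queue code: `MAX_ATTEMPTS`, `mayRun`, `handleReserved`, and the dead-letter array are hypothetical, and the thrown exception is only a stand-in for a crash (a real segfault would kill the process and the job would come back after its reservation expires).

```php
<?php
// Sketch of a pre-run guard. Hypothetical names, not the yii2-queue API.

const MAX_ATTEMPTS = 3; // assumed limit: at most three executions per job

// Decide, before executing, whether a reserved job may still run.
// $attempt is the 1-based counter value handed out at reservation time.
function mayRun($attempt, $maxAttempts = MAX_ATTEMPTS)
{
    return $attempt <= $maxAttempts;
}

// Handle one reservation: drop the job up front once the limit is exceeded,
// otherwise execute it. Returns 'dropped' or 'done'.
function handleReserved($id, $attempt, callable $execute, array &$deadLetter)
{
    if (!mayRun($attempt)) {
        $deadLetter[] = $id; // give up without running the job again
        return 'dropped';
    }
    $execute();
    return 'done';
}

$deadLetter = [];
$runs = 0;
for ($attempt = 1; $attempt <= 10; $attempt++) {
    try {
        $state = handleReserved('job-1', $attempt, function () use (&$runs) {
            $runs++;
            throw new RuntimeException('stand-in for a crash');
        }, $deadLetter);
    } catch (RuntimeException $e) {
        continue; // model only: the job would be reserved again later
    }
    if ($state === 'dropped') {
        break;
    }
}
echo $runs, "\n"; // prints 3
```

Because the check depends only on the counter that was already incremented at reservation time, it still applies when the previous run ended in a crash. In yii2-queue a check like this might fit in a handler for Queue::EVENT_BEFORE_EXEC, whose event carries the attempt number, but whether that is enough to skip execution and remove the job should be verified against the installed version.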

ldkafka • Sep 30 '19 07:09

Do you have an idea of how to implement that?

samdark • Sep 30 '19 08:09

I'll have a look.

ldkafka • Oct 02 '19 04:10