Auto retry failed jobs

Open xpd opened this issue 7 years ago • 4 comments

Since out-of-the-box monitoring is missing, it would be nice to be able to retry failed jobs automatically. Along with a timeout, this would have prevented almost all of the issues we've had.

xpd avatar Jun 26 '18 16:06 xpd

I also would love this. Cavalcade is great, but failed jobs are causing a massive headache.

I'd be happy to work on a PR if there is any guidance on how people would prefer to handle edge cases and retry.

remkade avatar Jan 10 '19 20:01 remkade

Hey folks. We run this and have the same problem. Cavalcade assumes all failed jobs need to be investigated and manually resumed; otherwise they don't run again. On the other hand, a job does not always fail because of a problem with the code itself; external factors such as connectivity problems can cause failures too. So I found that failed jobs with an interval are best marked as completed, as this results in them being rescheduled. Jobs without an interval are best marked as waiting so they are run again.

I built a self-health job that corrects failed jobs older than x minutes, which allows some time for an investigation. Ideally, any job that fails repeatedly should not be corrected, as a repeated failure is likely caused by an issue at the application level. This should result in a notification of some kind?

If it is of interest I could spend a bit of time polishing this solution up and see if we can share it?

svandragt avatar Jan 10 '19 21:01 svandragt
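
The rule svandragt describes can be sketched roughly as follows. This is a minimal illustration in Python, not the actual self-health job, which was not posted in the thread. The `Job` fields `failed_at` and `failure_count`, the 30-minute grace period, and the limit of 3 failures are all assumptions made for the example; they are not taken from Cavalcade's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Job:
    """Hypothetical in-memory view of a job; not Cavalcade's actual schema."""
    id: int
    status: str                    # e.g. 'waiting', 'running', 'completed', 'failed'
    interval: Optional[int]        # seconds between runs, None for one-off jobs
    failed_at: Optional[datetime]  # assumed timestamp of the latest failure
    failure_count: int = 0         # assumed count of consecutive failures


def corrected_status(job: Job, now: datetime,
                     grace: timedelta = timedelta(minutes=30),
                     max_failures: int = 3) -> Optional[str]:
    """Return the status to set on the job, or None to leave it untouched."""
    if job.status != "failed" or job.failed_at is None:
        return None  # only failed jobs are candidates for correction
    if now - job.failed_at < grace:
        return None  # too recent: leave time for an investigation
    if job.failure_count >= max_failures:
        return None  # repeated failure: leave it for a human to look at
    # Jobs with an interval are marked completed so they get rescheduled;
    # jobs without one go back to waiting so they are run again.
    return "completed" if job.interval is not None else "waiting"
```

A caller would apply the returned status to each failed job and send a notification for jobs that hit the failure limit; how those writes and notifications happen is left out of the sketch.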

Yes, very much so. I've been watching Cavalcade in hopes of resolving our constant issues with failed scheduled posts.

It's possible that just by switching, this problem would be resolved. But in case it isn't, a retry mechanism would make sense. WP cron doesn't have one at all.

archon810 avatar Jan 10 '19 21:01 archon810

I have a MySQL event (basically a cron job inside MySQL) that I made for a specific WooCommerce Subscriptions job that can't stop processing if it fails.

Here's what it looks like:

CREATE EVENT IF NOT EXISTS `fix_cavalcade`
ON SCHEDULE EVERY 1 HOUR
DISABLE ON SLAVE
COMMENT 'Sets cavalcade job status back to waiting if it failed or nextrun is over 2 hours overdue'
DO
  UPDATE wordpress.wp_cavalcade_jobs
  SET status = 'waiting'
  WHERE id = 18
    AND
    site = 86
    AND
    (
      status = 'failed'
      OR
      nextrun < NOW() - INTERVAL 2 HOUR
    );

If you're running this inside of RDS, make sure you update your parameter group to enable the event scheduler (event_scheduler = ON)! It took me weeks to figure out why it wasn't running.

remkade avatar Jan 14 '19 16:01 remkade