action-scheduler icon indicating copy to clipboard operation
action-scheduler copied to clipboard

Race conditions affecting ActionScheduler_QueueRunner::maybe_dispatch_async_request()

Open brandonpayton opened this issue 3 years ago • 1 comments
trafficstars

There are two race conditions that affect ActionScheduler_QueueRunner:

  1. https://github.com/woocommerce/action-scheduler/blob/d9362151fbdb47620ccbe322ae42bddc004a0f2c/classes/ActionScheduler_QueueRunner.php#L103-L105 is a race because of the time between checking whether there is already a lock and attempting to obtain the lock. It needs to be a single action, where the code attempts to obtain a lock and does not proceed unless it succeeds.
  2. https://github.com/woocommerce/action-scheduler/blob/d9362151fbdb47620ccbe322ae42bddc004a0f2c/classes/ActionScheduler_OptionLock.php#L27 is a race because multiple threads of execution can get the same lock option value and try to update at the same time, and nothing will stop each from updating the option value. One way to address this is to write an UPDATE wp_options query with a WHERE that is only satisfied if the option still has the previous, expected value. If the update does not affect any rows, then you did not successfully obtain the lock.

These can be argued to be race conditions based on the code, but we've also seen the effects of this in practice. While troubleshooting a customer issue, I added logging for the action_scheduler_lock_async-request-runner option and found the following in the logs:

[17-Dec-2021 06:44:13 UTC] update_option_action_scheduler_lock_async-request-runner OLD ('1639723449') NEW (1639723513)
[17-Dec-2021 06:44:14 UTC] update_option_action_scheduler_lock_async-request-runner OLD ('1639723449') NEW (1639723514)

Above, both option updates have the same old value, and both are setting a new value. This means two different things believe they obtained the same lock at the same time.

This was found while troubleshooting frequent MySQL deadlock for action scheduler table updates on WordPress.com Atomic sites.

brandonpayton avatar Dec 29 '21 00:12 brandonpayton

Thanks for flagging this.

It does look possible that this could even lead to max_concurrent_batches being exceeded.

barryhughes avatar Jan 19 '22 18:01 barryhughes

Same issue that is crashing my site. We are currently looking into removing WooCommerce due to this locking up MySQL and crashing the site daily. Numerous issues with action scheduler but this is one that makes the site completely broken. Almost a year and the issue still persists.

jpSimkins avatar Sep 29 '22 20:09 jpSimkins

Sorry to hear that, @jpSimkins.

Do you have any more details as to what is causing the crash? I'd assume it is not the race condition described in the opening post by itself (at least, not directly), but rather something that happens following that.

Something you could consider here is disabling async based queue runners using a snippet like this one:

add_filter( 'action_scheduler_allow_async_request_runner', '__return_false' );

So long as WordPress scheduled events (ie, WP Cron) is working, the queue should still be processed and—if you have access to WP CLI in your hosting environment—you could additionally consider augmenting default WP Cron-based processing with a crontab rule along these lines:

* * * * * wp action-scheduler run --path=/path/to/wp

By eliminating the async queue runners (and, specifically, eliminating the related option lock race condition), this would reduce the chances of two or more attempts to stake claims at the same time (and the crontab rule, if you can set that up, would help to ensure that the queue of actions are still processed at a decent tempo).

barryhughes avatar Sep 29 '22 22:09 barryhughes

I am testing with the hook:

add_filter( 'action_scheduler_allow_async_request_runner', '__return_false' );

I will create a new ticket for this since my issue seems to be related but not exact. I will collect more information for this and post it when I have organized data for this. Thank you

jpSimkins avatar Sep 29 '22 22:09 jpSimkins