[BUG] Handle auto refresh cache race condition

Open pvditt opened this issue 1 year ago • 1 comments

Tracking issue

Potentially closes: #5335

Why are the changes needed?

Propeller v1.12.0 introduced a bug in which child/external workflow status was not propagated back up to the parent workflow.

Not able to repro exactly. The current theory is that there's a race condition in which an item can be in the processing set (which was introduced in the new Flyte release) while not being in the workqueue. Because of the condition if item, ok := value.(Item); !ok || (ok && !item.IsTerminal() && !w.processing.Contains(k)) { (in enqueueBatches), such an item would no longer get added to the workqueue and so would never be re-synced.

Why we think this happens:

  • the item (workflow) is still in the LruCache as we keep getting status for it in GetStatus.

  • If the item were not in the cache, then the item would get re-added to the workqueue. If an item were in the workqueue, then it'd be included as part of the syncItem process that's triggered in the auto_refresh's sync. Sync grabs batches off the workqueue.

  • enqueueBatches adds items to the workqueue. An item only gets added to the workqueue if it's not in processing among other conditions.

  • gorm logs indicate that admin is not getting GetExecution requests for the child workflow whose status is not updating.

  • the addition of the processing sync.set was the only change that stood out in between flyte 1.11 and 1.12.
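The suspected stuck state can be modeled in a few lines. This is a simplified, single-threaded sketch and not the flytestdlib code: toyCache and its fields are hypothetical names, and only the shape of the enqueueBatches condition is taken from the description above. It assumes an item has somehow ended up marked as processing without being in the workqueue.

```go
package main

import "fmt"

// toyCache is a hypothetical stand-in for the auto-refresh cache.
// It is not the flytestdlib implementation.
type toyCache struct {
	terminal   map[string]bool     // cached items: key -> IsTerminal()
	processing map[string]struct{} // keys believed to be queued or syncing
	workqueue  []string            // keys waiting to be synced
}

// enqueueBatches mirrors the shape of the quoted condition: a non-terminal
// item is queued only when it is not already marked as processing.
func (c *toyCache) enqueueBatches() {
	for k, isTerminal := range c.terminal {
		if _, busy := c.processing[k]; !isTerminal && !busy {
			c.processing[k] = struct{}{}
			c.workqueue = append(c.workqueue, k)
		}
	}
}

func main() {
	c := &toyCache{
		terminal: map[string]bool{"child-wf": false},
		// Assumed stale marker: the key is in processing...
		processing: map[string]struct{}{"child-wf": {}},
		// ...but nothing is actually queued for it.
		workqueue: nil,
	}
	for i := 0; i < 3; i++ {
		c.enqueueBatches()
	}
	fmt.Println(len(c.workqueue)) // prints 0: the key is never re-queued
}
```

However many times enqueueBatches runs, nothing clears the stale marker, which would match the observation that admin stops receiving GetExecution requests for the child workflow.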

What changes were proposed in this pull request?

We want to keep the processing optimization to reduce the overhead of adding duplicate items to the workqueue.

We swap out the processing set in favor of a map in which the keys are the same item IDs the set held and the values are timestamps of when each item was added to processing. We then check how long an item has been in processing: if it has been there for 10 sync periods, we "evict" it from processing so that it gets re-added to the workqueue.

How was this patch tested?

  • added a simple unit test for the inProcessing expiration check
  • ran a workflow launching external wf -> ensured that status was propagated to the parent.

Setup process

Screenshots

Check all the applicable boxes

  • [x] I updated the documentation accordingly.
  • [x] All new and existing tests passed.
  • [x] All commits are signed-off.

Related PRs

Docs link

pvditt avatar May 22 '24 05:05 pvditt

Codecov Report

Attention: Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.

Project coverage is 61.10%. Comparing base (ba3647f) to head (56870aa). Report is 127 commits behind head on master.

Files Patch % Lines
flytestdlib/cache/auto_refresh.go 92.30% 1 Missing :warning:
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5406   +/-   ##
=======================================
  Coverage   61.10%   61.10%           
=======================================
  Files         793      793           
  Lines       51156    51164    +8     
=======================================
+ Hits        31257    31264    +7     
- Misses      17027    17028    +1     
  Partials     2872     2872           
Flag Coverage Δ
unittests-datacatalog 69.31% <ø> (ø)
unittests-flyteadmin 58.90% <ø> (ø)
unittests-flytecopilot 17.79% <ø> (ø)
unittests-flytectl 68.31% <ø> (ø)
unittests-flyteidl 79.30% <ø> (ø)
unittests-flyteplugins 61.94% <ø> (ø)
unittests-flytepropeller 57.32% <ø> (ø)
unittests-flytestdlib 65.80% <92.30%> (+0.04%) :arrow_up:


codecov[bot] avatar May 22 '24 05:05 codecov[bot]

Is there any plan to create a fix release with this fix?

andresgomezfrr avatar May 27 '24 08:05 andresgomezfrr

Thank you for working on this. Once the next release is available I can test this and I'll report back if the issue is solved.

pablocasares avatar May 27 '24 09:05 pablocasares

Is there any plan to create a fix release with this fix?

@andresgomezfrr Yes, we are validating a new release at the end of this week and, barring any issues, will get an official release out next week.

pvditt avatar May 30 '24 17:05 pvditt

@andresgomezfrr @pablocasares there's an RC that contains this fix. I'm unsure when a final release containing this change will be made. I'll ping when that happens.

pvditt avatar Jun 13 '24 07:06 pvditt