[BUG] Handle auto refresh cache race condition
Tracking issue
Potentially closes: #5335
Why are the changes needed?
Propeller v1.12.0 introduced a bug in which child/external workflow status was not propagated back up to the parent workflow.
Not able to repro exactly. The current theory is that there's a race condition in which an item can be in the processing set (introduced in the new Flyte release) while not being in the workqueue. Because of the check `if item, ok := value.(Item); !ok || (ok && !item.IsTerminal() && !w.processing.Contains(k)) {` in enqueueBatches, such an item would no longer get added to the workqueue and so would never be re-synced.
Why we think this happens:
- The item (workflow) is still in the LruCache, as we keep getting status for it in GetStatus.
- If the item were not in the cache, it would get re-added to the workqueue. If an item were in the workqueue, it would be included in the syncItem process that's triggered in auto_refresh's sync. Sync grabs batches off the workqueue.
- enqueueBatches adds items to the workqueue. An item only gets added to the workqueue if, among other conditions, it's not in processing.
- gorm logs indicate that admin is not getting GetExecution requests for the child workflow whose status is not updating.
- The addition of the processing sync.set was the only change that stood out between Flyte 1.11 and 1.12.
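To make the suspected failure mode concrete, here is a minimal single-goroutine sketch. It is not the flytestdlib code: the `item` and `refresher` types, their fields, and the point at which a key is marked as processing are all simplified assumptions. It only models the quoted condition, where a non-terminal cached item is skipped whenever its key is marked as processing.

```go
package main

import "fmt"

// item is a hypothetical stand-in for a cached entry; only the terminal flag matters here.
type item struct {
	terminal bool
}

// refresher is a simplified model of the auto-refresh cache (assumed shape, not the real type).
type refresher struct {
	cache      map[string]item     // stands in for the LRU cache
	processing map[string]struct{} // stands in for the processing set
	queue      []string            // stands in for the workqueue
}

// enqueueBatches mirrors the quoted condition: a cached item is queued only if it
// is non-terminal and its key is not already marked as processing.
// Marking the key at enqueue time is an assumption of this sketch.
func (r *refresher) enqueueBatches() {
	for k, it := range r.cache {
		if _, busy := r.processing[k]; !it.terminal && !busy {
			r.queue = append(r.queue, k)
			r.processing[k] = struct{}{}
		}
	}
}

func main() {
	// Suspected bad state: the key is marked as processing but is not in the queue.
	r := &refresher{
		cache:      map[string]item{"child-wf": {terminal: false}},
		processing: map[string]struct{}{"child-wf": {}},
	}

	r.enqueueBatches()
	fmt.Println(len(r.queue)) // 0: the key is skipped on every pass while it stays marked

	// Once the stale mark is cleared, the next pass queues the item again.
	delete(r.processing, "child-wf")
	r.enqueueBatches()
	fmt.Println(len(r.queue)) // 1
}
```

In this model nothing ever clears the stale mark on its own, which matches the observation that admin stops receiving GetExecution requests for the affected child workflow.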
What changes were proposed in this pull request?
We want to keep the processing optimization to reduce the overhead of adding duplicate items to the workqueue.
We swap out the processing set in favor of a map in which the keys are the same as in the set and the values are timestamps of when each item was added to processing. We then check how long the item has been in processing: if an item has been in processing for 10 sync periods, we "evict" it from processing so that the item will get re-added to the workqueue.
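The expiration check can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: `processingTracker`, `mark`, `inProcessing`, and `checkAfter` are hypothetical names, the 30-second sync period is made up, and whether the boundary at exactly 10 periods counts as expired is a guess.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// processingTracker is a hypothetical replacement for the processing set:
// it maps an item ID to the time the item was marked as processing.
type processingTracker struct {
	mu         sync.Mutex
	started    map[string]time.Time
	syncPeriod time.Duration
	maxPeriods int // 10 in the description above
}

func newProcessingTracker(syncPeriod time.Duration) *processingTracker {
	return &processingTracker{
		started:    make(map[string]time.Time),
		syncPeriod: syncPeriod,
		maxPeriods: 10,
	}
}

// mark records that key entered processing at time now.
func (p *processingTracker) mark(key string, now time.Time) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.started[key] = now
}

// inProcessing reports whether key is still considered in flight at time now.
// An entry that is maxPeriods sync periods old (or older) is evicted, so the
// caller is free to add the item to the workqueue again.
func (p *processingTracker) inProcessing(key string, now time.Time) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	start, ok := p.started[key]
	if !ok {
		return false
	}
	if now.Sub(start) >= time.Duration(p.maxPeriods)*p.syncPeriod {
		delete(p.started, key)
		return false
	}
	return true
}

// checkAfter marks a key at a fixed start time and reports whether it is still
// in processing after the given number of sync periods has elapsed.
func checkAfter(periods int) bool {
	const syncPeriod = 30 * time.Second // hypothetical value
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	p := newProcessingTracker(syncPeriod)
	p.mark("child-wf", start)
	return p.inProcessing("child-wf", start.Add(time.Duration(periods)*syncPeriod))
}

func main() {
	fmt.Println(checkAfter(3), checkAfter(10)) // prints: true false
}
```

Passing the current time in as a parameter keeps the check deterministic in a unit test; a real implementation might read the clock directly instead.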
How was this patch tested?
- added a simple unit test for the inProcessing expiration check
- ran a workflow launching external wf -> ensured that status was propagated to the parent.
Setup process
Screenshots
Check all the applicable boxes
- [x] I updated the documentation accordingly.
- [x] All new and existing tests passed.
- [x] All commits are signed-off.
Related PRs
Docs link
Codecov Report
Attention: Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.
Project coverage is 61.10%. Comparing base (ba3647f) to head (56870aa). Report is 127 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| flytestdlib/cache/auto_refresh.go | 92.30% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #5406 +/- ##
=======================================
Coverage 61.10% 61.10%
=======================================
Files 793 793
Lines 51156 51164 +8
=======================================
+ Hits 31257 31264 +7
- Misses 17027 17028 +1
Partials 2872 2872
| Flag | Coverage Δ | |
|---|---|---|
| unittests-datacatalog | 69.31% <ø> (ø) | |
| unittests-flyteadmin | 58.90% <ø> (ø) | |
| unittests-flytecopilot | 17.79% <ø> (ø) | |
| unittests-flytectl | 68.31% <ø> (ø) | |
| unittests-flyteidl | 79.30% <ø> (ø) | |
| unittests-flyteplugins | 61.94% <ø> (ø) | |
| unittests-flytepropeller | 57.32% <ø> (ø) | |
| unittests-flytestdlib | 65.80% <92.30%> (+0.04%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
Is there any plan to create a fix release with this fix?
Thank you for working on this. Once the next release is available I can test this and I'll report back if the issue is solved.
> Is there any plan to create a fix release with this fix?
@andresgomezfrr Yes, we are validating a new release end of this week and barring any issues will get an official release out next week.
@andresgomezfrr @pablocasares there's an RC that contains this fix. I'm unsure when a final release containing this change will be made. I'll ping when that happens.