argo-workflows
fix: re-enqueue parallelism when wf block by wf-level synchronization. Fixes #11984
Fixes #11984
Motivation
Given a certain parallelism limit, if a workflow cannot execute due to a workflow-level synchronization lock (semaphore or mutex), the blocked workflow should release its parallelism slot to allow other workflows to execute.
Modifications
We add an additional `inProgressNoLimit` map in the parallelism `Throttler`. Workflows blocked by a workflow-level mutex/semaphore will:
- get removed from `throttler.inProgress` (so the parallelism limit can admit other workflows from `throttler.pending`)
- get added to `throttler.inProgressNoLimit`, which allows the `throttler` to `Admit` the workflow in the future

The parallelism limit (number of concurrent workflows) still holds, since a workflow blocked by a mutex won't start until another workflow finishes and releases the mutex.
Verification
Added a test that fails without the fix (verifying the original issue); with the changes applied, the test succeeds.
There's an issue with the current approach.

Original synchronization (parallelism):
- after being blocked by a mutex, it doesn't store any item in `wfQueue`, and only enqueues to `wfQueue` after the mutex is released, so it doesn't waste any resources

But the current PR approach (`throttle.Remove` + `Add` when blocked by a mutex):
- results in an infinite loop of enqueue/dequeue to/from `wfQueue` until the mutex is released, wasting a lot of resources

Working on a new approach now.
Updated with new approach, ready for review
> There's an issue with the current approach

Ah, your original approach was surprisingly simple, I guess it was too simple 😅

> results in an infinite loop of enqueue/dequeue to/from `wfQueue` until mutex release, wasting a lot of resource

Yea, that makes sense. Although the queue in general has this a bit whenever a reconcile is not yet necessary. Was it a significant uptick in CPU? If it follows `DEFAULT_REQUEUE_TIME`, it should take 10s between each, so I wouldn't think that would be significant.
The optimal approach would effectively be to synchronize the two, er, synchronization primitives, i.e. synchronize the synchronization 😅
> Was it a significant uptick in CPU? If it follows `DEFAULT_REQUEUE_TIME`, it should take 10s between each, so I wouldn't think that would be significant
In the original approach, every time we call `throttle.Remove`/`Add`, it checks whether a slot is available, then pops the key from `throttle.pending` and adds it to `wfQueue` with `DefaultControllerRateLimiter`. With this, we see the same mutex-blocked workflow getting processed 10+ times in a 1-minute interval.
:shipit:
🙏
@tczhao do u think this patch will work on 3.4.x line?
> @tczhao do u think this patch will work on 3.4.x line?
Hi @tooptoop4, this should work on 3.4 without issue
@tczhao the test failures are unrelated, see #13002 and Slack thread where Isitha, Alan, and I were investigating it but haven't found the root cause yet
cc @Joibel