argo-workflows fix: re-enqueue parallelism when wf block by wf-level synchronization. Fixes #11984

Fixes #11984

Motivation

Given a certain parallelism limit. If a workflow couldn't execute due to workflow level synchronization lock (semaphore or mutex), the blocked workflow will release the parallelism lock to allow other workflows to execute.

Modifications

We add an additional inProgressNoLimit map in parallelism Throttler, Workflows blocked by workflow-level mutex/semaphore will

get removed from throttler.inProgress (so parallelism limit can admit other workflows in throttler.pending)
added to throttler.inProgressNoLimit, allows throttler to Admit the workflow in the future

The number of parallelism (concurrent workflow) still holds since the workflow block by mutex won't start until another workflow is finished and the mutex is released.

Verification

Add test test fail (verify original issue) Add changes test succeed

Mar 14 '24 04:03 tczhao

There's an issue with the current approach,

Original synchronization(parallelism):

after block by mutex, it doesn't store any item in wfQueue, only enqueue wfQueue after mutex is release, doesn't waste any resource

but the current PR approach (throttle.remove + add when block by mutex):

results in an infinite loop of enqueue/dequeue to/from wfQueue until mutex release, wasting a lot of resource.

working on a new approach now

Mar 15 '24 13:03 tczhao

Updated with new approach, ready for review

Mar 15 '24 15:03 tczhao

There's an issue with the current approach,

Ah your original approach was surprisingly simple, I guess it was too simple 😅

results in an infinite loop of enqueue/dequeue to/from wfQueue until mutex release, wasting a lot of resource.

Yea that makes sense. Although the queue in general has this a bit whenever a reconcile is not yet necessary. Was it a significant uptick in CPU? If it follows DEFAULT_REQUEUE_TIME, it should take 10s between each, so I wouldn't think that would be significant

The optimal approach would effectively be to synchronize the two, er, synchronization primitives, i.e. synchronize the synchronization 😅

Mar 15 '24 16:03 agilgur5

Was it a significant uptick in CPU? If it follows DEFAULT_REQUEUE_TIME, it should take 10s between each, so I wouldn't think that would be significant

In the original approach, every time we throttle.Remove/Add it will check if there's a slot available then pop the key from throttle.pending and add to wfQueue with DefaultControllerRateLimiter. With this, we will see the same mutex blocked workflow getting processed 10+ times in a 1 min interval

Mar 18 '24 03:03 tczhao

:shipit:

Mar 22 '24 09:03 tooptoop4

🙏

Apr 20 '24 23:04 tooptoop4

@tczhao do u think this patch will work on 3.4.x line?

May 04 '24 08:05 tooptoop4

@tczhao do u think this patch will work on 3.4.x line?

Hi @tooptoop4 , this should work on 3.4 without issue

May 05 '24 15:05 tczhao

@tczhao the test failures are unrelated, see #13002 and Slack thread where Isitha, Alan, and I were investigating it but haven't found the root cause yet

May 06 '24 03:05 agilgur5

cc @Joibel

Aug 24 '24 21:08 tooptoop4

argo-workflows argo-workflows copied to clipboard

fix: re-enqueue parallelism when wf block by wf-level synchronization. Fixes #11984

Motivation

Modifications

Verification

argo-workflows
argo-workflows copied to clipboard