argo-workflows icon indicating copy to clipboard operation
argo-workflows copied to clipboard

fix: re-enqueue parallelism when wf block by wf-level synchronization. Fixes #11984

Open tczhao opened this issue 11 months ago • 8 comments

Fixes #11984

Motivation

Given a certain parallelism limit. If a workflow couldn't execute due to workflow level synchronization lock (semaphore or mutex), the blocked workflow will release the parallelism lock to allow other workflows to execute.

Modifications

We add an additional inProgressNoLimit map in parallelism Throttler, Workflows blocked by workflow-level mutex/semaphore will

  • get removed from throttler.inProgress (so parallelism limit can admit other workflows in throttler.pending)
  • added to throttler.inProgressNoLimit, allows throttler to Admit the workflow in the future

The number of parallelism (concurrent workflow) still holds since the workflow block by mutex won't start until another workflow is finished and the mutex is released.

Verification

Add test test fail (verify original issue) Add changes test succeed

tczhao avatar Mar 14 '24 04:03 tczhao

There's an issue with the current approach,

Original synchronization(parallelism):

  • after block by mutex, it doesn't store any item in wfQueue, only enqueue wfQueue after mutex is release, doesn't waste any resource

but the current PR approach (throttle.remove + add when block by mutex):

  • results in an infinite loop of enqueue/dequeue to/from wfQueue until mutex release, wasting a lot of resource.

working on a new approach now

tczhao avatar Mar 15 '24 13:03 tczhao

Updated with new approach, ready for review

tczhao avatar Mar 15 '24 15:03 tczhao

There's an issue with the current approach,

Ah your original approach was surprisingly simple, I guess it was too simple 😅

  • results in an infinite loop of enqueue/dequeue to/from wfQueue until mutex release, wasting a lot of resource.

Yea that makes sense. Although the queue in general has this a bit whenever a reconcile is not yet necessary. Was it a significant uptick in CPU? If it follows DEFAULT_REQUEUE_TIME, it should take 10s between each, so I wouldn't think that would be significant

The optimal approach would effectively be to synchronize the two, er, synchronization primitives, i.e. synchronize the synchronization 😅

agilgur5 avatar Mar 15 '24 16:03 agilgur5

Was it a significant uptick in CPU? If it follows DEFAULT_REQUEUE_TIME, it should take 10s between each, so I wouldn't think that would be significant

In the original approach, every time we throttle.Remove/Add it will check if there's a slot available then pop the key from throttle.pending and add to wfQueue with DefaultControllerRateLimiter. With this, we will see the same mutex blocked workflow getting processed 10+ times in a 1 min interval

tczhao avatar Mar 18 '24 03:03 tczhao

:shipit:

tooptoop4 avatar Mar 22 '24 09:03 tooptoop4

🙏

tooptoop4 avatar Apr 20 '24 23:04 tooptoop4

@tczhao do u think this patch will work on 3.4.x line?

tooptoop4 avatar May 04 '24 08:05 tooptoop4

@tczhao do u think this patch will work on 3.4.x line?

Hi @tooptoop4 , this should work on 3.4 without issue

tczhao avatar May 05 '24 15:05 tczhao

@tczhao the test failures are unrelated, see #13002 and Slack thread where Isitha, Alan, and I were investigating it but haven't found the root cause yet

agilgur5 avatar May 06 '24 03:05 agilgur5

cc @Joibel

tooptoop4 avatar Aug 24 '24 21:08 tooptoop4