Concurrency control of inflight workflows

Open longquanzheng opened this issue 6 years ago • 5 comments

In some cases, a user may want only a fixed number of workflows, say X, running in parallel. This could be done on the server side by blocking StartWF requests until the number of currently active workflows drops below X. This is especially useful when users trigger workflow starts from Kafka messages.
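A minimal sketch of the blocking behavior described above, written as an in-process gate rather than anything in Cadence itself. `InflightLimiter`, `peak_concurrency`, and the fake workflow are hypothetical names for illustration; a thread stands in for a running workflow.

```python
import threading
import time


class InflightLimiter:
    """Blocks start requests while `limit` workflows are already running."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def start(self, run_workflow, *args):
        # Blocks the caller until fewer than `limit` workflows are in flight.
        self._slots.acquire()

        def _run():
            try:
                run_workflow(*args)
            finally:
                # Free the slot when the workflow finishes, even on error.
                self._slots.release()

        thread = threading.Thread(target=_run)
        thread.start()
        return thread


def peak_concurrency(limit, total):
    """Starts `total` fake workflows and returns the highest concurrency seen."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_workflow():
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)  # stand-in for real work
        with lock:
            state["active"] -= 1

    limiter = InflightLimiter(limit)
    threads = [limiter.start(fake_workflow) for _ in range(total)]
    for thread in threads:
        thread.join()
    return state["peak"]
```

A Kafka consumer loop calling `start` once per message would stall on the semaphore whenever X workflows are running, which is the back-pressure effect the issue asks for.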

longquanzheng avatar Aug 16 '19 21:08 longquanzheng

@longquanzheng Can you explain what this feature is about?

meiliang86 avatar Oct 29 '19 21:10 meiliang86

@meiliang86 I added more text to the issue. I think it could also be done as a client feature for a decision worker host, with a weaker overall guarantee.

longquanzheng avatar Nov 03 '20 05:11 longquanzheng

@longquanzheng When you say "block start", does it mean that Cadence rejects these requests, or that Cadence accepts these requests but delays the actual start (i.e. the first decision task) of the workflow?

meiliang86 avatar Jul 22 '21 18:07 meiliang86

Given Kafka in the example: is this intended as a back-pressure system? So you maintain, say, 10 workflows processing a Kafka stream, never go over 10, and that serves as your rate limiter?

If so: interesting. And somewhat hard to maintain externally, so it could make sense to roll it in. Though I don't know of any such setups in Cadence already, beyond the inbound-RPS rate limiter. Everything I've seen so far just controls how fast an unbounded queue is processed, not how fast things enter the queue.

Another way to do this may be to make an efficient per-tasklist "active workflows count" endpoint. We've wanted a way to gather that for client metrics, so clients can see whether their queue is growing in size and not just in delay, and it could be reused here: poll until active<10, then submit one. In a multi-client setup that could spike above 10 until it drains back below 10, as a herd of clients all attempt to progress at once, but there are fairly simple ways to mitigate that to like 95%+ accuracy (shared mutex, e.g. in redis). Reaching 100% accuracy would probably require a database lock, which would need to be internal.
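The "poll until active<10, then submit one" loop could look roughly like the sketch below. Everything here is an assumption for illustration: `count_active` stands in for the proposed (not yet existing) count endpoint, `start_workflow` for the client's start call, and a plain `threading.Lock` for the shared mutex that would live in something like redis in a multi-client setup.

```python
import threading
import time


def submit_when_below(count_active, start_workflow, limit, lock,
                      poll_interval=0.01, timeout=None):
    """Polls until fewer than `limit` workflows are active, then starts one.

    Returns True once a workflow was started, False if `timeout` seconds
    passed first. `lock` is shared by all clients so that only one of them
    checks and submits at a time, which avoids the herd overshoot.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        with lock:
            # Check and submit under the same lock so two clients cannot
            # both observe limit - 1 and both start a workflow.
            if count_active() < limit:
                start_workflow()
                return True
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

This only holds the limit exactly if a started workflow is visible to `count_active` right away. If the count lags behind starts, brief overshoot is still possible, which is consistent with the "95%+" estimate rather than a hard guarantee.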

Groxx avatar Jul 22 '21 18:07 Groxx

@longquanzheng When you say "block start", does it mean that Cadence rejects these requests, or that Cadence accepts these requests but delays the actual start (i.e. the first decision task) of the workflow?

I meant that Cadence would hold the request, like "PollForActivityTask" does. But that's just one option. We could certainly do a delayed start too, but we can discuss them as separate options.
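For comparison, the "accept but delay the start" option can be sketched as a pending queue in front of a fixed number of running slots. This is an illustrative model only, not Cadence's implementation; `DelayedStartQueue` and its methods are hypothetical names.

```python
from collections import deque


class DelayedStartQueue:
    """Accepts every start request but runs at most `limit` at a time."""

    def __init__(self, limit):
        self.limit = limit
        self.running = set()
        self.pending = deque()

    def request_start(self, workflow_id):
        # The request is always accepted; only the actual start may wait.
        if len(self.running) < self.limit:
            self.running.add(workflow_id)
            return "started"
        self.pending.append(workflow_id)
        return "queued"

    def complete(self, workflow_id):
        """Marks a workflow finished and promotes the oldest queued one.

        Returns the promoted workflow id, or None if nothing was waiting.
        """
        self.running.discard(workflow_id)
        if self.pending and len(self.running) < self.limit:
            promoted = self.pending.popleft()
            self.running.add(promoted)
            return promoted
        return None
```

The difference from the "hold the request" option is where the waiting happens: here the caller gets an immediate answer and the backlog sits in the queue, while a held request keeps the caller's connection open until a slot frees up.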

longquanzheng avatar Jul 22 '21 19:07 longquanzheng