conductor: Latency tradeoff could be better when polling with a long timeout and count > 1 in MySQL and Postgres persistences

Describe the bug

If all workers poll for tasks with, for example, count=10 and timeout=5s, tasks get picked up very slowly during periods of low activity.

This is not really a bug, but maybe a tradeoff to clarify or change.

Details

Conductor version: 3.19.0

To Reproduce

Create a workflow with a single task.
Start a worker polling for that task with count=2 and timeout=5s.
Execute the workflow.
Check how long the task stays in the queue.
Repeat every 5 seconds.

What happens is that the DAO implementations keep trying to get at least "count" messages from the queue for as long as the timeout has not elapsed.

MySQL: https://github.com/conductor-oss/conductor/blob/55268f0633969379874cc425ef32c9048bbddbfa/mysql-persistence/src/main/java/com/netflix/conductor/mysql/dao/MySQLQueueDAO.java#L349

Postgres: https://github.com/conductor-oss/conductor/blob/55268f0633969379874cc425ef32c9048bbddbfa/postgres-persistence/src/main/java/com/netflix/conductor/postgres/dao/PostgresQueueDAO.java#L169

Dyno: https://github.com/conductor-oss/conductor/blob/main/redis-persistence/src/main/java/com/netflix/conductor/redis/dao/DynoQueueDAO.java#L96, which relies on https://github.com/Netflix/dyno-queues/blob/dev/dyno-queues-redis/src/main/java/com/netflix/dyno/queues/redis/RedisDynoQueue.java#L343
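For readers who do not want to open the links, the behaviour described above can be modelled roughly as follows. This is a simplified in-memory sketch, not the actual DAO code: the class name GreedyPopModel, the String messages, and the 50 ms retry interval are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Simplified in-memory model of the pop loop described above; not the real DAO code. */
class GreedyPopModel {
    // Hypothetical pause between two attempts to read from the queue.
    static final long RETRY_INTERVAL_MS = 50;

    /** Keeps retrying until `count` messages are held or `timeoutMs` has elapsed. */
    static List<String> pop(Queue<String> queue, int count, long timeoutMs) {
        long start = System.nanoTime();
        List<String> popped = new ArrayList<>();
        drain(queue, popped, count);
        // A message that is already held is NOT returned here: the loop only
        // exits once the batch is full or the timeout is reached.
        while (popped.size() < count && elapsedMs(start) < timeoutMs) {
            try {
                Thread.sleep(RETRY_INTERVAL_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            drain(queue, popped, count);
        }
        return popped;
    }

    /** Moves messages from the queue into `into` until `count` are held or the queue is empty. */
    private static void drain(Queue<String> queue, List<String> into, int count) {
        String next;
        while (into.size() < count && (next = queue.poll()) != null) {
            into.add(next);
        }
    }

    private static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }
}
```

In this model, a queue holding a single message polled with count=2 only returns that message once the full timeout has passed, which matches the reproduction steps.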

Expected behavior

In case of low activity, the pop method should wait less time and return a task anyway, instead of trying to maximise the number of tasks polled.

Why?

I think we have multiple cases, depending on the number of tasks to poll:

high activity: there is a very high number of tasks in the queue, so long polling is barely relevant, and it's easy to poll for count tasks. Count is important to achieve high throughput.
very low activity: tasks arrive in the queue at a slow rate, so long polling is relevant to improve latency, but waiting for count tasks to be there is waiting for nothing. Long polling is more relevant than count in this case, and a task should be returned earlier.
medium activity: tasks arrive in the queue at a rate such that maybe 50% of workers get their count filled and 50% of them get less. In this case, some tasks are processed fast, while others wait much longer. With medium activity, a flow with lower throughput but lower latency would be better, and thus count should matter a bit less than long polling.

As a consequence, I think count should not be such a hard constraint on polling: its impact should matter most when long polling makes less sense, i.e. under high activity.
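To put rough numbers on the three cases above, here is a back-of-the-envelope estimate of how long the first task is held by a pop that waits for count tasks. It assumes a single worker, evenly spaced task arrivals, and no retry granularity; the class and method names are made up for this example.

```java
/** Back-of-the-envelope model; assumes one worker and evenly spaced task arrivals. */
class HoldTimeEstimate {
    /**
     * Estimated time (ms) the first task is held by a pop that waits for `count`
     * tasks: either the remaining count-1 tasks arrive, or the timeout elapses,
     * whichever comes first.
     */
    static long firstTaskHoldMs(int count, long timeoutMs, long arrivalIntervalMs) {
        long timeToFill = (long) (count - 1) * arrivalIntervalMs;
        return Math.min(timeoutMs, timeToFill);
    }
}
```

With count=10 and timeout=5s, this gives 90 ms when a task arrives every 10 ms (high activity), 4.5 s when a task arrives every 500 ms (medium activity), and the full 5 s when a task arrives every second or slower (low activity). With count=1 the estimate is always 0.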

I guess there are multiple ways to fix it:

pop until 1 or more tasks have been retrieved OR until the time has elapsed.
pop until at least half of the requested count has been retrieved OR until the time has elapsed.
maybe change the strategy depending on the value of count: if count is something like 1000, then keep the current behaviour, as the server always expects high load and throughput is preferred. If count is low (like 3), then do it as a best effort, returning up to 3 tasks within the time period (just like the first option).
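The options above can be sketched with one extra parameter: a minimum number of messages below which the pop keeps waiting. This is only an illustration of the idea on an in-memory queue, not a patch against the DAOs; ThresholdPopSketch, minCount, the 50 ms retry interval, and the cut-off of 100 for "high" counts are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Illustrative sketch only; names and thresholds are hypothetical, not Conductor APIs. */
class ThresholdPopSketch {
    static final long RETRY_INTERVAL_MS = 50;    // hypothetical retry interval
    static final int HIGH_COUNT_THRESHOLD = 100; // hypothetical cut-off for the third option

    /** First option: return as soon as one message is available. */
    static int atLeastOne(int count) {
        return 1;
    }

    /** Second option: wait for at least half of the requested count (rounded up). */
    static int atLeastHalf(int count) {
        return (count + 1) / 2;
    }

    /** Third option: keep the current behaviour for large counts, best effort otherwise. */
    static int adaptive(int count) {
        return count >= HIGH_COUNT_THRESHOLD ? count : 1;
    }

    /** Pops up to `count` messages, returning once `minCount` are held or the timeout elapses. */
    static List<String> pop(Queue<String> queue, int count, int minCount, long timeoutMs) {
        long start = System.nanoTime();
        List<String> popped = new ArrayList<>();
        drain(queue, popped, count);
        while (popped.size() < minCount && elapsedMs(start) < timeoutMs) {
            try {
                Thread.sleep(RETRY_INTERVAL_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            drain(queue, popped, count);
        }
        return popped;
    }

    /** Moves messages from the queue into `into` until `count` are held or the queue is empty. */
    private static void drain(Queue<String> queue, List<String> into, int count) {
        String next;
        while (into.size() < count && (next = queue.poll()) != null) {
            into.add(next);
        }
    }

    private static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }
}
```

For example, pop(queue, 3, ThresholdPopSketch.atLeastOne(3), 5000) returns right away when one task is waiting, and still long-polls for up to 5 s when the queue is empty. Passing minCount equal to count reproduces the current behaviour.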

What do you think?

Apr 26 '24 08:04 Jiehong

👋 Hi @Jiehong

We're currently reviewing open issues in the Conductor OSS backlog, and noticed that this issue hasn't been addressed.

To help us keep the backlog focused and actionable, we’d love your input:

Is this issue still relevant?
Has the problem been resolved in the latest version v3.21.12?
Do you have any additional context or updates to provide?

If we don’t hear back in the next 14 days, we’ll assume this issue is no longer active and will close it for housekeeping. Of course, if it's still a valid issue, just let us know and we’ll keep it open!

Thanks for contributing to Conductor OSS! We appreciate your support. 🙌

Jeff Bull

Developer Community Manager | Orkes


Feb 27 '25 00:02 jeffbulltech

The code linked has not changed since then, so the behaviour probably hasn't either (can't test with latest conductor version at the moment).

Thinking about it again, there might be another way to deal with this issue: instead of all workers polling with a high count to favour throughput, a mix of workers with count=1 and count>1 could deal better with low activity. That's a guess, but it seems likely.

Feb 27 '25 08:02 Jiehong

Thanks for getting back to me, @Jiehong. I'll make sure this issue remains open so it can be reviewed for an upcoming release.

Feb 27 '25 17:02 jeffbulltech
