Workflow executions are getting stuck

Open arorashivam opened this issue 1 year ago • 1 comments

Describe the bug Workflow executions are getting stuck due to tasks taking too long to schedule.

Further debugging details:

  1. In the sweeper flow, if a task is in the SCHEDULED state and no taskDefinition is present, the un-ack time is set to workflowTimeout. In other words, the sweeper will not sweep this workflow again until workflowTimeout has elapsed.
  2. Note: I am not sure whether the un-ack timeout is reset once the task moves from SCHEDULED to IN_PROGRESS.
  3. As a result, any workflow execution that reaches a state where it depends on the sweeper to trigger the decide remains stuck.
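
The fallback described in the list above can be modelled with a minimal sketch. This is not Conductor's actual sweeper code: the names `Task`, `TaskDef`, `unack_timeout_seconds`, the `timeout_seconds` field, and the 30-second default are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskDef:
    # Hypothetical stand-in for a timeout configured on the task definition.
    timeout_seconds: int = 0


@dataclass
class Task:
    status: str
    task_def: Optional[TaskDef] = None


def unack_timeout_seconds(task: Task, workflow_timeout_seconds: int,
                          default_seconds: int = 30) -> int:
    """Return how long the workflow stays hidden from the sweeper (assumed model)."""
    if task.status == "SCHEDULED":
        if task.task_def is not None and task.task_def.timeout_seconds > 0:
            # A task-level timeout is available, so the delay stays short.
            return task.task_def.timeout_seconds
        if workflow_timeout_seconds > 0:
            # No usable task definition: fall back to the workflow timeout,
            # which is the behaviour reported in this issue.
            return workflow_timeout_seconds
    # Assumed default for every other case.
    return default_seconds


# A SCHEDULED task without a task definition in a workflow with a 24 h timeout:
print(unack_timeout_seconds(Task("SCHEDULED"), 86400))
```

Under these assumptions, the workflow would not be swept again for a full day, even if the task moves to IN_PROGRESS a few seconds later.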

Details

  • Conductor version: 3.20.0
  • Persistence implementation: Postgres
  • Queue implementation: Dynoqueues
  • Lock: Redis
  • Workflow definition: N/A
  • Task definition: N/A
  • Event handler definition: N/A

To Reproduce Steps to reproduce the behavior:

Go to '...' Click on '....' Scroll down to '....' See error

Expected behavior The sweeper should continue sweeping a workflow once a task moves from SCHEDULED to IN_PROGRESS.

arorashivam avatar Jul 16 '24 13:07 arorashivam

I'd like to add some additional context to this issue.

As noted above,

In sweeper flow, If a task is in SCHEDULED state, the un-ack time is set as workflowTimeout if taskDefinition is not present. In other words the sweeper will now only sweep this workflow after workflowTimeout.

This issue has been observed for async system tasks, but it could also occur for SIMPLE tasks if no timeouts are set on the TaskDefinition but a timeout is set on the workflow. These types of tasks do not transition from SCHEDULED to IN_PROGRESS within a "decide", so the sweep can pick them up while they are still in the SCHEDULED state.

Having a timely workflow sweep is critical in cases where an execution lock cannot be obtained for some reason, because the decide is deliberately deferred to the sweep in that case. Furthermore, we have seen issues with the JOIN task when it was set to synchronous, as it does not trigger a decide when it completes (this was resolved when it was reverted to async).

It seems like there should be another setting, "maxSweepDelay", to use as the fallback un-ack time, set at the workflow level, the system level, or both.
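
To make the proposal concrete, here is a rough sketch of how such a cap might be applied. Everything in it is an assumption: `capped_unack_seconds` and its parameters are hypothetical names, and the rule that the smaller of the two caps wins when both are set is just one possible choice.

```python
from typing import Optional


def capped_unack_seconds(computed_seconds: int,
                         workflow_max_sweep_delay: Optional[int] = None,
                         system_max_sweep_delay: Optional[int] = None) -> int:
    """Clamp the computed un-ack time to the configured maximum sweep delay."""
    # Ignore caps that are unset or non-positive.
    caps = [c for c in (workflow_max_sweep_delay, system_max_sweep_delay)
            if c is not None and c > 0]
    # With no caps configured, the computed value passes through unchanged.
    return min([computed_seconds, *caps])


# Workflow timeout of 24 h, system-wide cap of 10 minutes:
print(capped_unack_seconds(86400, system_max_sweep_delay=600))
```

With a cap like this, a workflow whose un-ack time would otherwise fall back to workflowTimeout is still revisited by the sweeper within the configured delay.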

lbestatlas avatar Sep 11 '24 03:09 lbestatlas

👋 Hi @arorashivam @lbestatlas

We're currently reviewing open issues in the Conductor OSS backlog, and noticed that this issue hasn't been addressed.

To help us keep the backlog focused and actionable, we’d love your input:

  • Is this issue still relevant?
  • Has the problem been resolved in the latest version v3.21.12?
  • Do you have any additional context or updates to provide?

If we don’t hear back in the next 14 days, we’ll assume this issue is no longer active and will close it for housekeeping. Of course, if it's still a valid issue, just let us know and we’ll keep it open!

Thanks for contributing to Conductor OSS! We appreciate your support. 🙌

Jeff Bull

Developer Community Manager | Orkes

jeffbulltech avatar Feb 27 '25 01:02 jeffbulltech

Agreed, the MaxPostponeDurationSeconds setting added in v3.21.12 does mitigate the issue.

lbestatlas avatar Mar 06 '25 06:03 lbestatlas

Agreed the MaxPostponeDurationSeconds setting added in v3.21.12 does mitigate the issue.

Thanks @lbestatlas I'll go ahead and close this issue.

jeffbulltech avatar Mar 06 '25 17:03 jeffbulltech