
Delay in starting a scheduled workflow

Open muralisrini opened this issue 2 years ago • 7 comments

Expected Behavior

The delay between a workflow task being scheduled and the workflow actually starting should be quite small. It certainly should not be on the order of a minute.

Actual Behavior

When running multiple workflows in a loop, some workflows wait a long time in the Scheduled state (on the order of a minute) before starting. For example, from Temporal's Web UI:

[screenshot from the Temporal Web UI showing the workflow's Scheduled and Started timestamps]

That task was scheduled at 17:07:28:35 and ran at 17:08:28:12.

This was seen with a modified Mutex Sample. See the Steps section below for details.

A few points:

  • As mentioned, this does not happen with all workflows
  • We could not recreate this with the Java SDK doing the equivalent of the Mutex Sample (in Clojure, actually)
  • We tried to stay close to the sample to avoid introducing extraneous parameters, but it is possible some tweaks to the sample are required to run it correctly. Having said that, we did try adding more parameters (such as MaxConcurrentActivityTaskPollers) to no avail.
  • While the reproduction uses Temporalite, we first saw the problem in a Temporal service deployment.

There is also some discussion on the forum: https://community.temporal.io/t/delay-in-scheduling-a-workflow/5996

Steps to Reproduce the Problem

Setup

  1. Pull https://github.com/muralisrini/samples-go/tree/mutex_add_workflows
  2. cd to the mutex/worker directory and go build
  3. cd to the mutex/starter directory and go build
  4. run ./temporalite start --namespace default
  5. run worker/worker # call this Worker window

Start

  1. run starter/starter # call this Starter window

The starter runs the original starter code with 2 workflows, but in a loop of 8 to make 18 workflows in total. You should see the Worker window stall for a long time on some workflows. The Web UI should show delays similar to the one in the screenshot above.
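
For reference, a minimal sketch of what such a starter loop looks like; this is not the linked fork's code, and the task queue name, workflow name, and argument are placeholders (a current SDK's client.Dial is used where older SDKs use client.NewClient):

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// Connect to the local Temporal server (Temporalite in this reproduction).
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	for i := 0; i < 8; i++ {
		// Two workflows per loop iteration, as in the original starter.
		for j := 0; j < 2; j++ {
			run, err := c.ExecuteWorkflow(context.Background(),
				client.StartWorkflowOptions{TaskQueue: "mutex"}, // placeholder task queue
				"SampleWorkflowWithMutex", "resource-A")         // placeholder workflow name and argument
			if err != nil {
				log.Fatalln("unable to start workflow", err)
			}
			// Block until the workflow completes, as the linked starter reportedly does.
			if err := run.Get(context.Background(), nil); err != nil {
				log.Fatalln("workflow failed", err)
			}
		}
	}
}
```

The detail that matters for the discussion below is the blocking run.Get call after each start.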

Specifications

  • Version: v0.17.0
  • Platform: Ubuntu

muralisrini avatar Oct 10 '22 21:10 muralisrini

Has anyone taken a look at this?

muralisrini avatar Dec 13 '22 16:12 muralisrini

I believe this is due to your starter explicitly waiting on workflow completion, which blocks on the lock:

https://github.com/muralisrini/samples-go/blob/mutex_add_workflows/mutex/starter/main.go#L52-L60

If you remove this (start executions asynchronously and don't wait for the result), you should not see the delay described in this issue.
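
For illustration, a minimal sketch of that suggestion, with placeholder names rather than the linked starter code:

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	// Start the execution and return without blocking on its result.
	// Workflow name, task queue, and argument are placeholders.
	run, err := c.ExecuteWorkflow(context.Background(),
		client.StartWorkflowOptions{TaskQueue: "mutex"},
		"SampleWorkflowWithMutex", "resource-A")
	if err != nil {
		log.Fatalln("unable to start workflow", err)
	}
	log.Println("started workflow", run.GetID(), run.GetRunID())
	// No run.Get(...) call here, so the starter does not wait for completion.
}
```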

tsurdilo avatar Dec 13 '22 22:12 tsurdilo

I am seeing this issue as well, even without the started executions waiting for results at all. The worker seems to stall/stop receiving tasks under the conditions outlined here. I have observed the following:

  • as long as a particular taskQueue is kept busy, no delay happens
  • when a taskQueue becomes empty and additional workflows are then submitted, the delay is experienced
  • nothing seems to process for approximately 1 minute from the time the taskQueue becomes empty
  • during the delay condition, if the worker is stopped and restarted, workflows immediately start processing again
  • if multiple workers are servicing a taskQueue, the delay does not happen (see the sketch after this list)

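For what it's worth, a minimal sketch of the multi-worker setup from the last point above, with a placeholder task queue name and the sample's workflow/activity registrations omitted:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	// Two workers polling the same task queue ("mutex" is a placeholder).
	// The sample's workflows and activities would be registered on both
	// workers here (omitted in this sketch).
	w1 := worker.New(c, "mutex", worker.Options{})
	w2 := worker.New(c, "mutex", worker.Options{})

	if err := w1.Start(); err != nil {
		log.Fatalln("unable to start worker 1", err)
	}
	defer w1.Stop()

	// The second worker blocks until the process is interrupted.
	if err := w2.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("unable to run worker 2", err)
	}
}
```
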
I have reproduced this with simple changes to the mutex sample:

starter/main.go https://github.com/manetumatt/samples-go/blob/debug1/mutex/starter/main.go

I also shortened the critical section delay to 500ms:

mutex_workflow.go https://github.com/manetumatt/samples-go/blob/debug1/mutex/mutex_workflow.go#L208

This sample code executes 5 workflows, which start processing immediately. The code then sleeps for 10 seconds and executes a second set of 5 workflows. The second set does not start processing for nearly a minute.
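
A rough sketch of that timing pattern (again with placeholder names, not the linked code):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	startBatch := func(n int) {
		for i := 0; i < n; i++ {
			// Placeholder workflow name, task queue, and argument.
			_, err := c.ExecuteWorkflow(context.Background(),
				client.StartWorkflowOptions{TaskQueue: "mutex"},
				"SampleWorkflowWithMutex", "resource-A")
			if err != nil {
				log.Fatalln("unable to start workflow", err)
			}
		}
	}

	startBatch(5)                // first batch begins processing immediately
	time.Sleep(10 * time.Second) // task queue drains and sits idle
	startBatch(5)                // second batch reportedly sits in Scheduled for ~1 minute
}
```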

manetumatt avatar Dec 14 '22 19:12 manetumatt

Thanks for the update. Can you share the server version you are running against?

tsurdilo avatar Dec 14 '22 21:12 tsurdilo

Thank you for the quick reply. I first encountered the issue against v1.18.0, then switched to Temporalite v0.3.0 (server 1.19.0, I think) for local debugging and testing. I saw the worker stall issue against both backends.

From Temporalite:

```
(base) mltmbp:samples mtaylor$ tctl adm cl d
{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-php": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0",
    "temporal-typescript": "\u003c2.0.0",
    "temporal-ui": "\u003c3.0.0"
  },
  "serverVersion": "1.19.0",
  "membershipInfo": {
    "currentHost": {
      "identity": "127.0.0.1:7233"
```

manetumatt avatar Dec 14 '22 22:12 manetumatt

Still seeing this issue with the latest version. Has anyone had a chance to look at this?

manetumatt avatar Apr 12 '23 14:04 manetumatt

I believe this is fixed by https://github.com/temporalio/temporal/pull/4562, which is included in server 1.21.1 and above. Please try the latest server version and see if you can still reproduce it.

dnr avatar Jul 26 '23 06:07 dnr