Delay in starting a scheduled workflow
Expected Behavior
The delay between a workflow task being scheduled and the workflow actually starting should be quite small. It certainly should not be on the order of a minute.
Actual Behavior
When running multiple workflows in a loop, we see some of the workflows waiting for a long time in the Scheduled state (on the order of a minute) before starting. For example, from Temporal's Web UI:
That task was scheduled at 17:07:28:35 and ran at 17:08:28:12.
This was seen with a modified Mutex Sample; see the Steps below for details.
A few points:
- As mentioned, this does not happen with all workflows
- We could not reproduce this with the Java SDK doing the equivalent of the Mutex Sample (in Clojure, actually)
- We tried to keep close to the sample to avoid introducing extraneous parameters, but it is possible some tweaks to the sample are required to run correctly. Having said that, we did try adding more parameters (such as `MaxConcurrentActivityTaskPollers`, see the worker-options sketch after this list) to no avail.
- While the reproduction uses `temporalite`, we first saw the problem in a `temporal` service deployment.
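For reference, a minimal sketch of the kind of worker options we tried. The task queue name and poller counts here are illustrative placeholders, not the exact values from the branch:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the local Temporal service (temporalite in this reproduction).
	c, err := client.Dial(client.Options{HostPort: client.DefaultHostPort})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	// "mutex" and the poller counts below are illustrative placeholders.
	w := worker.New(c, "mutex", worker.Options{
		MaxConcurrentWorkflowTaskPollers: 8,
		MaxConcurrentActivityTaskPollers: 8,
	})

	// The mutex sample's workflows and activities would be registered here.

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("unable to start worker", err)
	}
}
```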
There is also some discussion on the forum: https://community.temporal.io/t/delay-in-scheduling-a-workflow/5996
Steps to Reproduce the Problem
Setup
- Pull https://github.com/muralisrini/samples-go/tree/mutex_add_workflows
- cd to the `mutex/worker` directory and `go build`
- cd to the `mutex/starter` directory and `go build`
- run `./temporalite start --namespace default`
- run `worker/worker` # call this the Worker window
Start
- run `starter/starter` # call this the Starter window
The starter runs the original starter code with 2 workflows, but in a loop of 8, for 16 workflows in total. You should see the Worker window stall for a long time for some workflows. The Web UI should show delays similar to the one shown above.
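A rough sketch of the shape of that starter is below. The workflow name, argument, task queue, and exactly where results are awaited are assumptions based on the upstream mutex sample, not the exact code in the branch:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{HostPort: client.DefaultHostPort})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	// Outer loop of 8, two workflow starts per iteration. The workflow name,
	// argument, and task queue are placeholders for the mutex sample's values.
	var runs []client.WorkflowRun
	for i := 0; i < 8; i++ {
		for j := 0; j < 2; j++ {
			wo := client.StartWorkflowOptions{
				ID:        fmt.Sprintf("mutex-workflow-%d-%d", i, j),
				TaskQueue: "mutex",
			}
			we, err := c.ExecuteWorkflow(context.Background(), wo, "SampleWorkflowWithMutex", "resource-a")
			if err != nil {
				log.Fatalln("unable to start workflow", err)
			}
			runs = append(runs, we)
		}
	}

	// The starter then waits for every workflow to complete.
	for _, r := range runs {
		if err := r.Get(context.Background(), nil); err != nil {
			log.Println("workflow failed:", err)
		}
	}
}
```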
Specifications
- Version: v0.17.0
- Platform: Ubuntu
Has anyone had a chance to take a look at this?
I believe this is due to your starter explicitly waiting on workflow completion which would block on the lock:
https://github.com/muralisrini/samples-go/blob/mutex_add_workflows/mutex/starter/main.go#L52-L60
If you remove this (start executions async and don't wait for the result), you should not see the delay you are describing in this issue.
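Roughly, the suggestion is the following (a sketch; the workflow name, argument, and task queue are stand-ins for the sample's values):

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{HostPort: client.DefaultHostPort})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	// Start the execution and return immediately. Skipping we.Get(...)
	// means the starter never blocks behind the mutex.
	// Workflow name, argument, and task queue are placeholders.
	we, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
		ID:        "mutex-workflow-async",
		TaskQueue: "mutex",
	}, "SampleWorkflowWithMutex", "resource-a")
	if err != nil {
		log.Fatalln("unable to start workflow", err)
	}
	log.Println("started workflow", we.GetID(), we.GetRunID())
}
```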
I am seeing this issue as well without the start executions waiting for results at all. The worker seems to stall/stop receiving tasks under the conditions outlined below. I have observed the following:
- as long as a particular task queue is kept busy, no delay will happen
- when a task queue becomes empty and additional workflows are then submitted, the delay will be experienced
- nothing seems to process for approx. 1 minute from the time the task queue becomes empty
- during the delay condition, if the worker is stopped and restarted, workflows immediately start processing again
- if multiple workers are servicing a task queue, the delay will not happen (see the sketch after this list)
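On that last point, running a second worker against the same task queue was enough to avoid the delay. A minimal sketch of that setup (the task queue name is a placeholder and workflow/activity registration is elided):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{HostPort: client.DefaultHostPort})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	// Two workers polling the same task queue; with a second worker attached
	// the delay described above did not occur. Register the mutex sample's
	// workflows and activities on both workers before starting them.
	w1 := worker.New(c, "mutex", worker.Options{})
	w2 := worker.New(c, "mutex", worker.Options{})

	if err := w1.Start(); err != nil {
		log.Fatalln("unable to start worker 1", err)
	}
	defer w1.Stop()

	if err := w2.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("unable to run worker 2", err)
	}
}
```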
I have reproduced this with simple changes to the mutex sample:
starter/main.go https://github.com/manetumatt/samples-go/blob/debug1/mutex/starter/main.go
I also shortened the critical section delay to 500ms:
mutex_workflow.go https://github.com/manetumatt/samples-go/blob/debug1/mutex/mutex_workflow.go#L208
This sample code executes 5 workflows, which start processing immediately. The code then sleeps for 10 seconds and executes a second set of 5 workflows. The second set will not start processing for nearly a minute.
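The shape of that reproduction is roughly the following. This is a sketch rather than the exact code from the linked branch, and the workflow name, argument, and task queue are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

// startBatch starts n workflow executions without waiting for their results.
// Workflow name, argument, and task queue are placeholders for the sample's values.
func startBatch(c client.Client, label string, n int) {
	for i := 0; i < n; i++ {
		wo := client.StartWorkflowOptions{
			ID:        fmt.Sprintf("mutex-%s-%d", label, i),
			TaskQueue: "mutex",
		}
		we, err := c.ExecuteWorkflow(context.Background(), wo, "SampleWorkflowWithMutex", "resource-a")
		if err != nil {
			log.Fatalln("unable to start workflow", err)
		}
		log.Println("started", we.GetID())
	}
}

func main() {
	c, err := client.Dial(client.Options{HostPort: client.DefaultHostPort})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	startBatch(c, "batch1", 5) // first 5 start processing immediately
	time.Sleep(10 * time.Second)
	startBatch(c, "batch2", 5) // second 5 sit in Scheduled for nearly a minute
}
```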
Thanks for the update. Can you share the server version you are running against?
Thank you for the quick reply. First encountered the issue against v1.18.0. Switched to Temporalite v0.3.0 (server 1.19.0, I think) for local debugging and testing. Saw the worker stall issue against both backends.
From Temporalite:

```
(base) mltmbp:samples mtaylor$ tctl adm cl d
{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-php": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0",
    "temporal-typescript": "\u003c2.0.0",
    "temporal-ui": "\u003c3.0.0"
  },
  "serverVersion": "1.19.0",
  "membershipInfo": {
    "currentHost": {
      "identity": "127.0.0.1:7233"
```
Still seeing this issue with the latest version. Has anyone had a chance to look at this?
I believe this is fixed by https://github.com/temporalio/temporal/pull/4562, which is included in server 1.21.1 and above. Please try the latest server version and see if you can still reproduce it.