
[Bug] Unable to start a batch job if counting the workflows times out

Open dcohen8128 opened this issue 6 months ago • 7 comments

What are you really trying to do?

During some performance testing we ran into an issue and had to stop the test. We were left with about 30 million running workflows that would have taken a very long time to naturally drain so we wanted to batch terminate them.

Describe the bug

We attempted to start a batch terminate job through the CLI by running a command like temporal workflow terminate --query 'ExecutionStatus="Running" AND WorkflowType="TestWorkflow"', but it failed with the message "failed counting workflows from query: context deadline exceeded" and no batch job was submitted.

Minimal Reproduction

  1. On a cluster using a SQL advanced visibility store, spawn a few million workflows
  2. Attempt to batch terminate all of them
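The failing invocation can be sketched concretely as follows (the query values are taken from the description above and are illustrative, not a verbatim transcript):

```shell
# Batch terminate every running workflow of the given type via a visibility query.
# Substitute your own WorkflowType; the values here are examples.
temporal workflow terminate \
  --query 'ExecutionStatus="Running" AND WorkflowType="TestWorkflow"'

# Observed failure when the server-side count times out:
#   failed counting workflows from query: context deadline exceeded
```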

Environment/Versions

  • OS and processor: N/A
  • Temporal Version: Server 1.27.1, CLI 1.3.0
  • Are you using Docker or Kubernetes or building Temporal from source? N/A

Additional context

dcohen8128 avatar Jul 30 '25 23:07 dcohen8128

Do you have your server metrics? It's quite possible that you are running into timeouts on the cluster side.

Check your visibility latencies and errors

histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))

sum(rate(service_errors{operation=~"ListOpenWorkflowExecutions|ListClosedWorkflowExecutions|ListWorkflowExecutions|ScanWorkflowExecutions|CountWorkflowExecutions",service_name="frontend"}[1m]))

tsurdilo avatar Aug 05 '25 13:08 tsurdilo

If you don't see timeouts in service_errors (deadlineexceeded), you can try setting the CLI option --command-timeout to a large value, maybe a couple of minutes, and see if that gives your cluster enough time to return its results.

https://docs.temporal.io/cli/cmd-options#command-timeout
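As a concrete sketch of the suggestion above (the flag is documented at the link; the 10m value is just an illustrative choice):

```shell
# Raise the CLI's client-side timeout so a slow CountWorkflowExecutions call
# has more time to return. Note this only affects the client side; it does
# not change server-side visibility timeouts. The 10m value is illustrative.
temporal workflow terminate \
  --query 'ExecutionStatus="Running" AND WorkflowType="TestWorkflow"' \
  --command-timeout 10m
```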

tsurdilo avatar Aug 05 '25 13:08 tsurdilo

Yes, the timeouts were server-side. We had tried --command-timeout, but that only affects client-side timeouts, so it didn't help.

dcohen8128 avatar Aug 05 '25 16:08 dcohen8128

OK. Do you want to move this conversation to the community Slack to go over your server side and possible steps to eliminate timeouts on the visibility APIs?

can we close this issue?

tsurdilo avatar Aug 05 '25 21:08 tsurdilo

I'm not concerned about the timeouts on the server side; we know that this cluster is not scaled to handle this many workflows due to constraints on our Postgres instance.

I think this issue should remain open; the CLI should be able to create a batch job even if it's unable to get the workflow count.

dcohen8128 avatar Aug 05 '25 21:08 dcohen8128

I think we could likely skip the count if --yes is in use (the count is only used for that confirmation prompt). Would that be acceptable? I'm not sure we want to silently continue on workflow count failure, and I'm not sure we want to add a special skip-count option just for the rare situation where counting workflows fails but a batch job using the same workflow query succeeds.

cretz avatar Aug 05 '25 22:08 cretz

Yeah I'm OK with being able to skip it with the --yes flag.
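Under the proposed change, the workaround would look something like this (a sketch of behavior that is only a proposal in this thread, not a released feature; --yes is the existing flag that suppresses the confirmation prompt):

```shell
# With the proposed change, passing --yes suppresses the confirmation prompt,
# so the CLI would no longer need to count matching workflows before
# submitting the batch job, avoiding the count timeout entirely.
temporal workflow terminate \
  --query 'ExecutionStatus="Running" AND WorkflowType="TestWorkflow"' \
  --yes
```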

dcohen8128 avatar Aug 19 '25 20:08 dcohen8128