
RPC timeout when forking thousands of subflows

Open monoc44 opened this issue 5 years ago • 1 comments

We created two flows (Flow1, Flow2) and one activity (Activity1) for testing Cadence.

  • Activity1 -> returns an integer
  • Flow1 -> calls Activity1 and uses the returned integer as the number of Flow2 child workflows to create, then starts them asynchronously in a for loop
  • Flow2 -> returns "Hello World"
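For reference, the structure described above might look roughly like this in cadence-java-client. This is a minimal sketch, not our actual code: the interface and method names (`Activity1.getChildCount`, `Flow2.run`, `Flow1.run`) and the 10-second activity timeout are invented for illustration, while `Workflow.newChildWorkflowStub`, `Async.function`, and `Promise.allOf` are the standard client APIs for starting child workflows asynchronously:

```java
import com.uber.cadence.activity.ActivityMethod;
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.workflow.Async;
import com.uber.cadence.workflow.Promise;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

interface Activity1 {
    @ActivityMethod
    int getChildCount(); // configured to return 5000 in the test setup
}

interface Flow2 {
    @WorkflowMethod
    String run(); // returns "Hello World"
}

interface Flow1 {
    @WorkflowMethod
    void run();
}

class Flow1Impl implements Flow1 {

    private final Activity1 activity1 =
            Workflow.newActivityStub(
                    Activity1.class,
                    new ActivityOptions.Builder()
                            .setScheduleToCloseTimeout(Duration.ofSeconds(10))
                            .build());

    @Override
    public void run() {
        int count = activity1.getChildCount();
        List<Promise<String>> results = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            // A child workflow stub is bound to a single execution,
            // so a fresh stub is created on every iteration.
            Flow2 child = Workflow.newChildWorkflowStub(Flow2.class);
            results.add(Async.function(child::run));
        }
        // Block until every child completes.
        Promise.allOf(results).get();
    }
}
```

Every iteration of the loop adds a StartChildWorkflowExecution decision, so with count = 5000 a single decision task carries thousands of decisions — which is where the RPC in the log below times out.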

For testing the scalability of the system, we started:

  • docker-compose (up)
  • one worker configured to handle Flow1, Flow2 and Activity1 requests
  • set Activity1 to return 5000

What we observed:

  • Flow1 is created (status open in the web client)
  • None of the 5000 expected Flow2 executions get created (at least none of them shows up in the web client)
  • In Flow1, the worker keeps iterating over the 5000 Flow2 starts, then logs an error, then iterates over them again (an infinite loop; replay in action, I suppose)
  • If we stop the worker and restart it, it goes back into the same infinite loop (replay)

Here is the error log:

[...]
18:15:02.792 INFO  nce.worflows.Flow1 - Starting Flow2-4895
18:15:02.792 INFO  nce.worflows.Flow1 - Starting Flow2-4896
18:15:02.792 INFO  nce.worflows.Flow1 - Starting Flow2-4897
18:15:02.793 INFO  nce.worflows.Flow1 - Starting Flow2-4898
18:15:02.793 INFO  nce.worflows.Flow1 - Starting Flow2-4899
18:15:03.870 WARN  adence.internal.common.Retryer -      - Retrying after failure
org.apache.thrift.TException: Rpc error:<ErrorResponse id=3 errorType=Timeout message=Request timeout after 1004ms>
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(WorkflowServiceTChannel.java:505)
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(WorkflowServiceTChannel.java:480)
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.respondDecisionTaskCompleted(WorkflowServiceTChannel.java:918)
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RespondDecisionTaskCompleted$10(WorkflowServiceTChannel.java:907)
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(WorkflowServiceTChannel.java:525)
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RespondDecisionTaskCompleted(WorkflowServiceTChannel.java:905)
	at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendReply$0(WorkflowWorker.java:302)
	at com.uber.cadence.internal.common.Retryer.lambda$retry$0(Retryer.java:104)
	at com.uber.cadence.internal.common.Retryer.retryWithResult(Retryer.java:122)
	at com.uber.cadence.internal.common.Retryer.retry(Retryer.java:101)
	at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.sendReply(WorkflowWorker.java:302)
	at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:262)
	at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:230)
	at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)

Note: On the Cassandra side, the EXECUTIONS table gets new rows inserted at each iteration and keeps growing, even though Flow1 should only fork 5000 sub-flows.

Any idea what the problem is? Kind of a bummer for our POC :(

Thanks in advance for your help, -Frederic

monoc44 avatar Jul 31 '19 01:07 monoc44

While you can start 5000 children by increasing the timeout, it is something of an anti-pattern. We recommend around 1000 children per workflow. If you need more, use a tree approach: say a workflow starts 100 children and each of those children starts 100 of its own — you end up with 10k children in a scalable manner.

The general pattern is that Cadence scales out: it can run a practically unlimited number of workflows. It does not scale up, meaning the size of any single workflow is limited.
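The arithmetic behind the tree approach can be checked in a few lines: with branching factor b and depth d, no single workflow ever starts more than b children, yet the tree reaches b^d leaf workflows. (The class and method names here are illustrative, not part of cadence-java-client.)

```java
public class FanOutMath {

    /** Leaf workflows reached with branching factor b and depth d, i.e. b^d. */
    static long leaves(long b, int d) {
        long total = 1;
        for (int i = 0; i < d; i++) {
            total *= b;
        }
        return total;
    }

    public static void main(String[] args) {
        // 100 children per workflow, two levels deep: 10,000 leaves,
        // but each parent workflow only ever starts 100 children.
        System.out.println(leaves(100, 2)); // 10000
        // A third level reaches a million leaves under the same per-workflow limit.
        System.out.println(leaves(100, 3)); // 1000000
    }
}
```

Each intermediate node is itself a cheap workflow whose only job is to start its own batch of children, which keeps every workflow's history well under the recommended ~1000-children limit.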

mfateev avatar Aug 05 '19 22:08 mfateev