
RPC timeout when forking thousands of subflows

Open monoc44 opened this issue 5 years ago • 1 comments

We created two flows (Flow1, Flow2) and one activity (Activity1) for testing Cadence.

  • Activity1 -> returns an integer
  • Flow1 -> calls Activity1 and uses the returned integer as the number of Flow2 child workflows to create, then starts them asynchronously in a for loop
  • Flow2 -> returns "Hello World"
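For reference, the structure described above might look roughly like this in cadence-java-client. This is a minimal sketch, not our actual code: the interface and method names (`Activity1.getChildCount`, `Flow2.run`, `Flow1.run`) and the 10-second activity timeout are invented for illustration, while `Workflow.newChildWorkflowStub`, `Async.function`, and `Promise.allOf` are the standard client APIs for starting child workflows asynchronously:

```java
import com.uber.cadence.activity.ActivityMethod;
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.workflow.Async;
import com.uber.cadence.workflow.Promise;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

interface Activity1 {
    @ActivityMethod
    int getChildCount(); // configured to return 5000 in the test setup
}

interface Flow2 {
    @WorkflowMethod
    String run(); // returns "Hello World"
}

interface Flow1 {
    @WorkflowMethod
    void run();
}

class Flow1Impl implements Flow1 {

    private final Activity1 activity1 =
            Workflow.newActivityStub(
                    Activity1.class,
                    new ActivityOptions.Builder()
                            .setScheduleToCloseTimeout(Duration.ofSeconds(10))
                            .build());

    @Override
    public void run() {
        int count = activity1.getChildCount();
        List<Promise<String>> results = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            // A child workflow stub is bound to a single execution,
            // so a fresh stub is created on every iteration.
            Flow2 child = Workflow.newChildWorkflowStub(Flow2.class);
            results.add(Async.function(child::run));
        }
        // Block until every child completes.
        Promise.allOf(results).get();
    }
}
```

Every iteration of the loop adds a StartChildWorkflowExecution decision, so with count = 5000 a single decision task carries thousands of decisions — which is where the RPC in the log below times out.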

For testing the scalability of the system, we started:

  • docker-compose (up)
  • one worker configured to handle Flow1, Flow2 and Activity1 requests
  • set Activity1 to return 5000

What we observed:

  • Flow1 is created (status open in the web client)
  • None of the 5000 expected Flow2 executions get created (at least none of them shows up in the web client)
  • In Flow1, the worker keeps iterating over the 5000 Flow2 starts, then logs an error, then iterates over them again (an infinite loop; replay in action, I suppose)
  • If we stop the worker and restart it, it goes back into the same infinite loop (replay)

Here is the error log:

[...]
18:15:02.792 INFO  nce.worflows.Flow1 - Starting Flow2-4895
18:15:02.792 INFO  nce.worflows.Flow1 - Starting Flow2-4896
18:15:02.792 INFO  nce.worflows.Flow1 - Starting Flow2-4897
18:15:02.793 INFO  nce.worflows.Flow1 - Starting Flow2-4898
18:15:02.793 INFO  nce.worflows.Flow1 - Starting Flow2-4899
18:15:03.870 WARN  adence.internal.common.Retryer -      - Retrying after failure
org.apache.thrift.TException: Rpc error:<ErrorResponse id=3 errorType=Timeout message=Request timeout after 1004ms>
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(WorkflowServiceTChannel.java:505)
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(WorkflowServiceTChannel.java:480)
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.respondDecisionTaskCompleted(WorkflowServiceTChannel.java:918)
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RespondDecisionTaskCompleted$10(WorkflowServiceTChannel.java:907)
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(WorkflowServiceTChannel.java:525)
	at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RespondDecisionTaskCompleted(WorkflowServiceTChannel.java:905)
	at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendReply$0(WorkflowWorker.java:302)
	at com.uber.cadence.internal.common.Retryer.lambda$retry$0(Retryer.java:104)
	at com.uber.cadence.internal.common.Retryer.retryWithResult(Retryer.java:122)
	at com.uber.cadence.internal.common.Retryer.retry(Retryer.java:101)
	at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.sendReply(WorkflowWorker.java:302)
	at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:262)
	at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:230)
	at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)

Note: On the Cassandra side, the EXECUTIONS table gets new rows inserted at each iteration and keeps growing, even though Flow1 should only fork 5000 sub-flows.

Any idea what the problem is? Kind of a bummer for our POC :(

Thanks in advance for your help, -Frederic

monoc44 avatar Jul 31 '19 01:07 monoc44

While you can start 5000 children by increasing the timeout, it is something of an anti-pattern. We recommend around 1000 children per workflow. If you need more, use a tree approach: say a workflow starts 100 children and each of those children starts 100 of its own — you end up with 10k children in a scalable manner.

The general pattern is that Cadence scales out: it can run a practically unlimited number of workflows. It does not scale up, meaning the size of any single workflow is limited.
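The arithmetic behind the tree approach can be checked in a few lines: with branching factor b and depth d, no single workflow ever starts more than b children, yet the tree reaches b^d leaf workflows. (The class and method names here are illustrative, not part of cadence-java-client.)

```java
public class FanOutMath {

    /** Leaf workflows reached with branching factor b and depth d, i.e. b^d. */
    static long leaves(long b, int d) {
        long total = 1;
        for (int i = 0; i < d; i++) {
            total *= b;
        }
        return total;
    }

    public static void main(String[] args) {
        // 100 children per workflow, two levels deep: 10,000 leaves,
        // but each parent workflow only ever starts 100 children.
        System.out.println(leaves(100, 2)); // 10000
        // A third level reaches a million leaves under the same per-workflow limit.
        System.out.println(leaves(100, 3)); // 1000000
    }
}
```

Each intermediate node is itself a cheap workflow whose only job is to start its own batch of children, which keeps every workflow's history well under the recommended ~1000-children limit.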

mfateev avatar Aug 05 '19 22:08 mfateev