cadence-java-client
RPC timeout when forking thousands of subflows
We created two flows (Flow1, Flow2) and one activity (Activity1) for testing Cadence.
- Activity1 -> returns an integer
- Flow1 -> calls Activity1 and uses the returned result to determine how many child workflows of type Flow2 to create, then starts them asynchronously in a for loop (see the sketch after this list)
- Flow2 -> returns "Hello World"
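For reference, here is a minimal sketch of what that fan-out looks like with the cadence-java-client async APIs. Method names like getChildCount/sayHello, the timeout, and the logging are illustrative, not our exact POC code:

```java
import com.uber.cadence.activity.ActivityMethod;
import com.uber.cadence.workflow.Async;
import com.uber.cadence.workflow.Promise;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;
import java.util.ArrayList;
import java.util.List;
import org.slf4j.Logger;

// Collapsed into one listing for brevity; each type lives in its own file in practice.
public interface Activity1 {
    @ActivityMethod(scheduleToCloseTimeoutSeconds = 60)
    int getChildCount(); // returns 5000 in our test
}

public interface Flow2 {
    @WorkflowMethod
    String sayHello(); // returns "Hello World"
}

public interface Flow1 {
    @WorkflowMethod
    void run();
}

public class Flow1Impl implements Flow1 {
    private static final Logger logger = Workflow.getLogger(Flow1Impl.class);
    private final Activity1 activity1 = Workflow.newActivityStub(Activity1.class);

    @Override
    public void run() {
        int count = activity1.getChildCount();
        List<Promise<?>> results = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Flow2 child = Workflow.newChildWorkflowStub(Flow2.class);
            logger.info("Starting Flow2-" + i);
            // Async start: every iteration adds a child-workflow start to the
            // same decision, so all 5000 starts end up in one decision result.
            results.add(Async.function(child::sayHello));
        }
        // Block until all children complete.
        Promise.allOf(results).get();
    }
}
```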
For the scalability test, we set up:
- docker-compose (up)
- one worker configured to handle Flow1, Flow2, and Activity1 requests (registration sketched after this list)
- Activity1 set to return 5000
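For completeness, the worker registration looks roughly like this (domain and task-list names are placeholders, Flow2Impl/Activity1Impl are the hypothetical implementations of the types above, and the factory API varies slightly across client versions):

```java
import com.uber.cadence.worker.Worker;

// Assumes a locally running Cadence server (docker-compose) and a registered domain.
Worker.Factory factory = new Worker.Factory("test-domain");
Worker worker = factory.newWorker("test-task-list");
worker.registerWorkflowImplementationTypes(Flow1Impl.class, Flow2Impl.class);
worker.registerActivitiesImplementations(new Activity1Impl());
factory.start();
```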
What we observed:
- Flow1 is created (status open in web client)
- None of the 5000 expected Flow2 executions gets created (at least none of them is displayed in the web client)
- In Flow1, the worker keeps iterating over the 5000 Flow2 starts, then logs an error, and then iterates again (infinite loop; replay in action, I suppose)
- If we stop the worker and restart it, it goes back into the infinite loop (replay)
Here is the error log:
[...]
18:15:02.792 INFO nce.worflows.Flow1 - Starting Flow2-4895
18:15:02.792 INFO nce.worflows.Flow1 - Starting Flow2-4896
18:15:02.792 INFO nce.worflows.Flow1 - Starting Flow2-4897
18:15:02.793 INFO nce.worflows.Flow1 - Starting Flow2-4898
18:15:02.793 INFO nce.worflows.Flow1 - Starting Flow2-4899
18:15:03.870 WARN adence.internal.common.Retryer - - Retrying after failure
org.apache.thrift.TException: Rpc error:<ErrorResponse id=3 errorType=Timeout message=Request timeout after 1004ms>
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(WorkflowServiceTChannel.java:505)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(WorkflowServiceTChannel.java:480)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.respondDecisionTaskCompleted(WorkflowServiceTChannel.java:918)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RespondDecisionTaskCompleted$10(WorkflowServiceTChannel.java:907)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(WorkflowServiceTChannel.java:525)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RespondDecisionTaskCompleted(WorkflowServiceTChannel.java:905)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendReply$0(WorkflowWorker.java:302)
at com.uber.cadence.internal.common.Retryer.lambda$retry$0(Retryer.java:104)
at com.uber.cadence.internal.common.Retryer.retryWithResult(Retryer.java:122)
at com.uber.cadence.internal.common.Retryer.retry(Retryer.java:101)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.sendReply(WorkflowWorker.java:302)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:262)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:230)
at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)
Note: On the Cassandra side, the EXECUTIONS table gets new rows inserted at each iteration and hence keeps growing, even though Flow1 should only fork 5000 sub-flows.
Any idea what the problem is? Kind of a bummer for our POC :(
Thanks in advance for your help, -Frederic
While you can start 5000 children by increasing the timeout, it is kind of an anti-pattern. We recommend around 1000 children per workflow. If you need more, use a tree approach: say a workflow starts 100 children and each of those children starts 100 children of its own; you end up with 10,000 children in a scalable manner (sketched below).
The general pattern is that Cadence scales out, as it can host a practically unlimited number of workflows. It doesn't scale up, meaning that the size of a single workflow is limited.
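A rough sketch of that tree approach, reusing the Flow2 interface from the example above (TreeFlow, the batch size, and the depth parameter are illustrative):

```java
import com.uber.cadence.workflow.Async;
import com.uber.cadence.workflow.Promise;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;
import java.util.ArrayList;
import java.util.List;

public interface TreeFlow {
    @WorkflowMethod
    void run(int depth);
}

public class TreeFlowImpl implements TreeFlow {
    private static final int BATCH = 100; // children per node; 100 * 100 = 10,000 leaves

    @Override
    public void run(int depth) {
        List<Promise<?>> children = new ArrayList<>();
        for (int i = 0; i < BATCH; i++) {
            if (depth > 1) {
                // Intermediate node: fan out to another layer of TreeFlow children.
                TreeFlow node = Workflow.newChildWorkflowStub(TreeFlow.class);
                children.add(Async.procedure(node::run, depth - 1));
            } else {
                // Leaf level: start the actual work (Flow2 from the example above).
                Flow2 leaf = Workflow.newChildWorkflowStub(Flow2.class);
                children.add(Async.function(leaf::sayHello));
            }
        }
        // Each workflow only ever waits on its own BATCH children.
        Promise.allOf(children).get();
    }
}
```

Starting TreeFlowImpl with depth = 2 would yield 100 intermediate workflows and 10,000 Flow2 leaves, with no single workflow ever owning more than 100 child-workflow starts.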