Flaky testShouldReturnQueryResultAfterWorkflowTimeout
Actual Behavior
java-sdk builds in buildkite started to fail often on flaky TerminatedWorkflowQueryTest#testShouldReturnQueryResultAfterWorkflowTimeout
Stacktrace from a locally reproduced failure:
io.temporal.client.WorkflowQueryException: workflowId='fa83465b-18c6-424c-a959-0259996b2876', runId='72baee4f-8b00-4914-aace-db0d505606ed', workflowType='testActivity'}
at io.temporal.internal.sync.WorkflowStubImpl.query(WorkflowStubImpl.java:349)
at io.temporal.testing.TestWorkflowEnvironmentInternal$TimeLockingInterceptor$TimeLockingWorkflowStub.query(TestWorkflowEnvironmentInternal.java:369)
at io.temporal.internal.sync.WorkflowInvocationHandler$SyncWorkflowInvocationHandler.queryWorkflow(WorkflowInvocationHandler.java:309)
at io.temporal.internal.sync.WorkflowInvocationHandler$SyncWorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:272)
at io.temporal.internal.sync.WorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:178)
at com.sun.proxy.$Proxy20.getTrace(Unknown Source)
at io.temporal.workflow.TerminatedWorkflowQueryTest.testShouldReturnQueryResultAfterWorkflowTimeout(TerminatedWorkflowQueryTest.java:64)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at io.temporal.testing.TestWorkflowRule$2.evaluate(TestWorkflowRule.java:273)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: java.lang.IllegalArgumentException: Unknown query type: getTrace, knownTypes=[]
at io.temporal.internal.sync.QueryDispatcher.handleQuery(QueryDispatcher.java:79)
at io.temporal.internal.sync.SyncWorkflowContext.handleQuery(SyncWorkflowContext.java:244)
at io.temporal.internal.sync.WorkflowExecuteRunnable.handleQuery(WorkflowExecuteRunnable.java:76)
at io.temporal.internal.sync.SyncWorkflow.query(SyncWorkflow.java:179)
at io.temporal.internal.replay.ReplayWorkflowExecutor.query(ReplayWorkflowExecutor.java:137)
at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleQueryWorkflowTask(ReplayWorkflowRunTaskHandler.java:255)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleQueryOnlyWorkflowTask(ReplayWorkflowTaskHandler.java:257)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:112)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:319)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:279)
at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.queryWorkflow(WorkflowServiceGrpc.java:2970)
at io.temporal.internal.external.GenericWorkflowClientExternalImpl.lambda$query$5(GenericWorkflowClientExternalImpl.java:216)
at io.temporal.serviceclient.GrpcRetryer.retryWithResult(GrpcRetryer.java:95)
at io.temporal.internal.external.GenericWorkflowClientExternalImpl.query(GenericWorkflowClientExternalImpl.java:209)
at io.temporal.internal.client.RootWorkflowClientInvoker.query(RootWorkflowClientInvoker.java:142)
at io.temporal.internal.sync.WorkflowStubImpl.query(WorkflowStubImpl.java:342)
... 19 more
Related to #391 and shows that the #391 issue wasn't addressed by #531
After some investigation, I've found that the issue is happening because when workflow task times out and we replay the history
EVENT_TYPE_WORKFLOW_EXECUTION_STARTED
EVENT_TYPE_WORKFLOW_TASK_SCHEDULED
EVENT_TYPE_WORKFLOW_TASK_STARTED
EVENT_TYPE_WORKFLOW_TASK_FAILED
EVENT_TYPE_WORKFLOW_EXECUTION_TIMED_OUT
It creates a brand new QueryDispatcher object that doesn't have query handlers by default, and because we never run event loop during replay workflow doesn't get a chance to initialize query handlers again. Hence when we run a query after workflow timed out we see no query handlers available. Proposed fix is to trigger event loop somewhere during replay. Alternatively we could move out query handler initialization and call it earlier so that it gets triggered during the replay.
The result depends on if workflow run times out before workflow task heartbeat (local activity is used in that test) is received by the server or after.