sdk-java icon indicating copy to clipboard operation
sdk-java copied to clipboard

Flaky testShouldReturnQueryResultAfterWorkflowTimeout

Open Spikhalskiy opened this issue 4 years ago • 3 comments

Actual Behavior

java-sdk builds in buildkite started to fail often on flaky TerminatedWorkflowQueryTest#testShouldReturnQueryResultAfterWorkflowTimeout

Stacktrace from a locally reproduced failure:

io.temporal.client.WorkflowQueryException: workflowId='fa83465b-18c6-424c-a959-0259996b2876', runId='72baee4f-8b00-4914-aace-db0d505606ed', workflowType='testActivity'}
	at io.temporal.internal.sync.WorkflowStubImpl.query(WorkflowStubImpl.java:349)
	at io.temporal.testing.TestWorkflowEnvironmentInternal$TimeLockingInterceptor$TimeLockingWorkflowStub.query(TestWorkflowEnvironmentInternal.java:369)
	at io.temporal.internal.sync.WorkflowInvocationHandler$SyncWorkflowInvocationHandler.queryWorkflow(WorkflowInvocationHandler.java:309)
	at io.temporal.internal.sync.WorkflowInvocationHandler$SyncWorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:272)
	at io.temporal.internal.sync.WorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:178)
	at com.sun.proxy.$Proxy20.getTrace(Unknown Source)
	at io.temporal.workflow.TerminatedWorkflowQueryTest.testShouldReturnQueryResultAfterWorkflowTimeout(TerminatedWorkflowQueryTest.java:64)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at io.temporal.testing.TestWorkflowRule$2.evaluate(TestWorkflowRule.java:273)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: java.lang.IllegalArgumentException: Unknown query type: getTrace, knownTypes=[]
	at io.temporal.internal.sync.QueryDispatcher.handleQuery(QueryDispatcher.java:79)
	at io.temporal.internal.sync.SyncWorkflowContext.handleQuery(SyncWorkflowContext.java:244)
	at io.temporal.internal.sync.WorkflowExecuteRunnable.handleQuery(WorkflowExecuteRunnable.java:76)
	at io.temporal.internal.sync.SyncWorkflow.query(SyncWorkflow.java:179)
	at io.temporal.internal.replay.ReplayWorkflowExecutor.query(ReplayWorkflowExecutor.java:137)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleQueryWorkflowTask(ReplayWorkflowRunTaskHandler.java:255)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleQueryOnlyWorkflowTask(ReplayWorkflowTaskHandler.java:257)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:112)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:319)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:279)
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.queryWorkflow(WorkflowServiceGrpc.java:2970)
	at io.temporal.internal.external.GenericWorkflowClientExternalImpl.lambda$query$5(GenericWorkflowClientExternalImpl.java:216)
	at io.temporal.serviceclient.GrpcRetryer.retryWithResult(GrpcRetryer.java:95)
	at io.temporal.internal.external.GenericWorkflowClientExternalImpl.query(GenericWorkflowClientExternalImpl.java:209)
	at io.temporal.internal.client.RootWorkflowClientInvoker.query(RootWorkflowClientInvoker.java:142)
	at io.temporal.internal.sync.WorkflowStubImpl.query(WorkflowStubImpl.java:342)
	... 19 more

Spikhalskiy avatar Aug 15 '21 03:08 Spikhalskiy

Related to #391 and shows that the #391 issue wasn't addressed by #531

Spikhalskiy avatar Sep 08 '21 18:09 Spikhalskiy

After some investigation, I've found that the issue is happening because when workflow task times out and we replay the history

EVENT_TYPE_WORKFLOW_EXECUTION_STARTED
EVENT_TYPE_WORKFLOW_TASK_SCHEDULED
EVENT_TYPE_WORKFLOW_TASK_STARTED
EVENT_TYPE_WORKFLOW_TASK_FAILED
EVENT_TYPE_WORKFLOW_EXECUTION_TIMED_OUT

It creates a brand new QueryDispatcher object that doesn't have query handlers by default, and because we never run event loop during replay workflow doesn't get a chance to initialize query handlers again. Hence when we run a query after workflow timed out we see no query handlers available. Proposed fix is to trigger event loop somewhere during replay. Alternatively we could move out query handler initialization and call it earlier so that it gets triggered during the replay.

vitarb avatar Sep 16 '21 04:09 vitarb

The result depends on if workflow run times out before workflow task heartbeat (local activity is used in that test) is received by the server or after.

Spikhalskiy avatar Apr 23 '22 04:04 Spikhalskiy