ArithmeticException in ReplayWorkflowExecutor after some amount of requests
Test to reproduce
public class LocalActivityLongTest {
....
@WorkflowInterface
public interface TestWorkflow {
@WorkflowMethod
String workflow(String input);
}
public static class ActivityWorkflow implements TestWorkflow {
private final TestActivity activity = Workflow.newLocalActivityStub(TestActivity.class, LocalActivityOptions.newBuilder().build());
@Override
public String workflow(String input) {return activity.activity(input + "3");}
}
@ActivityInterface
public interface TestActivity {
String activity(String input);
}
private static class ActivityImpl implements TestActivity {
@Override
public String activity(String input) {return "1";}
}
@Test
public void trivialTest() {
Worker worker = testEnvironment.newWorker(TASK_QUEUE);
worker.registerWorkflowImplementationTypes(ActivityWorkflow.class);
worker.registerActivitiesImplementations(new ActivityImpl());
testEnvironment.start();
WorkflowClient client = testEnvironment.getWorkflowClient();
WorkflowOptions options =
WorkflowOptions.newBuilder()
// uncomment this to fix the test
// .setWorkflowExecutionTimeout(Duration.ofHours(1))
.setTaskQueue(TASK_QUEUE)
.build();
for (int reqCount = 1; reqCount < 1000; reqCount++) {
log.info("Request {}", reqCount);
TestWorkflow workflow = client.newWorkflowStub(TestWorkflow.class, options);
String result = workflow.workflow(UUID.randomUUID().toString());
}
}
}
Expected Behavior
Either to:
- fail with a meaningful message on the configuration stage
- or fail with a meaningful message on the first request
- or successfully execute all 1000 requests.
Actual Behavior
Executes the first ~30 requests successfully. Hangs and fails after that with:
23:49:49.698 [Workflow Executor taskQueue="test-workflow", namespace="default": 1] ERROR i.t.internal.worker.PollerOptions - uncaught exception
java.lang.RuntimeException: Failure processing workflow task. WorkflowId=304a033c-e0e6-49ec-9c2d-024c01d533a7, RunId=c63116f5-330a-4e35-9f04-6cdd344acea8
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:324)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:273)
at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:79)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArithmeticException: Duration of this magnitude not supported
at com.uber.m3.util.Duration.checkMultiplicationLongOverflow(Duration.java:340)
at com.uber.m3.util.Duration.ofMillis(Duration.java:109)
at io.temporal.internal.common.ProtobufTimeUtils.ToM3Duration(ProtobufTimeUtils.java:49)
at io.temporal.internal.replay.ReplayWorkflowExecutor.completeWorkflow(ReplayWorkflowExecutor.java:329)
at io.temporal.internal.replay.ReplayWorkflowExecutor.mayBeCompleteWorkflow(ReplayWorkflowExecutor.java:302)
at io.temporal.internal.replay.ReplayWorkflowExecutor.handleWorkflowTaskImpl(ReplayWorkflowExecutor.java:468)
at io.temporal.internal.replay.ReplayWorkflowExecutor.handleWorkflowTask(ReplayWorkflowExecutor.java:403)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:168)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowTaskHandler.java:145)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:104)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:301)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:273)
at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)
Steps to Reproduce the Problem
Please find the full unit test here: https://github.com/Spikhalskiy/java-sdk/commit/821f735afa696d9abc350cc9fb1b12eef1e16ae7. Which hangs after about 30-35 requests with the exception in logs. Also, note that uncommenting https://github.com/Spikhalskiy/java-sdk/commit/821f735afa696d9abc350cc9fb1b12eef1e16ae7#diff-1661ff603d3c1aaff5d01537e8f938b7R79 fixes the unit test.
Specifications
- Version: at least 0.28 and current master [https://github.com/temporalio/java-sdk/commit/35abfd425c13e86f16b81e26c85c07b17ee3448f] is affected
The root cause is the unit test time skipping logic together with a large workflow timeout. The time skipping automatically moves time to the next task in a timer queue. The problem is that it does it for the workflow timeout after workflow completion. When running 1k workflows it jumps 10 years ahead after every execution. And apparently M3 metrics are not ready to deal with 10k year durations.
I'm looking into the fix. Probably we should ignore time skipping for tasks that belong to completed workflows.
There is a report that the similar exception appears in java-sdk 1.8.0 using test-server.
java.lang.ArithmeticException: Duration of this magnitude not supported
at com.uber.m3.util.Duration.checkMultiplicationLongOverflow(Duration.java:340)
at com.uber.m3.util.Duration.ofMillis(Duration.java:109)
at io.temporal.internal.common.ProtobufTimeUtils.toM3Duration(ProtobufTimeUtils.java:49)
at io.temporal.internal.replay.ReplayWorkflowExecutor.completeWorkflow(ReplayWorkflowExecutor.java:109)
at io.temporal.internal.replay.ReplayWorkflowExecutor.eventLoop(ReplayWorkflowExecutor.java:84)
at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler$StatesMachinesCallbackImpl.eventLoop(ReplayWorkflowRunTaskHandler.java:321)
at io.temporal.internal.statemachines.WorkflowStateMachines.eventLoop(WorkflowStateMachines.java:463)
at io.temporal.internal.statemachines.WorkflowStateMachines.lambda$scheduleLocalActivityTask$8d414bd$1(WorkflowStateMachines.java:790)
at io.temporal.internal.statemachines.LocalActivityStateMachine.notifyResultFromResponse(LocalActivityStateMachine.java:326)
at io.temporal.internal.statemachines.FixedTransitionAction.apply(FixedTransitionAction.java:45)
at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:147)
... 17 common frames omitted
Wrapped by: java.lang.RuntimeException: LocalActivity: failure executing MARKER_COMMAND_CREATED->RECORD_MARKER, transition history is [CREATED->CHECK_EXECUTION_STATE, EXECUTING->SCHEDULE, REQUEST_PREPARED->MARK_AS_SENT, REQUEST_SENT->HANDLE_RESULT]
at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:151)
at io.temporal.internal.statemachines.StateMachine.handleCommand(StateMachine.java:111)
at io.temporal.internal.statemachines.EntityStateMachineBase.handleCommand(EntityStateMachineBase.java:55)
at io.temporal.internal.statemachines.CancellableCommand.handleCommand(CancellableCommand.java:63)
at io.temporal.internal.statemachines.WorkflowStateMachines.prepareImpl(WorkflowStateMachines.java:413)
at io.temporal.internal.statemachines.WorkflowStateMachines.prepareCommands(WorkflowStateMachines.java:396)
at io.temporal.internal.statemachines.WorkflowStateMachines.handleLocalActivityCompletion(WorkflowStateMachines.java:768)
at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.waitAndProcessLocalActivityCompletion(ReplayWorkflowRunTaskHandler.java:307)
at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.processLocalActivityRequests(ReplayWorkflowRunTaskHandler.java:282)
at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:146)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:122)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:97)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:241)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:199)
at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
No known reproduction of this issue for more than a year. Reopen if there is a new report.