sdk-java icon indicating copy to clipboard operation
sdk-java copied to clipboard

ArithmeticException in ReplayWorkflowExecutor after some amount of requests

Open Spikhalskiy opened this issue 5 years ago • 2 comments

Test to reproduce

public class LocalActivityLongTest {
  ....
  @WorkflowInterface
  public interface TestWorkflow {
    @WorkflowMethod
    String workflow(String input);
  }

  public static class ActivityWorkflow implements TestWorkflow {
    private final TestActivity activity = Workflow.newLocalActivityStub(TestActivity.class, LocalActivityOptions.newBuilder().build());
    @Override
    public String workflow(String input) {return activity.activity(input + "3");}
  }

  @ActivityInterface
  public interface TestActivity {
    String activity(String input);
  }

  private static class ActivityImpl implements TestActivity {
    @Override
    public String activity(String input) {return "1";}
  }

  @Test
  public void trivialTest() {
    Worker worker = testEnvironment.newWorker(TASK_QUEUE);
    worker.registerWorkflowImplementationTypes(ActivityWorkflow.class);
    worker.registerActivitiesImplementations(new ActivityImpl());

    testEnvironment.start();
    WorkflowClient client = testEnvironment.getWorkflowClient();
    WorkflowOptions options =
        WorkflowOptions.newBuilder()
            // uncomment this to fix the test
            // .setWorkflowExecutionTimeout(Duration.ofHours(1))
            .setTaskQueue(TASK_QUEUE)
            .build();

    for (int reqCount = 1; reqCount < 1000; reqCount++) {
      log.info("Request {}", reqCount);
      TestWorkflow workflow = client.newWorkflowStub(TestWorkflow.class, options);
      String result = workflow.workflow(UUID.randomUUID().toString());
    }
  }
}

Expected Behavior

Either to:

  1. fail with a meaningful message on the configuration stage
  2. or fail with a meaningful message on the first request
  3. or successfully execute all 1000 requests.

Actual Behavior

Executes the first ~30 requests successfully. Hangs and fails after that with:

23:49:49.698 [Workflow Executor taskQueue="test-workflow", namespace="default": 1] ERROR i.t.internal.worker.PollerOptions - uncaught exception
java.lang.RuntimeException: Failure processing workflow task. WorkflowId=304a033c-e0e6-49ec-9c2d-024c01d533a7, RunId=c63116f5-330a-4e35-9f04-6cdd344acea8
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:324)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:273)
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:79)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArithmeticException: Duration of this magnitude not supported
	at com.uber.m3.util.Duration.checkMultiplicationLongOverflow(Duration.java:340)
	at com.uber.m3.util.Duration.ofMillis(Duration.java:109)
	at io.temporal.internal.common.ProtobufTimeUtils.ToM3Duration(ProtobufTimeUtils.java:49)
	at io.temporal.internal.replay.ReplayWorkflowExecutor.completeWorkflow(ReplayWorkflowExecutor.java:329)
	at io.temporal.internal.replay.ReplayWorkflowExecutor.mayBeCompleteWorkflow(ReplayWorkflowExecutor.java:302)
	at io.temporal.internal.replay.ReplayWorkflowExecutor.handleWorkflowTaskImpl(ReplayWorkflowExecutor.java:468)
	at io.temporal.internal.replay.ReplayWorkflowExecutor.handleWorkflowTask(ReplayWorkflowExecutor.java:403)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:168)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowTaskHandler.java:145)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:104)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:301)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:273)
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)

Steps to Reproduce the Problem

Please find the full unit test here: https://github.com/Spikhalskiy/java-sdk/commit/821f735afa696d9abc350cc9fb1b12eef1e16ae7. Which hangs after about 30-35 requests with the exception in logs. Also, note that uncommenting https://github.com/Spikhalskiy/java-sdk/commit/821f735afa696d9abc350cc9fb1b12eef1e16ae7#diff-1661ff603d3c1aaff5d01537e8f938b7R79 fixes the unit test.

Specifications

  • Version: at least 0.28 and current master [https://github.com/temporalio/java-sdk/commit/35abfd425c13e86f16b81e26c85c07b17ee3448f] is affected

Spikhalskiy avatar Aug 08 '20 03:08 Spikhalskiy

The root cause is the unit test time skipping logic together with a large workflow timeout. The time skipping automatically moves time to the next task in a timer queue. The problem is that it does it for the workflow timeout after workflow completion. When running 1k workflows it jumps 10 years ahead after every execution. And apparently M3 metrics are not ready to deal with 10k year durations.

I'm looking into the fix. Probably we should ignore time skipping for tasks that belong to completed workflows.

mfateev avatar Aug 21 '20 19:08 mfateev

There is a report that the similar exception appears in java-sdk 1.8.0 using test-server.

java.lang.ArithmeticException: Duration of this magnitude not supported
	at com.uber.m3.util.Duration.checkMultiplicationLongOverflow(Duration.java:340)
	at com.uber.m3.util.Duration.ofMillis(Duration.java:109)
	at io.temporal.internal.common.ProtobufTimeUtils.toM3Duration(ProtobufTimeUtils.java:49)
	at io.temporal.internal.replay.ReplayWorkflowExecutor.completeWorkflow(ReplayWorkflowExecutor.java:109)
	at io.temporal.internal.replay.ReplayWorkflowExecutor.eventLoop(ReplayWorkflowExecutor.java:84)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler$StatesMachinesCallbackImpl.eventLoop(ReplayWorkflowRunTaskHandler.java:321)
	at io.temporal.internal.statemachines.WorkflowStateMachines.eventLoop(WorkflowStateMachines.java:463)
	at io.temporal.internal.statemachines.WorkflowStateMachines.lambda$scheduleLocalActivityTask$8d414bd$1(WorkflowStateMachines.java:790)
	at io.temporal.internal.statemachines.LocalActivityStateMachine.notifyResultFromResponse(LocalActivityStateMachine.java:326)
	at io.temporal.internal.statemachines.FixedTransitionAction.apply(FixedTransitionAction.java:45)
	at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:147)
	... 17 common frames omitted
Wrapped by: java.lang.RuntimeException: LocalActivity: failure executing MARKER_COMMAND_CREATED->RECORD_MARKER, transition history is [CREATED->CHECK_EXECUTION_STATE, EXECUTING->SCHEDULE, REQUEST_PREPARED->MARK_AS_SENT, REQUEST_SENT->HANDLE_RESULT]
	at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:151)
	at io.temporal.internal.statemachines.StateMachine.handleCommand(StateMachine.java:111)
	at io.temporal.internal.statemachines.EntityStateMachineBase.handleCommand(EntityStateMachineBase.java:55)
	at io.temporal.internal.statemachines.CancellableCommand.handleCommand(CancellableCommand.java:63)
	at io.temporal.internal.statemachines.WorkflowStateMachines.prepareImpl(WorkflowStateMachines.java:413)
	at io.temporal.internal.statemachines.WorkflowStateMachines.prepareCommands(WorkflowStateMachines.java:396)
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleLocalActivityCompletion(WorkflowStateMachines.java:768)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.waitAndProcessLocalActivityCompletion(ReplayWorkflowRunTaskHandler.java:307)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.processLocalActivityRequests(ReplayWorkflowRunTaskHandler.java:282)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:146)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:122)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:97)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:241)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:199)
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Spikhalskiy avatar Feb 11 '22 23:02 Spikhalskiy

No known reproduction of this issue for more than a year. Reopen if there is a new report.

Spikhalskiy avatar Apr 18 '23 17:04 Spikhalskiy