sdk-java icon indicating copy to clipboard operation
sdk-java copied to clipboard

Exception in one of several "parallel" workflow async functions leads to non-deterministic execution

Open Spikhalskiy opened this issue 3 years ago • 0 comments

Expected Behavior

Temporal JavaSDK explicitly supports the idea of fixing a bug leading to an exception in the workflow implementation. Workflow can throw a non ApplicationFailure exception and such an exception can be fixed in code, the worker can be redeployed and the workflow should be able to successfully continue the existing execution without resetting the history.

The following code simulates a situation when a second asynchronously triggered procedure throws an exception on the first execution and the problem is getting "fixed" on the second attempt. You would expect that this code will be able to successfully finish on a second attempt.

    public String execute(String input) {
      List<Promise<Void>> processingPromises = new ArrayList<>();
      processingPromises.add(
          Async.procedure(this::procedure, "input1", false, Workflow.isReplaying()));
      processingPromises.add(
          Async.procedure(this::procedure, "input2", true, Workflow.isReplaying()));
      processingPromises.forEach(Promise::get);
      return "done";
    }

    private void procedure(String obj, boolean failingOnTheFirstRun, boolean retrying) {
      ActivityAlphabet activity = Workflow.newActivityStub(ActivityAlphabet.class);
      
      // only the second Async procedure fails and only during the initial run 
      if (failingOnTheFirstRun && !retrying) {
        // emulating fail with some implementation bug
        throw new NullPointerException();
      }
      activity.doA(obj);
      activity.doB(obj);
    }

Actual Behavior

Non-deterministic exception:

Caused by: io.temporal.worker.NonDeterministicException: Failure handling event 11 of type 'EVENT_TYPE_ACTIVITY_TASK_SCHEDULED' during replay. Command COMMAND_TYPE_SCHEDULE_ACTIVITY_TASK doesn't match event EVENT_TYPE_ACTIVITY_TASK_SCHEDULED with EventId=11 on check activityType with an expected value name: "DoA"
 and an actual value name: "DoB"
. {PreviousStartedEventId=9, workflowTaskStartedEventId=18, Currently Processing StartedEventId=9}
	at io.temporal.internal.statemachines.WorkflowStateMachines.assertMatch(WorkflowStateMachines.java:887)
	at io.temporal.internal.statemachines.WorkflowStateMachines.validateCommand(WorkflowStateMachines.java:817)
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleCommandEvent(WorkflowStateMachines.java:338)
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleSingleEvent(WorkflowStateMachines.java:229)
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:200)
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:176)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:176)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:145)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:127)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:102)

Test reproducing the problem: https://github.com/Spikhalskiy/java-sdk/commit/43ce60f9518fc1acb871a73aafab6df2a57749df#diff-96be05397e0cfb9f8d1bdc02587f70a34e65e1b42b8e908964162f39d27f249aR46

Root cause analysis

  1. Async procedure can't propagate the exception on the main worklow thead when it happens, the exception "stays in the Promise" until it's checked.
  2. Workflow code is awaiting for the first promise from the non-failing procedure before starting the await on the second failing promise.
  3. Failure of the second procedure is getting ignored for some time and the first procedure gets a chance to advance and be fully complete, produce a couple of completed activity executions before we realize that the second procedure is failed already.
  4. These "unlawfully" advanced first procedure and it's scheduled activities recorded in the history create a history that is incompatible with "fixed" code and can't be replayed by a "fixed" version.

The reason this happens is eager evaluation. As soon as the promises are created they begin evaluating. Now imagine the following sequence of events:

First execution

  1. Promises 1&2 start running
  2. Promise 1 schedules A
  3. Promise 2 throws, but exception is not propagated up yet
  4. Promise 1 schedules B
  5. We send 1A and 1B to server Next go-round we blow up w/ the exception

Expected fixed execution

  1. Promises 1&2 start running
  2. Promise 1 schedules A
  3. Promise 2 schedules A because it is fixed now //<-- Failing here now
  4. Promise 1 schedules B
  5. Promise 2 schedules B

Oops now 2A is out-of-order according to existing history. 2B would be OK since there was nothing before it besides the task failure in history. So, we are failing after step 3 with non-deterministic error because the command 2A doesn't match the expected event 1B from the history.

Workaround

  1. Using Promise.allOf to wait on all the promises doesn't allow the exception to be "ignored" because we wait only on the one promise at a time. Has to be implemented correctly in this way before the problem actually happened and the incorrect history is already produced.
  2. Promoting all exceptions happening in Async functions/procedures to Errors solves this by immediately failing the workflow task. This has to be done before the workflow execution and before the incompatible history is already produced.

The only solution if the problem already manifested itself is history reset + one of the workarounds to prevent it from happening again.

Notes about the original user report

Investigation triggered by: https://community.temporal.io/t/verifying-error-handling-and-fixing-buggy-code-within-async-procedure-promises/3391/10

Which leaded to creation of simplified flaky reproduction https://github.com/Spikhalskiy/java-sdk/commit/eb6b44142fa5f2fb31809dace8aedb435307430b that flakes about 5% of the time and requires WorkflowImplZRetriesTestCopy1 to be executed after WorkflowImplRetriesTest to increase the chance of the problem happening.

The original flaky reproduction may be very hard to follow and link to the described issue, because the problem occurs there only under a specific ordering of the activity executions and completions that leads to an effective "reordering" of asynchronous procedures execution.

Spikhalskiy avatar Nov 28 '21 20:11 Spikhalskiy