
Endless processing state machine error loop

npepinpe opened this issue 1 year ago • 9 comments

Describe the bug

Extracted from https://github.com/camunda/zeebe/issues/14929 and https://github.com/camunda/zeebe/issues/16104

When such cases produce an error, the processing state machine tries to handle it. Part of the error handling is responding to the user with a rejection. In this case, however, serializing the record while appending the rejection failed, which caused a new error. That error is then handled the same way, so the machine simply loops forever.
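
The loop can be sketched in isolation. The following is a minimal simulation (not Zeebe code; all names are made up for illustration): the error handler appends a rejection, but appending itself throws, which re-enters the same handler with no exit condition.

```java
public class ErrorLoopSketch {

  /**
   * Simulates error handling where appending the rejection record always
   * fails to serialize. Returns the number of handling attempts made before
   * the artificial safety cap stops the loop; without the cap it never ends.
   */
  static int simulate(int cap) {
    int attempts = 0;
    boolean pendingError = true;
    while (pendingError && attempts < cap) {
      attempts++;
      try {
        appendRejection(); // fails to serialize the record...
        pendingError = false;
      } catch (RuntimeException e) {
        // ...which is treated as a new processing error, so we loop again
      }
    }
    return attempts;
  }

  static void appendRejection() {
    throw new RuntimeException("Failed to serialize rejection record");
  }

  public static void main(String[] args) {
    // Every attempt fails, so the simulation only stops at the cap.
    System.out.println("attempts before cap: " + simulate(5));
  }
}
```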

To Reproduce

/*
 * Copyright Camunda Services GmbH and/or licensed to Camunda Services GmbH under
 * one or more contributor license agreements. See the NOTICE file distributed
 * with this work for additional information regarding copyright ownership.
 * Licensed under the Zeebe Community License 1.1. You may not use this file
 * except in compliance with the Zeebe Community License 1.1.
 */
package io.camunda.zeebe.it.client.command;

import static org.assertj.core.api.Assertions.assertThat;

import io.camunda.zeebe.qa.util.cluster.TestStandaloneBroker;
import io.camunda.zeebe.qa.util.junit.ZeebeIntegration;
import io.camunda.zeebe.qa.util.junit.ZeebeIntegration.TestZeebe;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.concurrent.Future;
import org.junit.jupiter.api.Test;

@ZeebeIntegration
public class EndlessLoopTest {
  @TestZeebe private final TestStandaloneBroker broker = new TestStandaloneBroker();

  private final String dmn =
      """
<?xml version="1.0" encoding="UTF-8"?>
<definitions xmlns="https://www.omg.org/spec/DMN/20191111/MODEL/" id="_6f87b3aa-d613-4c41-93ad-832f021c8318" name="definitions" namespace="http://camunda.org/schema/1.0/dmn">
  <decision id="_51711_44808585-3491-42a8-87ea-3c4fcda46eff" name="">
    <decisionTable id="_07c6fdc3-7bdf-459a-a016-2ae12e474bd2">
      <input id="_f00d0717-d08a-4918-ad51-568788e92038" label="Percent Responsible (Loan Roles)">
        <inputExpression id="_0b922041-84c3-47bf-b94c-1dd423c394ae" typeRef="number">
          <text>proposedLoan.loanRoles.percentResponsible</text>
        </inputExpression>
      </input>
      <output id="_b304292d-98e5-4dde-8fbd-0de1981c97ea" label="" name="out" typeRef="string">
        <outputValues id="UnaryTests_1yu7moy">
          <text>"Approve","Decline","Review"</text>
        </outputValues>
      </output>
      <rule id="DecisionRule_1j6jzzn">
        <inputEntry id="UnaryTests_105fhm8">
          <text></text>
        </inputEntry>
        <outputEntry id="LiteralExpression_1ucr6zl">
          <text>"Approve"</text>
        </outputEntry>
      </rule>
    </decisionTable>
  </decision>
</definitions>
""";

  @Test
  void shouldNotLoopEndlessly() {
    // given
    try (final var client = broker.newClientBuilder().build()) {
      // when
      final Future<?> result =
          client
              .newDeployResourceCommand()
              .addResourceBytes(dmn.getBytes(StandardCharsets.UTF_8), "bork.dmn")
              .requestTimeout(Duration.ofSeconds(30))
              .send();

      // then - expand the assert to check that it's rejected appropriately within reasonable amount
      // of time
      assertThat(result).failsWithin(Duration.ofSeconds(10)).withThrowableThat();
    }
  }
}

Expected behavior

On error, attempt to handle it via the processor which caused it, then:

  1. Write a rejection and send it
  2. If this causes an error, then handle it by making it more generic: omit the record value (unfortunately) and mark the record which caused it as failed.
  3. Send a generic error back to the gateway/client.
  4. If this causes an error, mark the record as failed (so we can skip it) but omit sending anything.
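
The escalation above can be condensed into a small sketch (hypothetical code, not the Zeebe implementation; the three tiers below collapse steps 2 and 3 into one level, and every name is invented for illustration):

```java
public class EscalationSketch {

  /**
   * Walks the fallback ladder: detailed rejection first, then a generic
   * error, and as a last resort only marking the record as failed without
   * responding. Returns which level succeeded.
   */
  static String handle(boolean rejectionFails, boolean genericFails) {
    try {
      writeRejection(rejectionFails);
      return "rejection";
    } catch (RuntimeException e1) {
      try {
        sendGenericError(genericFails);
        return "generic-error";
      } catch (RuntimeException e2) {
        return "mark-failed-only"; // skip responding entirely
      }
    }
  }

  static void writeRejection(boolean fail) {
    if (fail) throw new RuntimeException("rejection failed to serialize");
  }

  static void sendGenericError(boolean fail) {
    if (fail) throw new RuntimeException("even the generic error failed");
  }

  public static void main(String[] args) {
    System.out.println(handle(false, false)); // rejection
    System.out.println(handle(true, false));  // generic-error
    System.out.println(handle(true, true));   // mark-failed-only
  }
}
```

The key property is that each level strips more detail from the response, so each level is strictly less likely to fail than the one before it.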

This affects all supported versions as far as I can tell.

npepinpe avatar Jan 26 '24 12:01 npepinpe

Please improve the observability of this feature/loop.

npepinpe avatar Jan 29 '24 10:01 npepinpe

It is a bit tricky to handle this correctly and to cover most cases. Here is what I tried to do (POC).

If the command is a user command:

  1. The Engine tries to handle errors thrown in onProcessingError by ignoring everything else and writing a rejection record with no value, but still with the detailed error message. This prevents errors caused by writing the value or by an oversized record. https://github.com/camunda/zeebe/commit/244eff4966e5162ffa21cde05d01eaa9c1bea9e5
  2. If that also results in an error, the ProcessingStateMachine writes a rejection record with no value and a fixed error message with no details. https://github.com/camunda/zeebe/commit/be9427671ee7ff57ab1b527993d5af8675f1b4aa
  3. If that also fails, we are back in the error loop. Since it is a single rejection record with no data, the chance of it failing is very low. If it does fail, we likely have a bigger problem, and it would probably be okay to stay in the error loop.

The POC branch is based on a commit before the bug fix, so we can verify that the above changes prevent the endless loop with the reproducer test above.
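
A retry-threshold approach like the one in the POC could look roughly like this (an illustrative sketch only; the threshold of 3 mirrors the magic number in the linked commit, everything else is invented):

```java
public class RetryThresholdSketch {

  // Hypothetical cutoff: after this many consecutive failures in the error
  // loop, stop including details and fall back to a fixed, value-less message.
  static final int MAX_DETAILED_ATTEMPTS = 3;

  /** Returns the rejection message to write for the given attempt number. */
  static String buildRejection(int attempt, String detailedMessage) {
    return attempt < MAX_DETAILED_ATTEMPTS
        ? detailedMessage
        : "Processing failed; details omitted";
  }

  public static void main(String[] args) {
    System.out.println(buildRejection(1, "serialization failed: value too large"));
    System.out.println(buildRejection(3, "serialization failed: value too large"));
  }
}
```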

deepthidevaki avatar Feb 01 '24 09:02 deepthidevaki

Generally looks good :+1: I wish there was an easier way to detect the layers of the error handling loop than a random threshold :thinking:

npepinpe avatar Feb 02 '24 15:02 npepinpe

a random threshold

Do you mean the magic number 3 in https://github.com/camunda/zeebe/commit/be9427671ee7ff57ab1b527993d5af8675f1b4aa ?

I'm not sure it has to be 3 or more; I think the second iteration of the error loop might work as well. I have to look into that again to understand how onError is used overall.

I'm also not sure how to add layers to the error handling loop. onError can be called at different phases of processing, and the recursive nature makes it difficult to understand at which step it failed. If you have some ideas, we can look at it together and see how to improve it.

deepthidevaki avatar Feb 02 '24 15:02 deepthidevaki

I have now changed it to a more "phased" error handling: https://github.com/camunda/zeebe/commit/314137c9141bc5930d6bc2a560cb25d8749661ac. It is not much different from before: instead of just checking the retry number, we use an enum to indicate the phase of the error loop, which is easier to understand.
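
The enum-phase idea could be sketched as follows (a hedged illustration, not the code in the linked commit; the phase names are hypothetical): each re-entry into onError advances to a strictly more conservative phase, so progress through the ladder is explicit rather than inferred from a retry counter.

```java
public class PhasedErrorHandling {

  // Each phase is one level of the error-handling ladder; EXHAUSTED means
  // every fallback has been tried.
  enum ErrorPhase { PROCESSING, DETAILED_REJECTION, MINIMAL_REJECTION, EXHAUSTED }

  /** Advances to the next, more conservative phase on a new error. */
  static ErrorPhase next(ErrorPhase current) {
    return switch (current) {
      case PROCESSING -> ErrorPhase.DETAILED_REJECTION;
      case DETAILED_REJECTION -> ErrorPhase.MINIMAL_REJECTION;
      default -> ErrorPhase.EXHAUSTED;
    };
  }

  public static void main(String[] args) {
    ErrorPhase p = ErrorPhase.PROCESSING;
    while (p != ErrorPhase.EXHAUSTED) {
      p = next(p); // errors keep occurring, so we keep escalating
      System.out.println(p);
    }
  }
}
```

Because the transitions are total and terminate in EXHAUSTED, it is easy to see at a glance which step failed, which a bare retry count does not show.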

I also moved the extra handling out of the Engine, because I realized it could be incorrect: the data changes are not rolled back if the error is handled inside the Engine, as I did in https://github.com/camunda/zeebe/commit/244eff4966e5162ffa21cde05d01eaa9c1bea9e5.

The next step is figuring out a way to unit test this. The POC branch is based on a branch with the original bug, but we can no longer use it for testing because the bug has been fixed.

deepthidevaki avatar Feb 08 '24 14:02 deepthidevaki

Also one open question: What should we do if everything we tried failed?

  • Should we just skip the user command? Can we do that? Would that result in violating stream processor assumptions or guarantees?
  • Or fall back to endless loop?

@npepinpe

deepthidevaki avatar Feb 08 '24 14:02 deepthidevaki

Skipping would be amazing, but I think automatically skipping is possibly the wrong thing to do. It may not be entirely visible, and it may break multiple guarantees as to engine semantics. That said, skipping on demand may be very useful in certain situations. We have another issue to allow skipping specific records: https://github.com/camunda/zeebe/issues/16177

So we can consider skipping out of scope. I think if we fail entirely, we may want to fall back to the endless loop, and then we should make sure it is clearly visible so as to speed up diagnosis. That said, for errors we already know lead to endless loops, we may want to just set the partition status to dead? :thinking:

npepinpe avatar Feb 08 '24 16:02 npepinpe

@deepthidevaki we had a case right now with 8.4.3 where we still entered the endless loop. It seems related to this error as symptom: https://console.cloud.google.com/errors/detail/CI7Z3fDnpK-ZOw;service=zeebe;time=P7D?project=camunda-saas-prod

Can you look on Monday into why we entered the error loop, and whether it makes sense? Then just close this issue if it's expected or if there's nothing we can do.

npepinpe avatar Feb 16 '24 17:02 npepinpe

The second issue mentioned in https://github.com/camunda/zeebe/issues/16429#issuecomment-1959223921 should be fixed to ensure that we don't end up in an error loop.

deepthidevaki avatar Feb 22 '24 15:02 deepthidevaki