Endless processing state machine error loop
Describe the bug
Extracted from https://github.com/camunda/zeebe/issues/14929 and https://github.com/camunda/zeebe/issues/16104
When such cases produce an error, the processing state machine tries to handle it. Part of that error handling is responding to the user with a rejection. In this case, however, we failed to serialize the record when appending the rejection, which caused a new error. That error was then handled the same way, so the machine simply loops forever.
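A minimal sketch of the failure mode, using hypothetical names rather than the actual ProcessingStateMachine code:

// Hypothetical, simplified sketch of the failure mode; not the real code.
final class ErrorLoopSketch {
  void handleError(Throwable error) {
    while (true) {
      try {
        // Part of the error handling: respond to the user with a rejection.
        appendRejection(error);
        return; // never reached here, because serialization keeps failing
      } catch (final Exception serializationError) {
        error = serializationError; // handled the same way -> loops forever
      }
    }
  }

  void appendRejection(final Throwable error) throws Exception {
    // Stand-in for appending the rejection record, whose serialization fails.
    throw new Exception("failed to serialize the rejection record");
  }
}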
To Reproduce
/*
 * Copyright Camunda Services GmbH and/or licensed to Camunda Services GmbH under
 * one or more contributor license agreements. See the NOTICE file distributed
 * with this work for additional information regarding copyright ownership.
 * Licensed under the Zeebe Community License 1.1. You may not use this file
 * except in compliance with the Zeebe Community License 1.1.
 */
package io.camunda.zeebe.it.client.command;

import static org.assertj.core.api.Assertions.assertThat;

import io.camunda.zeebe.qa.util.cluster.TestStandaloneBroker;
import io.camunda.zeebe.qa.util.junit.ZeebeIntegration;
import io.camunda.zeebe.qa.util.junit.ZeebeIntegration.TestZeebe;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.concurrent.Future;
import org.junit.jupiter.api.Test;

@ZeebeIntegration
public class EndlessLoopTest {
  @TestZeebe private final TestStandaloneBroker broker = new TestStandaloneBroker();

  // DMN resource whose deployment fails, triggering the rejection path described above.
  private final String dmn =
      """
      <?xml version="1.0" encoding="UTF-8"?>
      <definitions xmlns="https://www.omg.org/spec/DMN/20191111/MODEL/" id="_6f87b3aa-d613-4c41-93ad-832f021c8318" name="definitions" namespace="http://camunda.org/schema/1.0/dmn">
        <decision id="_51711_44808585-3491-42a8-87ea-3c4fcda46eff" name="">
          <decisionTable id="_07c6fdc3-7bdf-459a-a016-2ae12e474bd2">
            <input id="_f00d0717-d08a-4918-ad51-568788e92038" label="Percent Responsible (Loan Roles)">
              <inputExpression id="_0b922041-84c3-47bf-b94c-1dd423c394ae" typeRef="number">
                <text>proposedLoan.loanRoles.percentResponsible</text>
              </inputExpression>
            </input>
            <output id="_b304292d-98e5-4dde-8fbd-0de1981c97ea" label="" name="out" typeRef="string">
              <outputValues id="UnaryTests_1yu7moy">
                <text>"Approve","Decline","Review"</text>
              </outputValues>
            </output>
            <rule id="DecisionRule_1j6jzzn">
              <inputEntry id="UnaryTests_105fhm8">
                <text></text>
              </inputEntry>
              <outputEntry id="LiteralExpression_1ucr6zl">
                <text>"Approve"</text>
              </outputEntry>
            </rule>
          </decisionTable>
        </decision>
      </definitions>
      """;

  @Test
  void shouldNotLoopEndlessly() {
    // given
    try (final var client = broker.newClientBuilder().build()) {
      // when
      final Future<?> result =
          client
              .newDeployResourceCommand()
              .addResourceBytes(dmn.getBytes(StandardCharsets.UTF_8), "bork.dmn")
              .requestTimeout(Duration.ofSeconds(30))
              .send();

      // then - expand the assert to check that it's rejected appropriately
      // within a reasonable amount of time
      assertThat(result).failsWithin(Duration.ofSeconds(10)).withThrowableThat();
    }
  }
}
Expected behavior
On error, attempt to handle it via the processor which caused it, then (see the sketch after this list):
- Write a rejection and send it
- If this causes an error, then handle it by making it more generic: omit the record value (unfortunately) and mark the record which caused it as failed.
- Send a generic error back to the gateway/client.
- If this causes an error, mark the record as failed (so we can skip it) but omit sending anything.
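A rough sketch of this tiered fallback, using hypothetical helper names rather than the real ProcessingStateMachine API:

// Sketch only: each tier degrades to something less likely to fail.
final class TieredErrorHandling {
  void onProcessingError(final Object command, final Throwable error) {
    try {
      // Tier 1: the processor that caused the error writes and sends a rejection.
      writeAndSendRejection(command, error);
    } catch (final Exception e1) {
      try {
        // Tier 2: more generic - omit the record value, mark the command as
        // failed, and send a generic error back to the gateway/client.
        writeValuelessRejection(command);
        sendGenericError(command);
      } catch (final Exception e2) {
        // Tier 3: mark the record as failed (so it can be skipped), send nothing.
        markAsFailed(command);
      }
    }
  }

  // Stubs standing in for the real writers and response logic.
  void writeAndSendRejection(Object command, Throwable error) throws Exception {}
  void writeValuelessRejection(Object command) throws Exception {}
  void sendGenericError(Object command) throws Exception {}
  void markAsFailed(Object command) {}
}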
This affects all supported versions as far as I can tell.
Please improve the observability of this feature/loop.
It is a bit tricky to handle this correctly and to cover most cases. Here is what I tried to do (POC).
If the command is a user command:
- The Engine tries to handle errors thrown in onProcessingError by ignoring everything else and writing a rejection record with no value. This prevents errors caused by writing the value or by exceeding the record size, while still writing the detailed error message. https://github.com/camunda/zeebe/commit/244eff4966e5162ffa21cde05d01eaa9c1bea9e5
- If that also results in an error, the ProcessingStateMachine writes a rejection record with no value and a fixed error message with no details (sketched below). https://github.com/camunda/zeebe/commit/be9427671ee7ff57ab1b527993d5af8675f1b4aa
- If that also fails, we are back in the error loop. Since it is a single rejection record with no data, the chances of it failing are very low. If it does fail, we probably have a bigger problem, and it would probably be OK to stay in the error loop.
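As a sketch, the guard in the second commit could be imagined roughly like this; the names and structure here are hypothetical:

// Hypothetical sketch of a retry-count guard; the actual commit may differ.
final class GuardedErrorHandling {
  private static final int MAX_ERROR_HANDLING_ATTEMPTS = 3; // the "magic number"
  private int attempts;

  void onError(final Object command, final Throwable error) {
    attempts++;
    if (attempts >= MAX_ERROR_HANDLING_ATTEMPTS) {
      // Last resort: value-less rejection with a fixed, detail-free message.
      writeValuelessRejection(command);
    } else {
      writeDetailedRejection(command, error); // may throw and re-enter onError
    }
  }

  // Stubs standing in for the real rejection writers.
  void writeDetailedRejection(Object command, Throwable error) {}
  void writeValuelessRejection(Object command) {}
}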
The POC branch is based on a commit before the bug fix, so we can verify that the above changes prevent the endless loop with the above reproducer test.
Generally looks good :+1: I wish there were an easier way to detect the layers of the error handling loop than a random threshold :thinking:
> a random threshold
Do you mean the magic number 3 in https://github.com/camunda/zeebe/commit/be9427671ee7ff57ab1b527993d5af8675f1b4aa ?
I'm not sure if it has to be 3 or more. I think the second iteration of the error loop would work as well, but I have to look into that again to understand how onError is used overall.
I'm not sure how to add layers to the error handling loop. onError can be called at different phases of processing, and the recursive nature makes it difficult to understand in which step it failed. If you have some ideas, we can look at it together and see how to improve it.
I have now changed it to a more "phased" error handling: https://github.com/camunda/zeebe/commit/314137c9141bc5930d6bc2a560cb25d8749661ac. It is not much different from the previous version: instead of just checking the retry number, we have an enum to indicate the phase of the error loop, which makes it easier to understand.
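For illustration, the phase enum could look something like this; the names here are hypothetical and may not match the commit:

// Hypothetical phases of the error handling loop.
enum ErrorHandlingPhase {
  PROCESSING,               // error occurred during normal processing
  PROCESSOR_ERROR_HANDLING, // error while the processor handled a previous error
  VALUELESS_REJECTION,      // error while writing the value-less rejection
  EXHAUSTED;                // all fallbacks failed

  // Advancing the phase on each error makes explicit which step failed,
  // instead of inferring it from an opaque retry count.
  ErrorHandlingPhase next() {
    return switch (this) {
      case PROCESSING -> PROCESSOR_ERROR_HANDLING;
      case PROCESSOR_ERROR_HANDLING -> VALUELESS_REJECTION;
      default -> EXHAUSTED;
    };
  }
}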
I also moved the extra handling that was in the Engine, which I realized could be incorrect: the data changes are not rolled back when the error is handled in the Engine, as I did in https://github.com/camunda/zeebe/commit/244eff4966e5162ffa21cde05d01eaa9c1bea9e5.
Next step is figuring out a way to unit test this. The POC branch is based on a branch with the original bug, but we cannot use it for testing anymore as the bug is fixed.
Also one open question: What should we do if everything we tried failed?
- Should we just skip the user command? Can we do that? Would that result in violating stream processor assumptions or guarantees?
- Or fall back to endless loop?
@npepinpe
Skipping would be amazing, but I think automatically skipping is possibly the wrong thing to do. It may not be entirely visible, and it may break multiple guarantees as to engine semantics. That said, skipping on demand may be very useful in certain situations. We have another issue to allow skipping specific records: https://github.com/camunda/zeebe/issues/16177
So we can consider skipping out of scope. I think if we fail entirely, we may want to fall back to the endless loop, and then we should make sure that it is clearly visible so as to speed up diagnosis. That said, for certain errors we already know lead to endless loops, we may want to just set the partition status to dead? :thinking:
@deepthidevaki we had a case just now with 8.4.3 where we still entered the endless loop. This error seems to be related, as a symptom: https://console.cloud.google.com/errors/detail/CI7Z3fDnpK-ZOw;service=zeebe;time=P7D?project=camunda-saas-prod
Can you look into why we entered the error loop, and whether that makes sense, on Monday? Then just close this issue if it's expected or if there's nothing we can do.
The second issue mentioned in https://github.com/camunda/zeebe/issues/16429#issuecomment-1959223921 should be fixed to ensure that we don't end up in an error loop.