camel-quarkus icon indicating copy to clipboard operation
camel-quarkus copied to clipboard

Intermittent failure of FileTests

Open jamesnetherton opened this issue 3 years ago • 22 comments

Regular offender for CI failures:

2022-03-01T12:54:16.4609416Z org.awaitility.core.ConditionTimeoutException: org.apache.camel.quarkus.component.file.it.FileTest expected "Hello1" but was "" within 10 seconds.
2022-03-01T12:54:16.4610283Z 	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:164)
2022-03-01T12:54:16.4611019Z 	at org.awaitility.core.AbstractHamcrestCondition.await(AbstractHamcrestCondition.java:86)
2022-03-01T12:54:16.4612118Z 	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:939)
2022-03-01T12:54:16.4612630Z 	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:645)
2022-03-01T12:54:16.4613190Z 	at org.apache.camel.quarkus.component.file.it.FileTest.idempotent(FileTest.java:137)
2022-03-01T12:54:16.4613758Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2022-03-01T12:54:16.4614376Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)

jamesnetherton avatar Mar 02 '22 09:03 jamesnetherton

The same kind of issue happens in FileTest.filter:

org.awaitility.core.ConditionTimeoutException: org.apache.camel.quarkus.component.file.it.FileTest expected "Hello2" but was "" within 10 seconds.
	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:164)
	at org.awaitility.core.AbstractHamcrestCondition.await(AbstractHamcrestCondition.java:86)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:939)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:645)
	at org.apache.camel.quarkus.component.file.it.FileTest.filter(FileTest.java:170)

After investigations, the race condition could be made more probable with manual changes below:

         String result = mockEndpoint.getExchanges().stream().map(e -> e.getIn().getBody(String.class))
                 .collect(Collectors.joining(SEPARATOR));
+        Log.info("Calling getFromMock/" + mockId + " result = " + result.length());
 
-        mockEndpoint.reset();
+        Thread.sleep(600);
+        //mockEndpoint.reset();
 
         return result;
     }
diff --git a/integration-tests/file/src/main/java/org/apache/camel/quarkus/component/file/it/FileRoutes.java b/integration-tests/file/src/main/java/org/apache/camel/quarkus/component/file/it/FileRoutes.java
index 891339c..b03ab4a 100644
--- a/integration-tests/file/src/main/java/org/apache/camel/quarkus/component/file/it/FileRoutes.java
+++ b/integration-tests/file/src/main/java/org/apache/camel/quarkus/component/file/it/FileRoutes.java
@@ -67,7 +67,7 @@ public class FileRoutes extends RouteBuilder {
                 .convertBodyTo(String.class).to("mock:idempotent");
 
         bindToRegistry("myFilter", new MyFileFilter<>());
-        from(("file://target/filter?initialDelay=0&delay=10&filter=#myFilter"))
+        from(("file://target/filter?initialDelay=2500&delay=10&filter=#myFilter"))

At the end of the day, it could happen that result has length 0, right after the mock receive an exchange, but then mockEndpoint.reset() is called, thus clearing the received exchange from the mock. As such, the race condition triggers a missed exchange that break the testing logic.

aldettinger avatar Mar 07 '22 10:03 aldettinger

It seems to still be there after 2.8.0-M1.

At least, we have seen it once in a Windows build:

org.awaitility.core.ConditionTimeoutException: org.apache.camel.quarkus.component.file.it.FileTest expected "Hello2" but was "" within 10 seconds.
	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:164)
	at org.awaitility.core.AbstractHamcrestCondition.await(AbstractHamcrestCondition.java:86)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:939)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:645)
	at org.apache.camel.quarkus.component.file.it.FileTest.filter(FileTest.java:157)

Let records few occurences to see how frequent it is and whether it still happen on Linux or not.

aldettinger avatar Mar 22 '22 08:03 aldettinger

We had another occurence of this in https://github.com/apache/camel-quarkus/runs/5790386191?check_suite_focus=true :

2022-04-01T15:54:17.3802934Z org.awaitility.core.ConditionTimeoutException: org.apache.camel.quarkus.component.file.it.FileTest expected "Hello2" but was "" within 10 seconds.
2022-04-01T15:54:17.3803866Z 	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
2022-04-01T15:54:17.3804696Z 	at org.awaitility.core.AbstractHamcrestCondition.await(AbstractHamcrestCondition.java:86)
2022-04-01T15:54:17.3805481Z 	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:985)
2022-04-01T15:54:17.3806537Z 	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:691)
2022-04-01T15:54:17.3807215Z 	at org.apache.camel.quarkus.component.file.it.FileTest.filter(FileTest.java:157)

This time, it was not Linux platform with JDK version 17. Chances are that this is an infrequent flaky test on all platforms, jdk versions.

aldettinger avatar Apr 06 '22 09:04 aldettinger

Caught another occurence in integration-tests-alternative-jdk, this time in FileTest.idempotent:

2022-05-05T17:20:42.1752186Z org.awaitility.core.ConditionTimeoutException: org.apache.camel.quarkus.component.file.it.FileTest expected "Hello1" but was "" within 10 seconds.
2022-05-05T17:20:42.1752848Z 	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
2022-05-05T17:20:42.1760375Z 	at org.awaitility.core.AbstractHamcrestCondition.await(AbstractHamcrestCondition.java:86)
2022-05-05T17:20:42.1761109Z 	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:985)
2022-05-05T17:20:42.1769491Z 	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:691)
2022-05-05T17:20:42.1770136Z 	at org.apache.camel.quarkus.component.file.it.FileTest.idempotent(FileTest.java:118)

aldettinger avatar May 06 '22 07:05 aldettinger

At this stage, we've experienced this flakiness once on alternative-platform, few times on alternative-jdks and never on regular itests. And as of now, the failing test is either idempotent or filter.

aldettinger avatar May 12 '22 12:05 aldettinger

This PR introduces extra logs to attempt catching more information.

The flaky tests caught, idempotent and filter, are both using getFromMock. However, only idempotent is using resetMock, so we may be facing a new issue. On next occurence, it would be of special interest to check what files are really consumed by the filter and idempotent routes.

aldettinger avatar May 13 '22 08:05 aldettinger

Seems like an heisenbug. Let's wait few more days, but as of now FileTests is not flaky anymore with those extra logs. If not conclusive, the next track would be to try setting quarkus.vertx.worker-pool-size=1 to check whether running all HTTP requests on the same threads would help.

aldettinger avatar Jun 02 '22 15:06 aldettinger

A month without issue, I've removed some console logs to restore flakiness. On next occurence:

  • Check the output for logs from .log("CAMEL-QUARKUS-3584 : 2 routes
  • Try with quarkus.vertx.worker-pool-size=1 to check whether running all HTTP requests on the same threads would help.

aldettinger avatar Jun 13 '22 15:06 aldettinger

Caught another occurences with logs below:

[main_upstream @ integration-tests-alternative-jdk]$ grep -R  CAMEL-QUARKUS-3584
9_cd integration-tests && mvn clean verify.txt:2022-06-17T02:58:50.0154647Z 2022-06-17 02:58:50,014 INFO  [route4] (Camel (camel-1) thread #5 - file://target/filter) CAMEL-QUARKUS-3584 =>  from("file://target/filter?initialDelay=0&delay=10&filter=#myFilter")... consumed message Hello2
9_cd integration-tests && mvn clean verify.txt:2022-06-17T02:59:02.1921702Z 2022-06-17 02:59:02,190 INFO  [route3] (Camel (camel-1) thread #4 - file://target/idempotent) CAMEL-QUARKUS-3584 =>  from("file://target/idempotent?idempotent=true&move=done/57FB490B1C9DAC6-0000000000000017&initialDelay=0&delay=10")... consumed message Hello1
9_cd integration-tests && mvn clean verify.txt:2022-06-17T02:59:02.3313226Z 2022-06-17 02:59:02,329 INFO  [route3] (Camel (camel-1) thread #4 - file://target/idempotent) CAMEL-QUARKUS-3584 =>  from("file://target/idempotent?idempotent=true&move=done/57FB490B1C9DAC6-0000000000000019&initialDelay=0&delay=10")... consumed message 

Actually, the last message consumed is empty while it should have content. Atomic move may improve the situation. However, not sure it would explain while filter would be flaky. Let's see if atomic move explains the situation and whether there are still other flakiness.

aldettinger avatar Jun 17 '22 09:06 aldettinger

The pr is no more flaky since almost 4 weeks. Let's remove the last logs to ensure flakiness is fixed. Worst case, we could reopen yet another time.

aldettinger avatar Jul 12 '22 14:07 aldettinger

Actually, it's still there: https://github.com/apache/camel-quarkus/runs/7760617856?check_suite_focus=true#step:8:347

The most probable explanation is that the logs in the camel route were lowering the probability of yet another flakiness issue.

aldettinger avatar Aug 10 '22 07:08 aldettinger

Yet another occurence, in the idempotent test this time: https://github.com/apache/camel-quarkus/runs/7989610885?check_suite_focus=true#step:8:7465

aldettinger avatar Aug 24 '22 09:08 aldettinger

There is no way to reproduce locally. The most probable scenario is that filter and idempotent test are somehow polluting each other. Let's restore filter => we should observe the failure in idempotent or filter test. And then, probably in a next step we could disable idempotent => we should observe no more failure. That way, we could have indication that the issue is linked to this couple of tests.

aldettinger avatar Feb 01 '23 12:02 aldettinger

Ok, caught 2x occurences of idempotent failing test (windows test):

org.awaitility.core.ConditionTimeoutException: org.apache.camel.quarkus.component.file.it.FileTest expected "Hello1" but was "" within 10 seconds.
	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
	at org.awaitility.core.AbstractHamcrestCondition.await(AbstractHamcrestCondition.java:86)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:985)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:691)
	at org.apache.camel.quarkus.component.file.it.FileTest.idempotent(FileTest.java:119)

aldettinger avatar Feb 10 '23 17:02 aldettinger

Another occurrence but this time with filter test failing (windows test):

2023-04-12T11:24:53.4282279Z [ERROR] Tests run: 10, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 23.569 s <<< FAILURE! - in org.apache.camel.quarkus.component.file.it.FileTest
2023-04-12T11:24:53.4284199Z [ERROR] filter  Time elapsed: 10.227 s  <<< ERROR!
2023-04-12T11:24:53.4386611Z org.awaitility.core.ConditionTimeoutException: org.apache.camel.quarkus.component.file.it.FileTest expected "Hello2" but was "" within 10 seconds.
2023-04-12T11:24:53.4388402Z 	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
2023-04-12T11:24:53.4401840Z 	at org.awaitility.core.AbstractHamcrestCondition.await(AbstractHamcrestCondition.java:86)
2023-04-12T11:24:53.4403033Z 	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:985)
2023-04-12T11:24:53.4404071Z 	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:691)
2023-04-12T11:24:53.4455333Z 	at org.apache.camel.quarkus.component.file.it.FileTest.filter(FileTest.java:158)
2023-04-12T11:24:53.4456538Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

So the issue could really occur on both filter and idempotent. Now let's try to remove idempotent and check that filter is no more failing.

aldettinger avatar Apr 12 '23 14:04 aldettinger

Actually, the filter test has failed once:

2023-05-10T10:39:01.1963304Z [ERROR] Tests run: 10, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 23.135 s <<< FAILURE! - in org.apache.camel.quarkus.component.file.it.FileTest
2023-05-10T10:39:01.2016829Z [ERROR] filter  Time elapsed: 10.183 s  <<< ERROR!
2023-05-10T10:39:01.2017670Z org.awaitility.core.ConditionTimeoutException: org.apache.camel.quarkus.component.file.it.FileTest expected "Hello2" but was "" within 10 seconds.
2023-05-10T10:39:01.2018398Z 	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
2023-05-10T10:39:01.2019038Z 	at org.awaitility.core.AbstractHamcrestCondition.await(AbstractHamcrestCondition.java:86)
2023-05-10T10:39:01.2019657Z 	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:985)
2023-05-10T10:39:01.2020217Z 	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:691)
2023-05-10T10:39:01.2020799Z 	at org.apache.camel.quarkus.component.file.it.FileTest.filter(FileTest.java:160)

So filter and idempotent tests are not polluting each other. This is just that the frequency of failure is very low. Let's restore idempotent then.

aldettinger avatar May 31 '23 07:05 aldettinger

I wonder if we should just disable the problem tests (perhaps just on Windows)? Or provide Surefire with the config to rerun it at least one time on failure.

jamesnetherton avatar Jun 07 '23 12:06 jamesnetherton

As far as I remember, the tests also fails without windows. If we disable them, then we'll have no chance to investigate as the failures can't be reproduced locally. I was considering about rewriting some of those tests from scratch in a more cleaner way... not sure it would be feasible for all of them. I could look at it when I have time.

aldettinger avatar Jun 07 '23 13:06 aldettinger

The filter and idempotent tests have been rewritten from scratch in a new test harness: https://github.com/apache/camel-quarkus/commit/c3ee20ad11ef7771246a0595042cfc4dd9e52bac

Let's give some time to check whether they are still flaky. If not, maybe we could rewrite all tests.

aldettinger avatar Jun 15 '23 16:06 aldettinger

On July 20th, the rewritten file tests are not flaky for more than a month. The next step would be to rewrite all tests with the new approach.

aldettinger avatar Jul 20 '23 13:07 aldettinger

2 monthes without flakiness. Time to migrate more tests to the new test harness.

aldettinger avatar Aug 17 '23 15:08 aldettinger

Every tests have been migrated to the new harness and no more flakiness on ci. Let's see how it behaves on quarkus platform ci on next release.

aldettinger avatar Mar 06 '24 07:03 aldettinger

The new harness is also fixing issues on the quarkus-platform as shown in this pr.

aldettinger avatar Mar 25 '24 13:03 aldettinger