kafka KAFKA-17433 Add a deflake GitHub action

This patch adds a "deflake" GitHub action which can be used to run a single JUnit test or suite. It works by parameterizing the --tests Gradle option. If the test extends ClusterTest, the "deflake" workflow can repeat it a number of times by setting the kafka.cluster.test.repeat system property.

This can be done locally as well:

./gradlew -Pkafka.cluster.test.repeat=3 :core:test --tests "*ZkMigrationIntegrationTest*"
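As a rough sketch of the parameterization described above (not the actual workflow script), the Gradle invocation can be assembled from a module, a test pattern, and a repeat count. The helper name `deflake_cmd` and its argument order are hypothetical; only the `--tests` option and the `kafka.cluster.test.repeat` property come from this patch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) the Gradle command for one deflake run.
deflake_cmd() {
  local module="$1"       # Gradle module, e.g. core
  local pattern="$2"      # value passed to --tests
  local repeat="${3:-1}"  # repeat count; defaults to 1 in this sketch
  printf './gradlew -Pkafka.cluster.test.repeat=%s :%s:test --tests "%s"\n' \
    "$repeat" "$module" "$pattern"
}

# Reproduces the local command shown above.
deflake_cmd core "*ZkMigrationIntegrationTest*" 3
```

Printing the command rather than executing it makes it easy to check what would run before spending CI time on it.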

For local testing, IDEA also has options for repeating a test until failure.

Aug 27 '24 18:08 mumrah

An example of the new workflow: https://github.com/mumrah/kafka/actions/runs/10582749491/job/29325023681

Screenshot of the action dispatch dialog

Aug 27 '24 18:08 mumrah

@mumrah I'm not sure whether we should encourage developers to loop flaky tests on GitHub CI. The quota is limited and so it could impact the other flow (normal PR and CI). Also, README offers a simple command to loop tests locally (https://github.com/apache/kafka?tab=readme-ov-file#repeatedly-running-a-particular-unitintegration-test).

Aug 28 '24 01:08 chia7712

@chia7712 thanks for the feedback.

> The quota is limited and so it could impact the other flow (normal PR and CI)

Since this workflow is run manually, I think the impact would be limited. Also, as long as the caller isn't running a whole module's tests, it should only run for a few minutes. I've set a 1-hour timeout on the job to prevent it from using up too much run time.

> Also, README offers a simple command

I didn't realize that :) I think this method of repeating a test is not actually very useful since it's just running the Gradle command over and over. Often, flaky tests only appear when the system is under load. Invoking Gradle in a loop gives the system too much time to "settle" between runs.

This is why I normally use (and recommend) the IntelliJ "Run Until Failed" option while running tests in IntelliJ (not Gradle). It runs in a tight loop and puts some load on the system.

Even still, I've run into plenty of cases where a test is only failing in CI. Not having the ability to run a single test in CI makes it really hard to debug such cases. It essentially means each trial of your bugfix requires waiting for a full CI run. A workaround to this I've used in the past is to alter the Jenkinsfile to just run the tests I want.

I think a dedicated job for running a single test is a better option for developers.

Another benefit of having this new workflow is it gives us easily shareable evidence when submitting test fixes. Instead of a reviewer looking at a single CI run for pass/fail, the author can give a link to a 10x deflake run which gives stronger evidence of the fix.

Let me know what you think

Aug 28 '24 15:08 mumrah

> Even still, I've run into plenty of cases where a test is only failing in CI. Not having the ability to run a single test in CI makes it really hard to debug such cases. It essentially means each trial of your bugfix requires waiting for a full CI run. A workaround to this I've used in the past is to alter the Jenkinsfile to just run the tests I want.

that is true

> Another benefit of having this new workflow is it gives us easily shareable evidence when submitting test fixes. Instead of a reviewer looking at a single CI run for pass/fail, the author can give a link to a 10x deflake run which gives stronger evidence of the fix.

I love this :)

Aug 28 '24 15:08 chia7712

@chia7712 thanks for the reviews! I've incorporated your feedback and tested it locally. Seems to be working 👍

Aug 29 '24 14:08 mumrah

I wrote a short guide on flaky tests on the Kafka wiki: https://cwiki.apache.org/confluence/display/KAFKA/Flaky+Tests

Sep 02 '24 16:09 mumrah