seatunnel icon indicating copy to clipboard operation
seatunnel copied to clipboard

[Fix][E2E] Limit Flink restart attempts to avoid infinite retries masking real errors

Open yzeng1618 opened this issue 3 weeks ago • 2 comments

Purpose of this pull request

This PR limits Flink's restart attempts in E2E tests to avoid infinite retry loops that mask real test failures with timeout errors. image image image

The job continues running with checkpoints triggered every 10 seconds (checkpoint 2, 3, 4, 5... incrementing continuously), and the job never actually fails or exits - it just keeps running in a restart loop until the CI timeout cancels it.

Problem: In the current Flink E2E test configuration, the default restart strategy uses maxNumberRestartAttempts=2147483647 (Integer.MAX_VALUE), causing Flink to restart failed jobs indefinitely. When a test case fails (e.g., assertion validation error), Flink keeps restarting the job instead of failing fast. This results in:

CI jobs timing out after ~1 hour of continuous restart loops Real error messages (like AssertConnectorException: Rule validate failed) being hidden behind generic timeout errors (The operation was canceled) Difficult debugging as developers cannot see the actual failure cause

Solution: Configure a fixed-delay restart strategy with a maximum of 2 restart attempts and 1-second delay between retries. This allows legitimate transient failures to recover while ensuring true test failures fail fast with clear error messages.

Does this PR introduce any user-facing change?

No. This change only affects E2E test infrastructure configuration and does not impact production code or user-facing behavior.

How was this patch tested?

  • The fix was applied to AbstractTestFlinkContainer.java which is used by all Flink-based E2E tests

  • Verified that failed tests now fail quickly with proper error messages instead of timing out after ~1 hour

  • Existing E2E tests continue to pass as the restart strategy still allows for legitimate transient failure recovery (2 retry attempts)

Check list

  • [ ] If any new Jar binary package adding in your PR, please add License Notice according New License Guide
  • [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
  • [ ] If necessary, please update incompatible-changes.md to describe the incompatibility caused by this PR.
  • [ ] If you are contributing the connector code, please check that the following files are updated:
    1. Update plugin-mapping.properties and add new connector information in it
    2. Update the pom file of seatunnel-dist
    3. Add ci label in label-scope-conf
    4. Add e2e testcase in seatunnel-e2e
    5. Update connector plugin_config

yzeng1618 avatar Dec 04 '25 02:12 yzeng1618

image 修改后日志对应的内容是如图所示

yzeng1618 avatar Dec 04 '25 10:12 yzeng1618

The “extra files” I modified are mainly test configurations and test cases that are tightly coupled with the new restart strategy:

  • data_validator_fail.conf and the fake-to-redis-test-readonly-*.conf files: restart is explicitly disabled for tests that are expected to fail, to avoid endless retries and CI timeouts. These changes are a companion adjustment to the global restart limits introduced in AbstractTestFlinkContainer.java.

  • KafkaIT.java and kafka_to_kafka_exactly_once_streaming.conf: the Kafka error-handling logic and the exactly-once streaming scenario are updated to behave deterministically under the new restart/failure semantics, preventing mismatches with test expectations and reducing CI flakiness or unexpected failures.

yzeng1618 avatar Dec 08 '25 07:12 yzeng1618