[FLINK-35103] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

Open swathic95 opened this issue 1 year ago • 1 comments

What is the purpose of the change

https://issues.apache.org/jira/browse/FLINK-35103

Currently, whenever we have flink failures, we need to manually do the triaging by looking into the flink logs even for the initial analysis. It would have been better, if the user/admin directly gets the initial failure information even before looking into the logs.

To address this, we've developed a comprehensive solution via a plugin aimed at helping fetch the Flink failures, ensuring critical data is preserved for subsequent analysis and action.

In Kubernetes environments, troubleshooting pod failures can be challenging without checking the pod/flink logs. Fortunately, Kubernetes offers a robust mechanism to enhance debugging capabilities by leveraging the /dev/termination-log file.

https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/

By writing failure information to this log, Kubernetes automatically incorporates it into the container status, providing administrators and developers with valuable insights into the root cause of failures.

Our solution capitalizes on this Kubernetes feature to seamlessly integrate Flink failure reporting within the container ecosystem. Whenever a Flink encounters an issue, our plugin dynamically captures and logs the pertinent failure information into the /dev/termination-log file. This ensures that Kubernetes recognizes and propagates the failure status throughout the container ecosystem, enabling efficient monitoring and response mechanisms.

By leveraging Kubernetes' native functionality in this manner, our plugin ensures that Flink failure incidents are promptly identified and reflected in the pod status. This technical integration streamlines the debugging process, empowering operators to swiftly diagnose and address issues, thereby minimizing downtime and maximizing system reliability.

In-order to make this plugin generic, by default it doesn't do any action. We can configure this by using

Brief change log

Added a new plugin. By default, it does not do anything. When its configured via the flink-conf, to enable K8STerminationLog, it writes to /dev/termination-log file.

external.log.factory.class : org.apache.flink.externalresource.log.K8SSupportTerminationLog

This will be present in the plugins directory

The pod status would show the error message like below if there are any flink issues :

Others, can add their own implementation if needed in the future to support other use cases for flink on Yarn and standalone.

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end deployment with large payloads (100MB)
Extended integration test for recovery after master (JobManager) failure
Added test that validates that TaskInfo is transferred only once across recoveries
Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Apr 15 '24 06:04 swathic95

CI report:

8b00ef071cfe932d56666b868a496090711345e6 UNKNOWN
1bf5e079c8ee417b25e9c88977e82030d0166ebc Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

Apr 15 '24 06:04 flinkbot