[Improve] auto-probe improvement
Search before asking
- [X] I had searched in the feature and found no similar feature requirement.
Description
This issue is a follow-up to https://github.com/apache/incubator-streampark/issues/2944 and primarily discusses optimizing the auto-probe job process. The diagram below illustrates the entire job auto-probe process.
In optimizing the job auto-probe process, we focused on the following aspects:
- We added a manual job-probe button on the StreamPark front-end page so that jobs which drop out of the probe process after multiple failed attempts can still be re-submitted to the job-probe monitoring system.
- When jobs run on a remote, YARN session, or K8s session cluster, we introduce the following logic to keep the job's state consistent with the deployed cluster's state: a. If a job is probed successfully and is in the RUNNING state while its associated cluster is in the LOST state, we update the cluster's status to RUNNING. b. If, after successfully probing the jobs under a cluster, none of them are in the RUNNING state, the cluster's probe must be triggered manually to update the cluster's status.
- During a probe round, we define the end-of-round criteria as follows: the current round is complete when there are no jobs in the LOST state, or when every job still in the LOST state has reached the maximum probe retry count. At that point, we notify the user of the round's statistical results.
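The end-of-round rule above can be sketched as a small predicate. This is a hypothetical illustration only: the `ProbeJob` record, the `State` enum, and the `MAX_RETRIES` value are assumptions for the sketch, not StreamPark's actual types or configuration.

```java
import java.util.List;

// Hypothetical sketch of the end-of-round criteria described above.
// ProbeJob, State, and MAX_RETRIES are illustrative names, not StreamPark API.
public class ProbeRound {
    public enum State { RUNNING, LOST, FAILED }

    // Assumed maximum probe retry count for this sketch.
    public static final int MAX_RETRIES = 3;

    public record ProbeJob(State state, int retries) {}

    // The round is complete when no job is LOST with retries remaining,
    // i.e. there are no LOST jobs, or all LOST jobs hit the retry cap.
    public static boolean roundComplete(List<ProbeJob> jobs) {
        return jobs.stream().noneMatch(
            j -> j.state() == State.LOST && j.retries() < MAX_RETRIES);
    }
}
```

Expressing the rule as "no LOST job with retries remaining" covers both termination conditions from the bullet above in a single check.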
Usage Scenario
No response
Related issues
https://github.com/apache/incubator-streampark/issues/2944
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thx for your proposal! 🎉 Tracking the status of the cluster is crucial, so let's synchronize some info first.
LOST status: the watcher sends an HTTP request to the Flink job (Flink WebUI) or the cluster (Flink cluster | YARN cluster) but receives no response, due to network issues or high machine load. The job/cluster might still be running normally, or it could have stopped. In this case we can't determine the running status and have to mark it as LOST.
Now, let's discuss monitoring and automatic detection for jobs in the LOST state. I have a few suggestions for this:
- In the FlinkAppHttpWatcher, when a job or cluster is detected as LOST (no response to the HTTP request), I suggest not immediately triggering an alert or notifying the user; we can simply mark the job or cluster as LOST without taking further action.
- In the FlinkAppLostWatcher, re-send the HTTP request for the LOST job or cluster. If a response is received, we should update the job status accordingly: a. if the job is still running, there's no need to notify the user; b. if the job has failed or been canceled, we should notify the user.
- In the FlinkAppLostWatcher, if the job (or cluster) status is still LOST after re-sending the HTTP request, we should continue retrying. However, once the number of retries reaches a certain threshold, we can consider the job truly lost and stop retrying. At that point, we need to notify the user.
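The three suggestions above amount to a small decision policy for each re-probe result. The sketch below is an assumption-laden illustration: `LostWatcherPolicy`, its `Status`/`Action` enums, and the retry threshold are invented names for this example, not StreamPark's actual `FlinkAppLostWatcher` API.

```java
// Hypothetical sketch of the re-probe decision policy described above.
// All names here are illustrative, not StreamPark's actual API.
public class LostWatcherPolicy {
    public enum Status { RUNNING, FAILED, CANCELED, LOST }
    public enum Action { NONE, NOTIFY_USER, RETRY, GIVE_UP_AND_NOTIFY }

    private final int maxRetries; // assumed configurable threshold

    public LostWatcherPolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Decide what to do after one re-probe of a LOST job or cluster.
    public Action onProbeResult(Status probed, int retriesSoFar) {
        switch (probed) {
            case RUNNING:
                return Action.NONE;              // recovered: stay quiet
            case FAILED:
            case CANCELED:
                return Action.NOTIFY_USER;       // terminal state: alert the user
            case LOST:
                return retriesSoFar + 1 >= maxRetries
                    ? Action.GIVE_UP_AND_NOTIFY  // truly lost: stop and alert
                    : Action.RETRY;              // still unknown: keep probing
            default:
                return Action.NONE;
        }
    }
}
```

Keeping the decision pure (a status plus a retry count in, an action out) makes the policy easy to unit-test independently of the HTTP probing itself.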
Thank you for your suggestions; this looks very promising. I will incorporate this logic into the development process.