gitops-engine

fix: pod with a restart policy of Never or OnFailure stuck at 'Progressing' (#15317)

Open RoelofKuijpers opened this issue 9 months ago • 12 comments

This change extends the health check for pods. Previously, pods with a restart policy of Never or OnFailure were assumed to be hooks with a finite lifetime, so they were reported as Progressing instead of Healthy. That assumption does not hold when the pod is managed by an operator (e.g., the Flink operator) and therefore has a restart policy of Never. We introduce a new annotation whose existence is checked when the pod is Running; if it is present, the restart-policy logic is skipped.
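As a rough illustration of the case described above, here is a sketch of an operator-managed pod that opts out of the restart-policy logic. The annotation key and value shown are placeholders invented for this example (the actual name is defined in the PR diff, which is not quoted in this thread), and the pod name, image, and command are likewise assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: operator-managed-pod            # hypothetical name
  annotations:
    # Placeholder key and value, NOT the real annotation from this PR.
    example.com/skip-restart-policy-check: "true"
spec:
  # Set by the operator; without the annotation, a Running pod with this
  # policy would be reported as Progressing.
  restartPolicy: Never
  containers:
    - name: main
      image: alpine:3.21
      command: ["sleep", "3600"]
```

With the annotation present, the health check would be expected to evaluate this Running pod without applying the restart-policy assumption.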

RoelofKuijpers avatar Apr 09 '25 07:04 RoelofKuijpers

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 47.34%. Comparing base (8849c3f) to head (b216058). :warning: Report is 55 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #709      +/-   ##
==========================================
- Coverage   54.26%   47.34%   -6.93%     
==========================================
  Files          64       64              
  Lines        6164     6537     +373     
==========================================
- Hits         3345     3095     -250     
- Misses       2549     3187     +638     
+ Partials      270      255      -15     

codecov[bot] avatar Apr 09 '25 08:04 codecov[bot]

This looks like a good approach to the problem.

drewhemm avatar Apr 09 '25 09:04 drewhemm

The pod manifest needs the following to pass the tests:

  • Compute and storage resources defined
  • The alpine tag needs to use something other than latest, e.g. 3.21
  • Add automountServiceAccountToken: false to the pod spec, as per the Kubernetes docs
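A minimal sketch of a test pod manifest that would satisfy the three points above might look like the following. The pod name, container name, command, and the specific resource quantities are assumptions chosen for illustration, not values taken from the PR.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod                     # hypothetical name
spec:
  restartPolicy: Never
  # Do not mount the service account token into the pod.
  automountServiceAccountToken: false
  containers:
    - name: main                        # hypothetical name
      # Pinned tag instead of latest.
      image: alpine:3.21
      command: ["sleep", "3600"]
      # Compute and storage resources; quantities are illustrative only.
      resources:
        requests:
          cpu: "50m"
          memory: "32Mi"
          ephemeral-storage: "100Mi"
        limits:
          cpu: "100m"
          memory: "64Mi"
```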

drewhemm avatar Apr 09 '25 09:04 drewhemm

@drewhemm I have made the changes you suggested to get a Quality Gate pass

RoelofKuijpers avatar Apr 09 '25 10:04 RoelofKuijpers

Cool, looks like the last blocking issue is the commit sign-off.

drewhemm avatar Apr 09 '25 11:04 drewhemm

A non-blocking issue has been flagged by SonarQube, probably best to resolve it as follows:

resources:
  requests:
    ephemeral-storage: "100Mi"

drewhemm avatar Apr 09 '25 11:04 drewhemm

This needs to be merged to solve lots of subsequent bugs that have been raised.

Liammarwood avatar May 14 '25 21:05 Liammarwood

@christianh814 would you be able to give this PR a review?

Liammarwood avatar May 28 '25 02:05 Liammarwood

@crenshaw-dev would you be able to give this PR a review? Much appreciated!

RoelofKuijpers avatar Jun 11 '25 09:06 RoelofKuijpers

Docs required!

Have added info to the documentation now!

RoelofKuijpers avatar Jul 29 '25 12:07 RoelofKuijpers