node-healthcheck-operator NHC healthy delay

Why we need this PR

We'd like to enable a configuration which delays node returning to health. In case a delay is configured NHC will delete the remediation only after a node was healthy for the configured time. A negative value means the node will not be considered healthy and a manual intervention is expected.

Motivation for this came from several customers which require more control on when taints are removed from the node and experienced use case where node regains health for a short period of time.

Changes made

Adding configuration which enables considering node unhealthy until a period of time has passed. An annotation on CR is used to manage the delay, and a status update is added on the CR to note isn't healthy because of the delay

Which issue(s) this PR fixes

RHWA-10

Test plan

Apr 15 '25 13:04 mshitrit

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

Apr 15 '25 13:04 openshift-ci[bot]

/test 4.17-openshift-e2e

Apr 21 '25 07:04 mshitrit

/test 4.17-openshift-e2e

May 11 '25 19:05 mshitrit

Walkthrough

This update introduces a configurable "healthy delay" feature to the NodeHealthCheck (NHC) system. A new field allows specifying a delay before a node is considered healthy after recovery, with corresponding API, CRD, and status schema changes. The reconciliation logic, resource manager, and tests are updated to support and verify delayed healthy state recognition and remediation CR deletion.

Changes

File(s)	Change Summary
api/v1alpha1/nodehealthcheck_types.go, bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml, config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml	Added `healthyDelay` field to NHC spec and `healthyDelayed` to unhealthy node status; updated CRD and schema for new fields.
api/v1alpha1/zz_generated.deepcopy.go	Updated `DeepCopyInto` for `NodeHealthCheckSpec` and `UnhealthyNode` to copy new fields `HealthyDelay` and `HealthyDelayed`.
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml, config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml	Added spec and status descriptors for `healthyDelay` and `healthyDelayed` in CSV manifests.
controllers/resources/manager.go	Modified `HandleHealthyNode` to support delayed remediation CR deletion; added delay calculation logic and new annotation handling; updated `CleanUp` method.
controllers/resources/status.go	Added functions to update and check delayed healthy status in NHC status.
controllers/nodehealthcheck_controller.go	Passed `HealthyDelay` to context; updated healthy node handling to process and propagate delay for reconciliation requeue; updated cleanup call.
controllers/machinehealthcheck_controller.go	Adjusted to ignore new return value from `HandleHealthyNode` signature change.
controllers/nodehealthcheck_controller_test.go	Refactored test helper for healthy node; added tests for healthy delay behavior including indefinite delay and manual confirmation.
controllers/shared.go	Added annotation-based reconciliation trigger for manual healthy confirmation annotation changes.
docs/configuration.md	Documented the `healthyDelay` field, its usage, rationale, and manual intervention methods.

Sequence Diagram(s)

sequenceDiagram
    participant Reconciler
    participant ResourceManager
    participant Node
    participant RemediationCR

    Reconciler->>ResourceManager: HandleHealthyNode(nodeName, crName, owner)
    ResourceManager->>RemediationCR: List remediation CRs for node
    loop For each remediation CR
        ResourceManager->>RemediationCR: Check for healthy delay annotation
        alt No annotation
            ResourceManager->>RemediationCR: Set delay start annotation (now)
            ResourceManager-->>Reconciler: Return delay duration
        else Annotation present
            ResourceManager->>ResourceManager: Calculate time left
            alt Delay not expired
                ResourceManager-->>Reconciler: Return remaining delay
            else Delay expired
                ResourceManager->>RemediationCR: Delete CR
            end
        end
    end
    ResourceManager-->>Reconciler: Return shortest delay for requeue

Suggested labels

ok-to-test

Poem

Hopping through code with a healthy delay,
Now nodes take time before they can say
"I'm healthy again!"—not too soon, not too late,
Remediation CRs pause at the gate.
With tests that cheer and status anew,
This rabbit’s proud of what you do!
🐇⏳

📜 Recent review details

Configuration used: CodeRabbit UI Review profile: CHILL Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 226c757d0bb0d09bca11c207e821bf92ca9e57d4 and 1ec21b4134ae2934dd136150a164f4ae05fba5d8.

📒 Files selected for processing (13)

api/v1alpha1/nodehealthcheck_types.go (2 hunks)
api/v1alpha1/zz_generated.deepcopy.go (2 hunks)
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (2 hunks)
bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (2 hunks)
config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (2 hunks)
config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (2 hunks)
controllers/machinehealthcheck_controller.go (1 hunks)
controllers/nodehealthcheck_controller.go (3 hunks)
controllers/nodehealthcheck_controller_test.go (5 hunks)
controllers/resources/manager.go (4 hunks)
controllers/resources/status.go (2 hunks)
controllers/shared.go (3 hunks)
docs/configuration.md (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (10)

controllers/nodehealthcheck_controller.go
controllers/machinehealthcheck_controller.go
config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml
api/v1alpha1/zz_generated.deepcopy.go
bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml
api/v1alpha1/nodehealthcheck_types.go
controllers/resources/status.go
config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml
controllers/nodehealthcheck_controller_test.go

🧰 Additional context used

🧠 Learnings (4)

📓 Common learnings

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).

controllers/shared.go (3)

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.

docs/configuration.md (3)

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.

controllers/resources/manager.go (3)

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).

Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.

🧬 Code Graph Analysis (1)

controllers/shared.go (1)

controllers/resources/manager.go (1)

RemediationManuallyConfirmedHealthyAnnotationKey (35-35)

🪛 LanguageTool

docs/configuration.md

[uncategorized] ~202-~202: Possible missing comma found. Context: ... Briefly regain health for a very short period only to become unhealthy again. - Requi...

(AI_HYDRA_LEO_MISSING_COMMA)

🔇 Additional comments (9)

controllers/shared.go (2)

77-83: LGTM! Clean implementation following established patterns.

The new annotationsNeedReconcile function correctly follows the same pattern as labelsNeedReconcile, checking for presence changes of the manual confirmation annotation. This ensures reconciliation is properly triggered when the annotation is added or removed from nodes.

29-31: Good integration with existing reconciliation logic.

The addition of annotationsNeedReconcile to the existing OR condition chain is well-placed and maintains the logical flow of the reconciliation trigger checks.

docs/configuration.md (2)

195-226: Excellent comprehensive documentation for the healthyDelay feature.

The documentation thoroughly covers:

Clear motivation and use cases (node "flapping" and validation periods)

Detailed explanation of delay behavior

Two distinct manual intervention methods with step-by-step instructions

Node-specific vs CR-wide impact considerations

This provides users with all the information needed to understand and use the feature effectively.

61-61: Well-integrated table entry for the new field.

The healthyDelay field documentation fits naturally into the existing spec table format and provides a concise but complete description.

controllers/resources/manager.go (5)

30-36: Well-defined constants for the healthy delay feature.

The constants are appropriately named and documented:

HealthyDelayContextKey for context value passing

RemediationHealthyDelayAnnotationKey for CR delay tracking

RemediationManuallyConfirmedHealthyAnnotationKey for manual node override

The naming convention follows existing patterns and the comments clearly explain their purpose.

321-383: Robust implementation of healthy node handling with delay support.

The enhanced HandleHealthyNode method correctly:

Returns requeue duration for delay management

Integrates manual confirmation annotation checking

Implements delay logic with proper error handling

Uses clear control flow with the shouldDelete flag

Updates status appropriately when delays are active

The implementation properly handles all delay scenarios and maintains backward compatibility.

385-425: Comprehensive delay calculation with proper time management.

The calcCrDeletionDelay method effectively handles all delay scenarios:

No delay configured: immediate deletion

Zero delay: immediate deletion

Negative delay: permanent postponement

Positive delay: time-based postponement with RFC3339 timestamp tracking

The annotation-based persistence approach ensures delay state survives reconciliation cycles, and the time calculations are accurate.

427-454: Clean annotation management helpers.

The helper methods isManuallyConfirmedHealthyAnnotationSet and removeConfirmedHealthyAnnotation are well-implemented:

Proper error handling for node retrieval

Safe annotation checking and removal

Clear method names and functionality

These provide good separation of concerns for annotation management.

456-465: Appropriate cleanup integration.

The updated CleanUp method properly integrates manual confirmation handling by removing the annotation when all CRs are deleted. The boolean parameter clearly indicates when this cleanup should occur.

✨ Finishing Touches

[ ] 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

May 15 '25 07:05 coderabbitai[bot]

giving the chance to get further reviews

//cc @slintes

May 19 '25 08:05 mshitrit

@coderabbitai review

May 20 '25 15:05 slintes

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

May 20 '25 15:05 coderabbitai[bot]

@coderabbitai full review

May 20 '25 15:05 slintes

✅ Actions performed

Full review triggered.

May 20 '25 15:05 coderabbitai[bot]

nodeUpdateNeedsReconcile() needs to handle annotations

Good point. IIUC currently this part being omitted means that the annotation change does not trigger the reconcile. I wonder though why did the unit test pass. Maybe the reconcile was triggered due to a different reason, node status heartbeat maybe :thinking:

Jun 25 '25 14:06 mshitrit

/test 4.17-openshift-e2e

Jun 30 '25 08:06 slintes

/test 4.16-openshift-e2e /test 4.17-openshift-e2e

Jun 30 '25 12:06 slintes

/test 4.16-openshift-e2e /test 4.17-openshift-e2e

Jul 01 '25 08:07 slintes

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mshitrit, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [mshitrit,slintes]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Jul 03 '25 14:07 openshift-ci[bot]

/retest

Jul 08 '25 07:07 mshitrit

/override 4.20-openshift-e2e

Jul 08 '25 11:07 mshitrit

@mshitrit: /override requires failed status contexts, check run or a prowjob name to operate on. The following unknown contexts/checkruns were given:

4.20-openshift-e2e

Only the following failed contexts/checkruns were expected:

CodeRabbit
ci/prow/4.16-ci-bundle-my-bundle
ci/prow/4.16-images
ci/prow/4.16-openshift-e2e
ci/prow/4.16-test
ci/prow/4.17-ci-bundle-my-bundle
ci/prow/4.17-images
ci/prow/4.17-openshift-e2e
ci/prow/4.17-test
ci/prow/4.18-ci-bundle-my-bundle
ci/prow/4.18-images
ci/prow/4.18-openshift-e2e
ci/prow/4.18-test
ci/prow/4.19-ci-bundle-my-bundle
ci/prow/4.19-images
ci/prow/4.19-openshift-e2e
ci/prow/4.19-test
ci/prow/4.20-ci-bundle-my-bundle
ci/prow/4.20-images
ci/prow/4.20-openshift-e2e
ci/prow/4.20-test
pull-ci-medik8s-node-healthcheck-operator-main-4.16-ci-bundle-my-bundle
pull-ci-medik8s-node-healthcheck-operator-main-4.16-images
pull-ci-medik8s-node-healthcheck-operator-main-4.16-openshift-e2e
pull-ci-medik8s-node-healthcheck-operator-main-4.16-test
pull-ci-medik8s-node-healthcheck-operator-main-4.17-ci-bundle-my-bundle
pull-ci-medik8s-node-healthcheck-operator-main-4.17-images
pull-ci-medik8s-node-healthcheck-operator-main-4.17-openshift-e2e
pull-ci-medik8s-node-healthcheck-operator-main-4.17-test
pull-ci-medik8s-node-healthcheck-operator-main-4.18-ci-bundle-my-bundle
pull-ci-medik8s-node-healthcheck-operator-main-4.18-images
pull-ci-medik8s-node-healthcheck-operator-main-4.18-openshift-e2e
pull-ci-medik8s-node-healthcheck-operator-main-4.18-test
pull-ci-medik8s-node-healthcheck-operator-main-4.19-ci-bundle-my-bundle
pull-ci-medik8s-node-healthcheck-operator-main-4.19-images
pull-ci-medik8s-node-healthcheck-operator-main-4.19-openshift-e2e
pull-ci-medik8s-node-healthcheck-operator-main-4.19-test
pull-ci-medik8s-node-healthcheck-operator-main-4.20-ci-bundle-my-bundle
pull-ci-medik8s-node-healthcheck-operator-main-4.20-images
pull-ci-medik8s-node-healthcheck-operator-main-4.20-openshift-e2e
pull-ci-medik8s-node-healthcheck-operator-main-4.20-test
tide

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

In response to this:

/override 4.20-openshift-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jul 08 '25 11:07 openshift-ci[bot]

/override ci/prow/4.20-openshift-e2e

Jul 08 '25 11:07 mshitrit

@mshitrit: Overrode contexts on behalf of mshitrit: ci/prow/4.20-openshift-e2e

In response to this:

/override ci/prow/4.20-openshift-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jul 08 '25 11:07 openshift-ci[bot]

NHC healthy delay

Why we need this PR

Changes made

Which issue(s) this PR fixes

Test plan

Walkthrough

Changes

Sequence Diagram(s)

Suggested labels

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

CodeRabbit Configuration File (`.coderabbit.yaml`)