NHC healthy delay
Why we need this PR
We'd like to enable a configuration which delays node returning to health. In case a delay is configured NHC will delete the remediation only after a node was healthy for the configured time. A negative value means the node will not be considered healthy and a manual intervention is expected.
Motivation for this came from several customers which require more control on when taints are removed from the node and experienced use case where node regains health for a short period of time.
Changes made
Adding configuration which enables considering node unhealthy until a period of time has passed. An annotation on CR is used to manage the delay, and a status update is added on the CR to note isn't healthy because of the delay
Which issue(s) this PR fixes
Test plan
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
/test 4.17-openshift-e2e
/test 4.17-openshift-e2e
Walkthrough
This update introduces a configurable "healthy delay" feature to the NodeHealthCheck (NHC) system. A new field allows specifying a delay before a node is considered healthy after recovery, with corresponding API, CRD, and status schema changes. The reconciliation logic, resource manager, and tests are updated to support and verify delayed healthy state recognition and remediation CR deletion.
Changes
| File(s) | Change Summary |
|---|---|
| api/v1alpha1/nodehealthcheck_types.go, bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml, config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml | Added healthyDelay field to NHC spec and healthyDelayed to unhealthy node status; updated CRD and schema for new fields. |
| api/v1alpha1/zz_generated.deepcopy.go | Updated DeepCopyInto for NodeHealthCheckSpec and UnhealthyNode to copy new fields HealthyDelay and HealthyDelayed. |
| bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml, config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml | Added spec and status descriptors for healthyDelay and healthyDelayed in CSV manifests. |
| controllers/resources/manager.go | Modified HandleHealthyNode to support delayed remediation CR deletion; added delay calculation logic and new annotation handling; updated CleanUp method. |
| controllers/resources/status.go | Added functions to update and check delayed healthy status in NHC status. |
| controllers/nodehealthcheck_controller.go | Passed HealthyDelay to context; updated healthy node handling to process and propagate delay for reconciliation requeue; updated cleanup call. |
| controllers/machinehealthcheck_controller.go | Adjusted to ignore new return value from HandleHealthyNode signature change. |
| controllers/nodehealthcheck_controller_test.go | Refactored test helper for healthy node; added tests for healthy delay behavior including indefinite delay and manual confirmation. |
| controllers/shared.go | Added annotation-based reconciliation trigger for manual healthy confirmation annotation changes. |
| docs/configuration.md | Documented the healthyDelay field, its usage, rationale, and manual intervention methods. |
Sequence Diagram(s)
sequenceDiagram
participant Reconciler
participant ResourceManager
participant Node
participant RemediationCR
Reconciler->>ResourceManager: HandleHealthyNode(nodeName, crName, owner)
ResourceManager->>RemediationCR: List remediation CRs for node
loop For each remediation CR
ResourceManager->>RemediationCR: Check for healthy delay annotation
alt No annotation
ResourceManager->>RemediationCR: Set delay start annotation (now)
ResourceManager-->>Reconciler: Return delay duration
else Annotation present
ResourceManager->>ResourceManager: Calculate time left
alt Delay not expired
ResourceManager-->>Reconciler: Return remaining delay
else Delay expired
ResourceManager->>RemediationCR: Delete CR
end
end
end
ResourceManager-->>Reconciler: Return shortest delay for requeue
Suggested labels
ok-to-test
Poem
Hopping through code with a healthy delay,
Now nodes take time before they can say
"I'm healthy again!"βnot too soon, not too late,
Remediation CRs pause at the gate.
With tests that cheer and status anew,
This rabbitβs proud of what you do!
πβ³
π Recent review details
Configuration used: CodeRabbit UI Review profile: CHILL Plan: Pro
π₯ Commits
Reviewing files that changed from the base of the PR and between 226c757d0bb0d09bca11c207e821bf92ca9e57d4 and 1ec21b4134ae2934dd136150a164f4ae05fba5d8.
π Files selected for processing (13)
-
api/v1alpha1/nodehealthcheck_types.go(2 hunks) -
api/v1alpha1/zz_generated.deepcopy.go(2 hunks) -
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml(2 hunks) -
bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml(2 hunks) -
config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml(2 hunks) -
config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml(2 hunks) -
controllers/machinehealthcheck_controller.go(1 hunks) -
controllers/nodehealthcheck_controller.go(3 hunks) -
controllers/nodehealthcheck_controller_test.go(5 hunks) -
controllers/resources/manager.go(4 hunks) -
controllers/resources/status.go(2 hunks) -
controllers/shared.go(3 hunks) -
docs/configuration.md(2 hunks)
π§ Files skipped from review as they are similar to previous changes (10)
- controllers/nodehealthcheck_controller.go
- controllers/machinehealthcheck_controller.go
- config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml
- bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml
- api/v1alpha1/zz_generated.deepcopy.go
- bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml
- api/v1alpha1/nodehealthcheck_types.go
- controllers/resources/status.go
- config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml
- controllers/nodehealthcheck_controller_test.go
π§° Additional context used
π§ Learnings (4)
π Common learnings
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).
controllers/shared.go (3)
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.
docs/configuration.md (3)
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.
controllers/resources/manager.go (3)
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.
𧬠Code Graph Analysis (1)
controllers/shared.go (1)
controllers/resources/manager.go (1)
RemediationManuallyConfirmedHealthyAnnotationKey(35-35)
πͺ LanguageTool
docs/configuration.md
[uncategorized] ~202-~202: Possible missing comma found. Context: ... Briefly regain health for a very short period only to become unhealthy again. - Requi...
(AI_HYDRA_LEO_MISSING_COMMA)
π Additional comments (9)
controllers/shared.go (2)
77-83: LGTM! Clean implementation following established patterns.The new
annotationsNeedReconcilefunction correctly follows the same pattern aslabelsNeedReconcile, checking for presence changes of the manual confirmation annotation. This ensures reconciliation is properly triggered when the annotation is added or removed from nodes.
29-31: Good integration with existing reconciliation logic.The addition of
annotationsNeedReconcileto the existing OR condition chain is well-placed and maintains the logical flow of the reconciliation trigger checks.docs/configuration.md (2)
195-226: Excellent comprehensive documentation for the healthyDelay feature.The documentation thoroughly covers:
- Clear motivation and use cases (node "flapping" and validation periods)
- Detailed explanation of delay behavior
- Two distinct manual intervention methods with step-by-step instructions
- Node-specific vs CR-wide impact considerations
This provides users with all the information needed to understand and use the feature effectively.
61-61: Well-integrated table entry for the new field.The healthyDelay field documentation fits naturally into the existing spec table format and provides a concise but complete description.
controllers/resources/manager.go (5)
30-36: Well-defined constants for the healthy delay feature.The constants are appropriately named and documented:
HealthyDelayContextKeyfor context value passingRemediationHealthyDelayAnnotationKeyfor CR delay trackingRemediationManuallyConfirmedHealthyAnnotationKeyfor manual node overrideThe naming convention follows existing patterns and the comments clearly explain their purpose.
321-383: Robust implementation of healthy node handling with delay support.The enhanced
HandleHealthyNodemethod correctly:
- Returns requeue duration for delay management
- Integrates manual confirmation annotation checking
- Implements delay logic with proper error handling
- Uses clear control flow with the
shouldDeleteflag- Updates status appropriately when delays are active
The implementation properly handles all delay scenarios and maintains backward compatibility.
385-425: Comprehensive delay calculation with proper time management.The
calcCrDeletionDelaymethod effectively handles all delay scenarios:
- No delay configured: immediate deletion
- Zero delay: immediate deletion
- Negative delay: permanent postponement
- Positive delay: time-based postponement with RFC3339 timestamp tracking
The annotation-based persistence approach ensures delay state survives reconciliation cycles, and the time calculations are accurate.
427-454: Clean annotation management helpers.The helper methods
isManuallyConfirmedHealthyAnnotationSetandremoveConfirmedHealthyAnnotationare well-implemented:
- Proper error handling for node retrieval
- Safe annotation checking and removal
- Clear method names and functionality
These provide good separation of concerns for annotation management.
456-465: Appropriate cleanup integration.The updated
CleanUpmethod properly integrates manual confirmation handling by removing the annotation when all CRs are deleted. The boolean parameter clearly indicates when this cleanup should occur.
β¨ Finishing Touches
- [ ] π Generate Docstrings
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
πͺ§ Tips
Chat
There are 3 ways to chat with CodeRabbit:
- Review comments: Directly reply to a review comment made by CodeRabbit. Example:
-
I pushed a fix in commit <commit_id>, please review it. -
Explain this complex logic. -
Open a follow-up GitHub issue for this discussion.
-
- Files and specific lines of code (under the "Files changed" tab): Tag
@coderabbitaiin a new review comment at the desired location with your query. Examples:-
@coderabbitai explain this code block. -
@coderabbitai modularize this function.
-
- PR comments: Tag
@coderabbitaiin a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:-
@coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase. -
@coderabbitai read src/utils.ts and explain its main purpose. -
@coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format. -
@coderabbitai help me debug CodeRabbit configuration file.
-
Support
Need help? Create a ticket on our support page for assistance with any issues or questions.
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.
CodeRabbit Commands (Invoked using PR comments)
-
@coderabbitai pauseto pause the reviews on a PR. -
@coderabbitai resumeto resume the paused reviews. -
@coderabbitai reviewto trigger an incremental review. This is useful when automatic reviews are disabled for the repository. -
@coderabbitai full reviewto do a full review from scratch and review all the files again. -
@coderabbitai summaryto regenerate the summary of the PR. -
@coderabbitai generate docstringsto generate docstrings for this PR. -
@coderabbitai generate sequence diagramto generate a sequence diagram of the changes in this PR. -
@coderabbitai resolveresolve all the CodeRabbit review comments. -
@coderabbitai configurationto show the current CodeRabbit configuration for the repository. -
@coderabbitai helpto get help.
Other keywords and placeholders
- Add
@coderabbitai ignoreanywhere in the PR description to prevent this PR from being reviewed. - Add
@coderabbitai summaryto generate the high-level summary at a specific location in the PR description. - Add
@coderabbitaianywhere in the PR title to generate the title automatically.
CodeRabbit Configuration File (.coderabbit.yaml)
- You can programmatically configure CodeRabbit by adding a
.coderabbit.yamlfile to the root of your repository. - Please see the configuration documentation for more information.
- If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation:
# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
Documentation and Community
- Visit our Documentation for detailed information on how to use CodeRabbit.
- Join our Discord Community to get help, request features, and share feedback.
- Follow us on X/Twitter for updates and announcements.
giving the chance to get further reviews
//cc @slintes
@coderabbitai review
β Actions performed
Review triggered.
Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.
@coderabbitai full review
β Actions performed
Full review triggered.
nodeUpdateNeedsReconcile()needs to handle annotations
Good point. IIUC currently this part being omitted means that the annotation change does not trigger the reconcile. I wonder though why did the unit test pass. Maybe the reconcile was triggered due to a different reason, node status heartbeat maybe :thinking:
/test 4.17-openshift-e2e
/test 4.16-openshift-e2e /test 4.17-openshift-e2e
/test 4.16-openshift-e2e /test 4.17-openshift-e2e
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mshitrit, slintes
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [mshitrit,slintes]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/retest
/override 4.20-openshift-e2e
@mshitrit: /override requires failed status contexts, check run or a prowjob name to operate on. The following unknown contexts/checkruns were given:
-
4.20-openshift-e2e
Only the following failed contexts/checkruns were expected:
-
CodeRabbit -
ci/prow/4.16-ci-bundle-my-bundle -
ci/prow/4.16-images -
ci/prow/4.16-openshift-e2e -
ci/prow/4.16-test -
ci/prow/4.17-ci-bundle-my-bundle -
ci/prow/4.17-images -
ci/prow/4.17-openshift-e2e -
ci/prow/4.17-test -
ci/prow/4.18-ci-bundle-my-bundle -
ci/prow/4.18-images -
ci/prow/4.18-openshift-e2e -
ci/prow/4.18-test -
ci/prow/4.19-ci-bundle-my-bundle -
ci/prow/4.19-images -
ci/prow/4.19-openshift-e2e -
ci/prow/4.19-test -
ci/prow/4.20-ci-bundle-my-bundle -
ci/prow/4.20-images -
ci/prow/4.20-openshift-e2e -
ci/prow/4.20-test -
pull-ci-medik8s-node-healthcheck-operator-main-4.16-ci-bundle-my-bundle -
pull-ci-medik8s-node-healthcheck-operator-main-4.16-images -
pull-ci-medik8s-node-healthcheck-operator-main-4.16-openshift-e2e -
pull-ci-medik8s-node-healthcheck-operator-main-4.16-test -
pull-ci-medik8s-node-healthcheck-operator-main-4.17-ci-bundle-my-bundle -
pull-ci-medik8s-node-healthcheck-operator-main-4.17-images -
pull-ci-medik8s-node-healthcheck-operator-main-4.17-openshift-e2e -
pull-ci-medik8s-node-healthcheck-operator-main-4.17-test -
pull-ci-medik8s-node-healthcheck-operator-main-4.18-ci-bundle-my-bundle -
pull-ci-medik8s-node-healthcheck-operator-main-4.18-images -
pull-ci-medik8s-node-healthcheck-operator-main-4.18-openshift-e2e -
pull-ci-medik8s-node-healthcheck-operator-main-4.18-test -
pull-ci-medik8s-node-healthcheck-operator-main-4.19-ci-bundle-my-bundle -
pull-ci-medik8s-node-healthcheck-operator-main-4.19-images -
pull-ci-medik8s-node-healthcheck-operator-main-4.19-openshift-e2e -
pull-ci-medik8s-node-healthcheck-operator-main-4.19-test -
pull-ci-medik8s-node-healthcheck-operator-main-4.20-ci-bundle-my-bundle -
pull-ci-medik8s-node-healthcheck-operator-main-4.20-images -
pull-ci-medik8s-node-healthcheck-operator-main-4.20-openshift-e2e -
pull-ci-medik8s-node-healthcheck-operator-main-4.20-test -
tide
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.
In response to this:
/override 4.20-openshift-e2e
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/override ci/prow/4.20-openshift-e2e
@mshitrit: Overrode contexts on behalf of mshitrit: ci/prow/4.20-openshift-e2e
In response to this:
/override ci/prow/4.20-openshift-e2e
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.