managed-cluster-config

OSD-25331 - updated prom queries for KubeNodeUnschedulableSRE

vaidehi411 opened this pull request 1 year ago • 4 comments

What type of PR is this?

(bug/feature/cleanup/documentation) Enhancement

What this PR does / why we need it?

Updated the Prometheus query for the KubeNodeUnschedulableSRE alert.

Which Jira/Github issue(s) this PR fixes?

Fixes #

Special notes for your reviewer:

Pre-checks (if applicable):

  • [ ] Tested latest changes against a cluster

  • [ ] Included documentation changes with PR

  • [ ] If this is a new object that is not intended for the FedRAMP environment (if unsure, please reach out to team FedRAMP), please exclude it with:

    matchExpressions:
    - key: api.openshift.com/fedramp
      operator: NotIn
      values: ["true"]
    

vaidehi411 avatar Sep 16 '24 11:09 vaidehi411

/lgtm

chamalabey avatar Oct 11 '24 02:10 chamalabey

Can you explain more about why this change is necessary or update the PR description to link an OSD ticket?

I believe this change will now result in a 2-hour delay before the alert fires, hence I'm wondering what problem it is attempting to solve. We would need the average-over-time condition to be satisfied for 1 hour, and then the alert would have to be pending for another hour before it actually fires.

Edit - nvm about the ticket, I see it in the title. Let me read that first 🤦

Further Edit - My question about the timing of the alert still holds. Additionally, by changing to "average over time", a worker node that is finally drained and removed would keep firing this alert for another hour.

I wonder what we can do about that. Could we add an "AND" condition requiring that at least one kube_node_spec_unschedulable series is still true? Some pseudocode would be something like:

    ( average_over_time(worker node is unschedulable[30m]) AND worker nodes that are unschedulable > 0 )
    unless it's an upgrade node
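To make that concrete, a hedged PromQL sketch of the same idea, using the kube_node_spec_unschedulable metric quoted later in this thread; the upgrade-node exclusion is left as a comment placeholder since the exact selector lives in the existing rule:

    # Sketch only: pair the averaged condition with a check that the node
    # is still unschedulable right now, so the alert clears once the node
    # is drained and removed.
    (
        avg_over_time(kube_node_spec_unschedulable[30m]) == 1
      and
        kube_node_spec_unschedulable == 1
    )
    # ...combined with the rule's existing "unless" exclusion for nodes
    # that are cordoned as part of an upgrade (selector omitted here)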

Yet another edit: I wonder if that AND may be unnecessary, since the node-role = worker check may not return anything if the node doesn't exist, in the case where this is a node needing replacement. That argument would break down, though, if a problem with the machine-config daemon put the node into a SchedulingDisabled state and the machine isn't deleted.

Either way, now that I've worked through this all in my head, I think the only change we'll need to make is to potentially play with the timing. 2h is a long time for a customer to pay for an instance that can't run new workloads before we notify them of the issue.

iamkirkbater avatar Oct 24 '24 10:10 iamkirkbater

@iamkirkbater The purpose of this change is to reduce the number of service logs (SLs) that ocm-agent sends to one when multiple worker nodes are cordoned. "(avg_over_time(kube_node_spec_unschedulable[1h])) == 1" will pick up nodes that have been unschedulable for the entire last hour. We are not changing the alert's fire timing here.

vaidehi411 avatar Oct 25 '24 03:10 vaidehi411

The purpose of this change is to reduce the number of service logs (SLs) that ocm-agent sends to one when multiple worker nodes are cordoned.

Makes sense. Thank you for clarifying; I'll admit I should have read the ticket before adding that comment.

"(avg_over_time(kube_node_spec_unschedulable[1h])) == 1" will pick nodes that are unschedulable for entire last hour. We are not changing the alert fire timing here.

Forgive me if I'm wrong, but I'm under the impression that we would be inadvertently changing the alert fire timing here.

My interpretation of the alert is that by changing to an average over time of 1h, the node would need to be unschedulable for a full hour before the condition is met. I think you're saying the same thing.

However, I think the issue is that the "for" condition on the PrometheusRule is unchanged at 61 minutes, which adds an additional delay before the alert actually fires and results in a service log to the customer.

For example, with our previous alert expression, the alert entered a Pending state as soon as the node became unschedulable. The alert then had to stay Pending for 61 minutes before it moved into a Firing state, at which point OCM Agent would pick it up and send the SL to the customer.

The way I understand this change, the node must now be unschedulable for a full hour, because of the avg_over_time function, before the alert even enters a Pending state; it would then take an additional 61 minutes to move into a Firing state, where it would finally send an SL to the customer.
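To spell out the arithmetic in my reading (assuming the "for" stays at 61 minutes):

    # Old expr: kube_node_spec_unschedulable == 1
    #   node cordoned -> expr true immediately -> Pending
    #   Pending for 61m -> Firing -> OCM Agent sends the SL   (~61m total)
    #
    # New expr: avg_over_time(kube_node_spec_unschedulable[1h]) == 1
    #   node cordoned -> expr true only after ~60m of samples -> Pending
    #   Pending for 61m -> Firing -> OCM Agent sends the SL   (~121m total)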

With all of that said, this is all from memory. If you've tested this alert on an actual cluster and have seen that it sends the SL to the customer ~1h after the node is marked unschedulable, please let me know and I'll cede my argument. If we haven't tested it that thoroughly, I recommend we set the "avg_over_time" function's window to 30 minutes and adjust the "for" part of the PrometheusRule to 31 minutes. The node would then have to be unschedulable for 30 minutes to put the alert into a Pending state, and the alert would sit for another 31 minutes in Pending before moving to Firing, keeping the timing in line with what we have today.
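For concreteness, a minimal sketch of what that suggestion could look like in the PrometheusRule; this is illustrative rather than the actual rule in this repo, and it omits the existing label selectors and upgrade exclusions:

    # Illustrative fragment only; the real rule carries additional
    # selectors and exclusions.
    - alert: KubeNodeUnschedulableSRE
      # expr becomes true only once the node has averaged unschedulable
      # over the last 30 minutes...
      expr: avg_over_time(kube_node_spec_unschedulable[30m]) == 1
      # ...then the alert must sit Pending for another 31 minutes,
      # keeping total time-to-fire at roughly the original ~61 minutes.
      for: 31m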

iamkirkbater avatar Oct 25 '24 12:10 iamkirkbater

Thanks Kirk for the detailed explanation. I tested the change on a staging cluster and the alert fires after ~2 hrs, which does change the alert fire timing here. It makes sense to me to change the avg_over_time function's window to 30 mins and adjust the "for" part of the prom-rule to 31 minutes as per your suggestion.

vaidehi411 avatar Oct 28 '24 09:10 vaidehi411

@vaidehi411: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Oct 28 '24 11:10 openshift-ci[bot]

/lgtm

iamkirkbater avatar Oct 29 '24 12:10 iamkirkbater

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chamalabey, iamkirkbater, vaidehi411

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Oct 29 '24 12:10 openshift-ci[bot]