Overriding The PrometheusRule Objects Alerts

Open guyst16 opened this issue 2 years ago • 1 comments

1. Quick Debug Checklist

Run on OpenShift v4.10.16
GPU operator version v22.9.1

2. Issue or feature description

The GPU operator currently ships with two PrometheusRule objects: nvidia-gpu-operator-metrics and nvidia-node-status-exporter-alerts. Every alert in these PrometheusRule objects carries the label severity: warning. My Grafana dashboard shows only high or critical alerts, so I tried to raise the severity of these alerts. However, the operator's ClusterPolicy object manages the PrometheusRule objects, so my changes were reverted shortly after I applied them.

Is there a best practice for overriding or changing the labels of the PrometheusRule objects?

3. Steps to reproduce the issue

Run: oc edit prometheusrule nvidia-gpu-operator-metrics -n nvidia-gpu-operator
Change any severity: warning label to severity: high
Save & exit
Wait until the ClusterPolicy object restores the original configuration of the PrometheusRule

May 08 '23 08:05 guyst16

@guyst16 Currently we don't support changing these, but you can create custom rules based on the ones provided by the operator. We will also look into allowing this change in the operator.

May 18 '23 14:05 shivamerla
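
One possible way to follow the "custom rules" suggestion above is sketched below. It is not something the maintainers posted, only an illustration under stated assumptions: the function name make_custom_rule, the script name custom_rule.py, and the new object name are hypothetical. The script reads a PrometheusRule exported as JSON, drops the metadata that ties it to the operator, keeps only the alerting rules, and sets a new severity label on each of them. Because the result has a different name and no owner reference, the ClusterPolicy reconciliation should have no reason to revert it, though that is an assumption and has not been verified against the operator.

```python
import copy
import json
import sys

# Metadata set by the API server or the operator. Removing it lets the copy be
# created as a new, independent object (assumption: the operator only
# reconciles the objects it created itself).
SERVER_MANAGED_FIELDS = (
    "uid", "resourceVersion", "generation", "creationTimestamp",
    "managedFields", "ownerReferences",
)


def make_custom_rule(original, new_name, severity):
    """Return a renamed copy of a PrometheusRule dict containing only the
    alerting rules, each with its severity label set to `severity`."""
    rule = copy.deepcopy(original)  # never mutate the caller's manifest
    metadata = rule.setdefault("metadata", {})
    for field in SERVER_MANAGED_FIELDS:
        metadata.pop(field, None)
    metadata["name"] = new_name
    rule.pop("status", None)

    kept_groups = []
    for group in rule.get("spec", {}).get("groups", []):
        # Skip recording rules so the copy does not record the same series twice.
        alerts = [entry for entry in group.get("rules", []) if "alert" in entry]
        for entry in alerts:
            entry.setdefault("labels", {})["severity"] = severity
        if alerts:
            group["rules"] = alerts
            kept_groups.append(group)
    rule.setdefault("spec", {})["groups"] = kept_groups
    return rule


# Act as a stdin-to-stdout filter only when both arguments are supplied:
#   custom_rule.py <new-name> <severity>
if __name__ == "__main__" and len(sys.argv) == 3:
    manifest = json.load(sys.stdin)
    json.dump(make_custom_rule(manifest, sys.argv[1], sys.argv[2]), sys.stdout, indent=2)
```

Hypothetical usage: oc get prometheusrule nvidia-gpu-operator-metrics -n nvidia-gpu-operator -o json | python3 custom_rule.py nvidia-gpu-operator-metrics-custom critical | oc apply -f -

Since the operator's original rules stay in place, each alert would then fire twice, once with severity: warning and once with the new severity, so the dashboard or Alertmanager routing still has to filter on the severity label.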