gpu-operator

Overriding the PrometheusRule Objects' Alerts

Open guyst16 opened this issue 2 years ago • 1 comments

1. Quick Debug Checklist

  • Run on OpenShift v4.10.16
  • GPU operator version v22.9.1

2. Issue or feature description

The GPU operator currently ships with two PrometheusRule objects: nvidia-gpu-operator-metrics and nvidia-node-status-exporter-alerts. Every alert in these PrometheusRule objects carries the label severity: warning. My Grafana dashboard shows only high or critical alerts, so I tried to raise the severity of the alerts. However, the operator's ClusterPolicy object manages the PrometheusRule objects, and after I applied my changes they were reverted.

Is there any best practice for overriding/changing the PrometheusRule objects' labels?

3. Steps to reproduce the issue

  1. Run: oc edit prometheusrule nvidia-gpu-operator-metrics -n nvidia-gpu-operator
  2. Change any severity: warning label to severity: high
  3. Save & exit
  4. Wait until the ClusterPolicy object restores the original configuration of the PrometheusRule

guyst16 avatar May 08 '23 08:05 guyst16

@guyst16 currently we don't support changing these, but you can create custom rules based on the ones provided by the operator. We will also look into allowing this change in the operator.

shivamerla avatar May 18 '23 14:05 shivamerla
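
For anyone trying the "custom rules based on the ones provided by the operator" route, here is a minimal sketch of one way to derive such a copy. It is not an official workflow. It assumes the rule has been exported as JSON and that the copy is created under a new name, so the operator has no reason to reconcile it. The function name derive_custom_rule, the object name custom-gpu-alerts, the sample alert HypotheticalGpuAlert, and its expression are all hypothetical and only illustrate the shape of the data.

```python
import copy


def derive_custom_rule(source, new_name, severity, from_severity="warning"):
    """Return a standalone copy of a PrometheusRule dict with remapped severity.

    The source dict is left unmodified. Only alerting rules whose severity
    label equals `from_severity` are changed; recording rules are kept as-is.
    """
    custom = copy.deepcopy(source)
    meta = source.get("metadata", {})

    # Rebuild metadata from scratch so that ownerReferences, uid,
    # resourceVersion and similar server-managed fields are not carried over.
    new_meta = {"name": new_name}
    if "namespace" in meta:
        new_meta["namespace"] = meta["namespace"]
    if "labels" in meta:
        new_meta["labels"] = dict(meta["labels"])
    custom["metadata"] = new_meta
    custom.pop("status", None)

    for group in custom.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rule, nothing to remap
            labels = rule.setdefault("labels", {})
            if labels.get("severity") == from_severity:
                labels["severity"] = severity
    return custom


# Hypothetical stand-in for an exported rule; the real alert names and
# expressions come from the operator's own PrometheusRule objects.
SAMPLE = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "PrometheusRule",
    "metadata": {
        "name": "nvidia-gpu-operator-metrics",
        "namespace": "nvidia-gpu-operator",
        "ownerReferences": [{"kind": "ClusterPolicy", "name": "placeholder"}],
        "resourceVersion": "12345",
    },
    "spec": {
        "groups": [
            {
                "name": "hypothetical-group",
                "rules": [
                    {
                        "alert": "HypotheticalGpuAlert",
                        "expr": "hypothetical_metric > 0",
                        "labels": {"severity": "warning"},
                    },
                    {"record": "hypothetical:recorded", "expr": "vector(1)"},
                ],
            }
        ]
    },
}

custom = derive_custom_rule(SAMPLE, "custom-gpu-alerts", "critical")
```

To use it against a real cluster, the output of oc get prometheusrule nvidia-gpu-operator-metrics -n nvidia-gpu-operator -o json could be loaded with json.load, passed through the function, and the result written out with json.dump and applied with oc apply -f. Note that the operator's original warning-level alerts keep firing next to the copies, so the dashboard or Alertmanager routing may still need to filter them out.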