Harvest Health != ONTAP GUI

Open db-wally007 opened this issue 1 year ago • 7 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please let us know in a comment

Problem

Hi,

A couple of years ago I opened a GitHub issue reporting that the ONTAP Web UI showed alerts that Harvest did not. I was told a new collector would be introduced to collect that data (I think it's called EMS).

We've had the EMS collector enabled for a while now, and we just noticed that the ONTAP Web UI shows an alert that does not appear in the Grafana Health dashboard.

Our collector:

  agora:
    datacenter: EQX
    addr: agora-cluster.deutsche-boerse.de
    auth_style: basic_auth
    username: $__env{NETAPP_HARVEST_READONLY_USERNAME}
    password: $__env{NETAPP_HARVEST_READONLY_PASSWORD}
    use_insecure_tls: true
    exporters:
      - agora
    collectors:
      - Rest
      - RestPerf
      - Ems

ONTAP UI:

image

image

Grafana Dashboard:

image

The Grafana panel popup reads:

"The EMS collector gathers EMS events as defined in your ems.yml file. This panel displays events with emergency severity that occurred within the selected time range."

The way I understand it, "recreating" ONTAP Web UI alerting would require the user to recreate 1400+ definitions in ems.yml?

Essentially, what we are trying to achieve is to use Harvest as the ONLY source of metrics and alerts. However, the suggested approach is maintenance overkill. We simply want to be alerted when ONTAP has an error without having to look at the Web UI.

We don't need to see the description of the event as in the ONTAP Web UI, but we need to be made aware that there is an alert (i.e. the Grafana dashboard should not show "0" issues).

Configuration

No response

Poller

agora poller

Version

latest

Poller logs

No response

OS and platform

docker

ONTAP or StorageGRID version

Netapp 9.13.1P8

Additional Context

No response

References

No response

db-wally007 avatar Sep 13 '24 06:09 db-wally007

@db-wally007 That's correct. The Emergency panel in the Grafana Health dashboard only displays EMS events with a severity of "emergency" if they are defined in the ems.yaml file. The idea behind the EMS collector was to list only the relevant EMS events in ems.yaml, to avoid the spam that would come from collecting every event by severity. I noticed that System Manager (SM) shows EMS events with severities of "emergency," "alert," and "error."

Currently, the only option is to add those EMS to the ems.yaml file. We will review this approach and update you.

rahulguptajss avatar Sep 13 '24 07:09 rahulguptajss

@db-wally007 SM displays emergency events in the header and shows all alert, error, and emergency events in the table within the UI. As mentioned earlier, we don't intend to collect all events since some may be just noise. The idea is to selectively pick and choose events as needed. If we focus on emergency events, there are approximately 300 emergency events in ONTAP. Therefore, the suggestion is to list these specific events as needed in the ems.yaml file.

rahulguptajss avatar Sep 25 '24 07:09 rahulguptajss

This makes no sense to me. Aren't all "emergency" events needed? Essentially, this makes alerting in Harvest a guess at best. Maybe I documented all the alerts that might hit the storage appliance, maybe I didn't.

At the very least, Harvest should collect the number of alerts. Showing "0" alerts in the dashboard while storage is degraded is definitely not something anyone should rely on.

db-wally007 avatar Sep 25 '24 08:09 db-wally007

@db-wally007 We'll get back to you on this. Thanks.

rahulguptajss avatar Sep 26 '24 05:09 rahulguptajss

@db-wally007 Sorry for the delayed response. How about we implement the following approach:

We will introduce a new metric, health_ems_alerts, which will contain information about emergency events from the last 24 hours, similar to the system manager's metrics but independent of the EMS collector. The health dashboard will then consume this new metric instead of consuming metrics directly from the EMS collector.

rahulguptajss avatar Oct 23 '24 14:10 rahulguptajss

Thank you for coming back.

Sounds great. However, would it be possible to have 3 metrics, or 1 metric with 3 different label values based on severity?

health_ems_alerts_[emergency_alert_error]

or

health_ems_alerts{severity="emergency|alert|error"}

db-wally007 avatar Oct 25 '24 20:10 db-wally007

@db-wally007 Yes, health_ems_alerts{severity="emergency|alert|error"} is how we will publish these alerts. By default, we will only collect emergency events from the last 24 hours. You can customize the collection for other severities through template changes.

rahulguptajss avatar Oct 29 '24 05:10 rahulguptajss

@db-wally007 To utilize these metrics, you can upgrade to the nightly build. The health_ems_alerts metric is published by Harvest when the Rest Collector is enabled and is displayed in the Health dashboard. Below are some screenshots from the Health dashboard.

Please note that the value of each of these metrics is the number of matching EMS messages. Additional parameters about these EMS messages are not published, to avoid cardinality issues in Prometheus time series. By default, only emergency EMS messages from the last 24 hours are collected. However, you can enable more severities by extending the template here.

health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="AccessCache.ReachedLimits", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 1
health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="callhome.chassis.overtemp", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 1
health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="callhome.client.app.emerg", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 1
health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="flexcache.cacheDisconnected", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 8
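
For anyone who wants to be notified without watching the dashboard, series like the ones above could feed a Prometheus alerting rule. The following is only a sketch under assumptions: the group name and alert name are made up for illustration, and the `alert` and `error` severities will only match if the template has been extended to collect them. Note that `=~` is PromQL's regex matcher; a plain `=` would compare against the literal string `emergency|alert|error`.

```yaml
# Hypothetical Prometheus alerting rule; group and alert names are placeholders.
groups:
  - name: harvest-health-ems
    rules:
      - alert: OntapEmsEventsPresent
        # Sum the per-message counts so that one alert fires per cluster and severity.
        # Severities other than "emergency" exist only if the template collects them.
        expr: sum by (cluster, severity) (health_ems_alerts{severity=~"emergency|alert|error"}) > 0
        annotations:
          summary: "{{ $value }} {{ $labels.severity }} EMS event(s) on {{ $labels.cluster }}"
```

Because the metric counts events from the last 24 hours, a rule like this would be expected to keep firing until the event ages out of that window.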
image image

rahulguptajss avatar Oct 30 '24 15:10 rahulguptajss

Verified in 24.11 ddb97c57

cgrinds avatar Nov 04 '24 16:11 cgrinds