Harvest Health != ONTAP GUI
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please let us know in a comment
Problem
Hi,
A couple of years ago I opened a GitHub issue because the ONTAP Web UI showed alerts that Harvest did not. I was told that a new collector would be introduced to collect this data (I think it's called EMS).
We've had the EMS collector enabled for a while now, and we just noticed that the ONTAP Web UI shows an alert that does not appear in the Grafana Health dashboard.
Our collector:
agora:
  datacenter: EQX
  addr: agora-cluster.deutsche-boerse.de
  auth_style: basic_auth
  username: $__env{NETAPP_HARVEST_READONLY_USERNAME}
  password: $__env{NETAPP_HARVEST_READONLY_PASSWORD}
  use_insecure_tls: true
  exporters:
    - agora
  collectors:
    - Rest
    - RestPerf
    - Ems
ONTAP UI:
Grafana Dashboard:
The Grafana panel popup reads:
"The EMS collector gathers EMS events as defined in your ems.yml file. This panel displays events with emergency severity that occurred within the selected time range."
As I understand it, "recreating" the ONTAP Web UI alerting would require the user to add 1400+ definitions to ems.yml?
Essentially, what we are trying to achieve is to use Harvest as the ONLY source of metrics and alerts. However, the suggested approach is maintenance overkill. We simply want to be alerted when ONTAP has an error without having to look at the Web UI.
We don't need to see the event description as in the ONTAP Web UI, but we do need to be made aware that there is an alert (i.e., the Grafana dashboard should not show "0" issues).
Configuration
No response
Poller
agora poller
Version
latest
Poller logs
No response
OS and platform
docker
ONTAP or StorageGRID version
Netapp 9.13.1P8
Additional Context
No response
References
No response
@db-wally007 That's correct. The Emergency panel in the Grafana Health dashboard only displays EMS events with a severity of "emergency" if they are defined in the ems.yaml file. The idea behind the EMS collector was to list only the relevant EMS events in ems.yaml, to avoid the spam that would come from collecting every event of a given severity. I noticed that System Manager (SM) shows EMS events with severities of "emergency," "alert," and "error."
Currently, the only option is to add those EMS to the ems.yaml file. We will review this approach and update you.
@db-wally007 SM displays emergency events in the header and shows all alert, error, and emergency events in the table within the UI. As mentioned earlier, we don't intend to collect all events since some may be just noise. The idea is to selectively pick and choose events as needed. If we focus on emergency events, there are approximately 300 emergency events in ONTAP. Therefore, the suggestion is to list these specific events as needed in the ems.yaml file.
This makes no sense to me. Aren't all "emergency" events needed? Essentially, this makes alerting in Harvest a guess at best. Maybe I documented all the alerts that might hit the storage appliance, maybe I didn't.
At the very least, Harvest should collect the number of alerts. A dashboard that shows "0" alerts while storage is degraded is definitely not something anyone should rely on.
@db-wally007 We'll get back to you on this. Thanks.
@db-wally007 Sorry for the delayed response. How about we implement the following approach:
We will introduce a new metric, health_ems_alerts, which will contain information about emergency events from the last 24 hours, similar to what System Manager shows but independent of the EMS collector. The Health dashboard will then consume this new metric instead of consuming metrics directly from the EMS collector.
Thank you for coming back.
Sounds great. However, would it be possible to have three metrics, or one metric with a severity label that takes three values?
health_ems_alerts_[emergency_alert_error]
or
health_ems_alerts{severity="emergency|alert|error"}
@db-wally007 Yes, health_ems_alerts{severity="emergency|alert|error"} is how we will publish these alerts. By default, we will only collect emergency events from the last 24 hours. You can customize the collection for other severities through template changes.
@db-wally007 To utilize these metrics, you can upgrade to the nightly build. The health_ems_alerts metric is published by Harvest when the Rest Collector is enabled and is displayed in the Health dashboard. Below are some screenshots from the Health dashboard.
Please note that the structure of these metrics is such that the value represents the number of EMS messages. Additional parameters about these EMS messages are not published to avoid cardinality issues in Prometheus time series. By default, emergency EMS messages are collected for the last 24 hours. However, you can enable more by extending the template here.
health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="AccessCache.ReachedLimits", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 1
health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="callhome.chassis.overtemp", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 1
health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="callhome.client.app.emerg", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 1
health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="flexcache.cacheDisconnected", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 8
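Since the value of each series is the number of EMS messages, the total alert count per severity is just the sum over the matching series. Here is a minimal Python sketch of that arithmetic, assuming the rows are pasted in the `labels | value` form shown above. The helper names `parse_sample` and `totals_by` are hypothetical and not part of Harvest; in practice you would more likely aggregate in Prometheus or Grafana.

```python
import re
from collections import Counter

# Sample rows copied from this thread, in the "labels | value" form shown above.
SAMPLES = """\
health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="AccessCache.ReachedLimits", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 1
health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="callhome.chassis.overtemp", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 1
health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="callhome.client.app.emerg", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 1
health_ems_alerts{cluster="umeng-aff300-01-02", datacenter="REST", instance="dc1:12994", job="harvest", message="flexcache.cacheDisconnected", node="umeng-aff300-01", severity="emergency", source="notifyd"} | 8
"""

# Matches key="value" pairs inside the braces.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one 'metric{labels} | value' row into (labels dict, float value)."""
    series, _, value = line.rpartition("|")
    return dict(LABEL_RE.findall(series)), float(value)


def totals_by(samples, label):
    """Sum the sample values, grouped by the given label name."""
    totals = Counter()
    for line in samples.strip().splitlines():
        labels, value = parse_sample(line)
        totals[labels[label]] += value
    return totals


by_severity = totals_by(SAMPLES, "severity")
by_message = totals_by(SAMPLES, "message")
print(dict(by_severity))  # {'emergency': 11.0}
```

For the four rows above this gives 11 emergency messages in total, 8 of which are flexcache.cacheDisconnected. A dashboard built on this metric would therefore show a non-zero count even though none of these events is listed in ems.yaml.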
Verified in 24.11 ddb97c57