az-hop
Disk usage warning prevents new nodes from starting
Version
v1.0.35
In what area(s)?
/area monitoring
Expected Behavior
When /anfhome becomes full, I would expect the admin of the cluster to be notified via email, and users to be notified on the command line when they submit new jobs.
Actual Behavior
When /anfhome reaches 90% full [1], a node health check fails and new jobs are stuck in the "configure" stage forever. Users without access to the CycleCloud dashboard have no way of knowing why this is happening.
[message="ERROR : Node Health Checks failed - hcl-pg0-22 - BLZ221031015025 - ERROR: nhc: Health check failed: check_fs_used: /anfhome is 90% full (3822928128kB), threshold is 90%";priority="high";level="error"]
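To forward this information to users, the failure text first has to be turned into structured data. The following is a minimal sketch that extracts the mount point, usage, and threshold from the message above. The regular expression is inferred from this single sample, not from NHC documentation, and the function name `parse_fs_used_failure` and the field names are hypothetical.

```python
import re

# Pattern inferred from the one check_fs_used failure quoted in this issue.
# Assumption: other NHC versions or checks may format the text differently.
FS_USED_RE = re.compile(
    r"check_fs_used:\s+(?P<mount>\S+) is (?P<used>\d+)% full "
    r"\((?P<reported_kb>\d+)kB\), threshold is (?P<threshold>\d+)%"
)


def parse_fs_used_failure(message):
    """Return the parsed fields of a check_fs_used failure, or None if no match."""
    match = FS_USED_RE.search(message)
    if match is None:
        return None
    return {
        "mount": match.group("mount"),
        "used_percent": int(match.group("used")),
        "reported_kb": int(match.group("reported_kb")),
        "threshold_percent": int(match.group("threshold")),
    }


# The message text exactly as reported above.
sample = (
    "ERROR : Node Health Checks failed - hcl-pg0-22 - BLZ221031015025 - "
    "ERROR: nhc: Health check failed: check_fs_used: /anfhome is 90% full "
    "(3822928128kB), threshold is 90%"
)

info = parse_fs_used_failure(sample)
if info is not None:
    print(
        f"{info['mount']} is at {info['used_percent']}% "
        f"(threshold {info['threshold_percent']}%)"
    )
```

Note that the sample fails at exactly 90% against a 90% threshold, which suggests the check trips when usage is at or above the threshold rather than strictly above it.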
Steps to Reproduce the Problem
Fill /anfhome to 90% and try submitting jobs
[1] By the way, it appears this value of 90% is hardcoded (?); at least it does not reflect the value of alerting.local_volume_threshold: 80 or anf.alert_threshold: 80 from my config.yml file.
Please see how to configure monitoring and alerts: https://azure.github.io/az-hop/operate/alerting.html. This is only available in the Terraform deployment; help is needed to port it to Bicep.
You need to:
- enable a Log Analytics workspace or use an existing one
- enable alerting with alerting.enabled=true and set the admin_email
Thanks Xavier for the pointers on how to enable alerts to the admin. Actually, both settings are enabled in my case and the Log Analytics workspace exists, but no alerts are set up.
Perhaps my deployment recipes are outdated and this is fixed in later versions; I currently can't touch the system.
Even if proper admin alerts were set up, I still wonder whether preventing new nodes from starting when the disk is >90% full is the right approach... I guess the idea is that you should never actually reach that point? If there was a way to forward this information to the user, that would be very helpful; it can always happen that an admin cannot react for some time.
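One way to surface this to users, sketched here only as an illustration, would be a small check that prints a warning on the command line when the shared filesystem is at or above the threshold. This is not an existing az-hop feature. The function names are hypothetical, the 90% default is taken from the health check message in this issue, and where the check would be hooked in (a login profile script, a job submission wrapper) is left open.

```python
import shutil
import sys


def usage_percent(path):
    """Return the used share, in percent, of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_warning(path, percent_used, threshold=90.0):
    """Return a warning string when usage is at or above `threshold`, else None."""
    if percent_used >= threshold:
        return (
            f"WARNING: {path} is {percent_used:.0f}% full "
            f"(threshold {threshold:.0f}%). New nodes may fail their health "
            "check and jobs may not start until space is freed."
        )
    return None


if __name__ == "__main__":
    mount = "/anfhome"  # mount point from this issue; adjust as needed
    try:
        message = disk_warning(mount, usage_percent(mount))
    except OSError:
        # The mount is missing or unreadable; stay silent rather than block the user.
        message = None
    if message:
        print(message, file=sys.stderr)
```

The percentage computed here may differ by a point or so from what `df` or the health check reports, since tools account for reserved blocks differently, so a slightly lower threshold for the user-facing warning may be sensible.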
The purpose of the alert is to not reach that point, for sure. The best would be to send it to an email alias instead of a single admin. The Grafana dashboard also provides a way to monitor the disk space of mounts and infra VMs, but without alerts.