Disk usage warning prevents new nodes from starting

Open ltalirz opened this issue 1 year ago • 3 comments

Version

v1.0.35

In what area(s)?

/area monitoring

Expected Behavior

When /anfhome becomes full, I would expect the admin of the cluster to be notified via email, and users to be notified on the command line when they submit new jobs.

Actual Behavior

When /anfhome is >90% full [1], a node health check fails and new jobs are stuck in "configure" stage forever. Users without access to the CycleCloud dashboard have no way of knowing why this is happening.

[message="ERROR : Node Health Checks failed - hcl-pg0-22 - BLZ221031015025 - ERROR:  nhc:  Health check failed:  check_fs_used:  /anfhome is 90% full (3822928128kB), threshold is 90%";priority="high";level="error"]
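For users or scripts that only see this log line, the relevant numbers can be pulled out mechanically. The following is a minimal Python sketch, assuming the message always has the shape shown above; the names `FsUsedFailure` and `parse_fs_used_failure` are hypothetical and not part of az-hop, CycleCloud, or NHC.

```python
import re
from typing import NamedTuple, Optional


class FsUsedFailure(NamedTuple):
    """Fields extracted from a check_fs_used failure message (hypothetical type)."""
    node: str
    mount: str
    used_percent: int
    used_kb: int
    threshold_percent: int


# Assumption: the layout matches the single message quoted in this issue.
_PATTERN = re.compile(
    r"Node Health Checks failed - (?P<node>\S+) - .*?"
    r"check_fs_used:\s+(?P<mount>\S+) is (?P<used>\d+)% full "
    r"\((?P<kb>\d+)kB\), threshold is (?P<threshold>\d+)%"
)

# The message text from this issue, without the surrounding [message="..."] wrapper.
SAMPLE = (
    "ERROR : Node Health Checks failed - hcl-pg0-22 - BLZ221031015025 - "
    "ERROR:  nhc:  Health check failed:  check_fs_used:  "
    "/anfhome is 90% full (3822928128kB), threshold is 90%"
)


def parse_fs_used_failure(message: str) -> Optional[FsUsedFailure]:
    """Return the parsed fields, or None if the message has a different shape."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return FsUsedFailure(
        node=match.group("node"),
        mount=match.group("mount"),
        used_percent=int(match.group("used")),
        used_kb=int(match.group("kb")),
        threshold_percent=int(match.group("threshold")),
    )


if __name__ == "__main__":
    print(parse_fs_used_failure(SAMPLE))
```

Note that in this message the reported usage equals the threshold (90% against 90%), which suggests the check already fails when the threshold is reached, not only when it is exceeded.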

Steps to Reproduce the Problem

Fill /anfhome to 90% and try submitting jobs

[1] By the way, it appears this value of 90% is hardcoded (?); at least, it does not reflect the values of alerting.local_volume_threshold: 80 or anf.alert_threshold: 80 from my config.yml file.

ltalirz avatar Oct 10 '23 10:10 ltalirz

See https://azure.github.io/az-hop/operate/alerting.html for how to configure monitoring and alerts. This is only available in the Terraform deployment; help is needed to port it to Bicep.

You need:

  • to enable a Log Analytics workspace or use an existing one
  • to enable alerting with alerting.enabled=true and set the admin_email

xpillons avatar Oct 10 '23 10:10 xpillons

Thanks, Xavier, for the pointers on how to enable alerts to the admin. Actually, both settings are enabled in my case and the Log Analytics workspace exists, but no alerts are set up.

Perhaps my deployment recipes are outdated and this is fixed in later versions; I currently can't touch the system.

Even if proper admin alerts were set up, I still wonder whether preventing new nodes from starting when the disk is >90% full is the right approach... I guess the idea is that you should never actually reach that point? If there were a way to forward this information to the user, that would be very helpful, since it can always happen that an admin cannot react for some time.
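To make the idea of a user-facing notice concrete, here is a minimal Python sketch of a check that could run at job submission time. It is an illustration only: `submission_warning` and `usage_percent` are hypothetical names, nothing like this exists in az-hop as far as this thread shows, and the 90% default simply mirrors the threshold in the health check message. The percentage is computed as used/total via `shutil.disk_usage`, which may differ slightly from the value the health check reports.

```python
import shutil
from typing import Optional, Tuple


def usage_percent(total: int, used: int) -> float:
    """Percentage of the filesystem in use (0.0 for an empty or unknown total)."""
    return 100.0 * used / total if total else 0.0


def submission_warning(
    mount: str,
    threshold_percent: float = 90.0,
    usage: Optional[Tuple[int, int]] = None,
) -> Optional[str]:
    """Return a warning string if `mount` is at or above the threshold, else None.

    `usage` is an optional (total, used) pair in bytes, so the logic can be
    exercised without a real mount; when omitted, the live filesystem is queried.
    """
    if usage is None:
        stats = shutil.disk_usage(mount)
        usage = (stats.total, stats.used)
    total, used = usage
    percent = usage_percent(total, used)
    if percent >= threshold_percent:
        return (
            f"WARNING: {mount} is {percent:.0f}% full "
            f"(threshold {threshold_percent:.0f}%). New nodes may fail their "
            "health check and jobs may not start. Please free up space or "
            "contact your cluster admin."
        )
    return None


if __name__ == "__main__":
    # Example with injected numbers: 90 of 100 bytes used triggers the warning.
    print(submission_warning("/anfhome", usage=(100, 90)))
```

A wrapper around the job submission command could print the returned string, so that users without access to the CycleCloud dashboard would at least see why their jobs are not starting.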

ltalirz avatar Oct 10 '23 10:10 ltalirz

The purpose of the alert is certainly to avoid reaching that point. Ideally, the alerts would go to an email alias rather than to a single admin. The Grafana dashboard also provides a way to monitor the disk space of mounts and infra VMs, but without alerts.

xpillons avatar Oct 11 '23 10:10 xpillons