zos
Report broken disks back to farmers
This issue is made as a follow-up from the cache issue (Cache issue #565).
When the node is not able to mount /var/cache on an SSD or HDD present on the node, the /var/cache subvolume is mounted as a tmpfs in memory. There needs to be a way to report back a broken SSD and/or HDD to the farmer (who owns the node).
Grafana has a built-in alerting feature that can send an email when a specific log message arrives from the node. The email would tell the farmer that the node has a problem with its HRU/SRU.
As @muhamadazmy pointed out, a report is probably only needed when the HRU or SRU is not in a "healthy" state. When MRU or CRU fails, the machine will (probably) not boot and thus not connect to the BCDB, so no logs will be sent to Grafana.
Update: as we use a custom Grafana plug-in for the logs, we can't add alerts. I'll see if it's possible to do it with Loki.
We should create a generic error model that can be inserted into BCDB, so that the threebot can fetch these errors and let the farmer know about them.
We could integrate this properly in grid 3.0