
Report broken disks back to farmers

Open maximevanhees opened this issue 5 years ago • 3 comments

This issue is a follow-up to the cache issue (Cache issue #565).

When the node is not able to mount /var/cache on an SSD or HDD present on the node, the /var/cache subvolume is mounted as a tmpfs in memory. There needs to be a way to report back a broken SSD and/or HDD to the farmer (who owns the node).
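For illustration only, the fallback described above could be detected by checking which filesystem type backs /var/cache. This is a minimal sketch, not how zos does it: it assumes a Linux `/proc/mounts` table, and the function names `fstype_of` and `cache_on_tmpfs` are hypothetical.

```python
from typing import Optional


def fstype_of(mounts_text: str, mountpoint: str = "/var/cache") -> Optional[str]:
    """Return the filesystem type of the last mount at `mountpoint`, or None."""
    fstype = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[1] == mountpoint:
            fstype = fields[2]  # a later entry shadows an earlier one
    return fstype


def cache_on_tmpfs(mounts_text: str, mountpoint: str = "/var/cache") -> bool:
    """True when the cache mount point is backed by memory instead of a disk."""
    return fstype_of(mounts_text, mountpoint) == "tmpfs"


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        if cache_on_tmpfs(f.read()):
            # Hypothetical log message; the real wording is not given in this issue.
            print("warning: /var/cache is on tmpfs, no usable SSD/HDD found")
```

A log line like the one printed here is the kind of message an alert could match on.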

Grafana has a built-in feature for creating an alert (which can send an email) based on a specific log message from the node. The email will alert the farmer that a node has problems with its HRU/SRU.

As @muhamadazmy pointed out, a report should probably be sent only when the HRU or SRU is not in a "healthy" state. When the MRU or CRU fails, the machine will (probably) not boot and thus not connect to the BCDB, so no logs will be sent to Grafana.

maximevanhees Mar 10 '20 14:03

Update: as we use a custom Grafana plug-in for the logs, we can't add alerts. I'll see if it's possible to do it with Loki.

maximevanhees Mar 10 '20 14:03

We should create a generic error model that can be inserted into BCDB. The threebot can then fetch these logs and let the farmer know about them.
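As a rough sketch of what such a generic error model might look like (the BCDB schema is not specified in this thread, so every field name and the `NodeError` class itself are assumptions):

```python
import json
import time
from dataclasses import asdict, dataclass, field

# Per the discussion above, only disk-related resource units are worth
# reporting; a node with a failed MRU or CRU probably never boots.
REPORTABLE = {"HRU", "SRU"}


@dataclass
class NodeError:
    """Hypothetical error record; not an actual BCDB type."""
    node_id: str
    resource: str  # e.g. "HRU" or "SRU"
    message: str
    timestamp: int = field(default_factory=lambda: int(time.time()))

    def is_reportable(self) -> bool:
        return self.resource in REPORTABLE

    def to_json(self) -> str:
        # Serialize to a plain JSON document that a store could accept.
        return json.dumps(asdict(self), sort_keys=True)
```

A consumer such as the threebot could then decode these records and notify the farmer for each one where `is_reportable()` is true.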

DylanVerstraete Mar 16 '20 14:03

We could integrate this properly in grid 3.0

DylanVerstraete Mar 04 '21 15:03