Add another check function: check_drive_health

Open inqrphl opened this issue 2 months ago • 0 comments

This check uses the smartctl executable installed on the system. smartctl can print its output as JSON.

Smartctl Outputs

I could not find an official schema for the JSON output, so I had to discover the structure and write the types myself. The most important output comes from the --xall parameter, which reports everything it can. The fields present depend on the disk type, model, vendor, etc., so I gathered outputs from different disks and wrote the types as best I could.

One important point is detecting whether a field is absent, which is used to tell whether a test is ongoing or done. This is done by declaring the field as a pointer to a type rather than as a value of that type. Otherwise Go zero-initializes every field in the struct, and the code cannot tell a real parsed value from a default zero value.
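A minimal sketch of the pointer-field idea. The struct and its field names (`passed`, `remaining_percent`) are a simplified illustration and an assumption here, not the types written for this check:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SelfTestStatus is a hypothetical, simplified slice of a smartctl JSON report.
// Pointer fields stay nil when the key is missing from the JSON, so a missing
// key can be told apart from a real zero value.
type SelfTestStatus struct {
	Value            int   `json:"value"`
	Passed           *bool `json:"passed,omitempty"`
	RemainingPercent *int  `json:"remaining_percent,omitempty"`
}

// parseStatus decodes one status object.
func parseStatus(raw []byte) (SelfTestStatus, error) {
	var s SelfTestStatus
	err := json.Unmarshal(raw, &s)
	return s, err
}

// describe interprets the status based on which optional fields are present.
func describe(s SelfTestStatus) string {
	switch {
	case s.RemainingPercent != nil:
		return fmt.Sprintf("running, %d%% remaining", *s.RemainingPercent)
	case s.Passed != nil && *s.Passed:
		return "finished: passed"
	case s.Passed != nil:
		return "finished: failed"
	default:
		return "unknown"
	}
}

func main() {
	running, _ := parseStatus([]byte(`{"value": 249, "remaining_percent": 90}`))
	done, _ := parseStatus([]byte(`{"value": 0, "passed": true}`))
	fmt.Println(describe(running))
	fmt.Println(describe(done))
}
```

With plain `bool` and `int` fields, a report that lacks `passed` would decode to `false` and look like a failed test.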

With the parsing working, the following smartctl functions were implemented:

  • SmartctlScanOpen uses the --scan-open argument to discover disks that smartctl can open
  • SmartctlXall uses the --xall argument to discover everything about the disk
  • SmartctlStartTest starts a SMART test. Different types of tests are available for each disk. Once a test is started, smartctl returns immediately, as longer tests can take hours
  • SmartctlTestAndAwaitCompletion starts a test and waits until it is complete, querying the stats with SmartctlXall periodically. The waiting logic depends on the type of the disk.
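A rough sketch of how such wrappers could invoke smartctl. The helper names, the partial `devices` struct and the choice to pass `--json` are assumptions for illustration and not the code in this issue. smartctl's exit status is a bit mask, so a non-zero exit can still come with usable output:

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"os/exec"
)

// scannedDevice is an assumed, partial shape of one entry in the
// "devices" array printed by `smartctl --scan-open --json`.
type scannedDevice struct {
	Name     string `json:"name"`
	Type     string `json:"type"`
	Protocol string `json:"protocol"`
}

type scanOutput struct {
	Devices []scannedDevice `json:"devices"`
}

// Argument builders (hypothetical helpers) for the three basic calls.
func scanOpenArgs() []string { return []string{"--scan-open", "--json"} }

func xallArgs(device string) []string { return []string{"--xall", "--json", device} }

func startTestArgs(device, testType string) []string {
	return []string{"--test=" + testType, device}
}

// runSmartctl executes smartctl and keeps the output even on a non-zero
// exit status, because smartctl encodes disk findings in its exit bits.
func runSmartctl(ctx context.Context, args ...string) ([]byte, error) {
	out, err := exec.CommandContext(ctx, "smartctl", args...).Output()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && len(out) > 0 {
		return out, nil
	}
	return out, err
}

// parseScan decodes the device list from scan output.
func parseScan(raw []byte) ([]scannedDevice, error) {
	var s scanOutput
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, fmt.Errorf("parsing scan output: %w", err)
	}
	return s.Devices, nil
}

func main() {
	// Sample input instead of a live call, so the sketch runs without smartctl.
	sample := []byte(`{"devices":[{"name":"/dev/nvme0","type":"nvme","protocol":"NVMe"}]}`)
	devices, err := parseScan(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, d := range devices {
		fmt.Println(d.Name, xallArgs(d.Name), startTestArgs(d.Name, "short"))
	}
}
```

On a real system, `runSmartctl(ctx, scanOpenArgs()...)` would supply the bytes that `parseScan` consumes.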

Check logic

check_drive_health runs one test type: offline, short, long, conveyance or selective_test. An offline test only updates the SMART attributes and similar data. The tests are started and awaited in a separate goroutine for each disk. Afterwards, the results are collected and interpreted. This logic worked when tested on an NVMe and an ATA drive.

There are two important values: the result of the latest test and the SMART health status. The latest test result is extracted from the JSON depending on the disk type. The SMART health status is determined by the drive firmware, which compares the SMART attributes and flags the drive if any of them is inside failure bounds.

These two values are checked, and depending on whether they pass or fail, the severity is set to critical or warning.
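As an illustration only, the decision could be sketched as below. Which failure maps to critical and which to warning is an assumption made for this example; the issue does not spell out the mapping:

```go
package main

import "fmt"

// severity is a hypothetical result level for the check.
type severity int

const (
	ok severity = iota
	warning
	critical
)

// String returns a readable name for the severity level.
func (s severity) String() string {
	switch s {
	case ok:
		return "OK"
	case warning:
		return "WARNING"
	default:
		return "CRITICAL"
	}
}

// evaluate combines the two values. ASSUMPTION: a failed SMART health
// status is treated as critical, a failed latest test alone as warning.
func evaluate(healthPassed, lastTestPassed bool) severity {
	switch {
	case !healthPassed:
		return critical
	case !lastTestPassed:
		return warning
	default:
		return ok
	}
}

func main() {
	fmt.Println(evaluate(true, true))
	fmt.Println(evaluate(true, false))
	fmt.Println(evaluate(false, true))
}
```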

TODO

  • [] Add the test results and health results as performance metrics. I could not do it yet
  • [] Add documentation file
  • [] Test the same check on a SCSI drive
  • [] Perform the check on multiple drives at once

inqrphl avatar Nov 12 '25 14:11 inqrphl