smartctl_exporter
smartctl_exporter copied to clipboard
Add device type support
Allow passing a custom -d
/ --device=
flag to smartctl. The default is the same (auto
) as upstream smartctl.
Fixes: https://github.com/prometheus-community/smartctl_exporter/issues/26
Signed-off-by: SuperQ [email protected]
I think it can (and in some cases, must?) be passed multiple times. sometimes one per drive?
According to the docs, it can only set once per device. But, given that we loop over the devices, it does make sense that we allow setting it differently per device.
We could do something like parse the device like --smartctl.device=/dev/foo;type
. For example, --smartctl.device=/dev/sda;cciss,0
.
@SuperQ As already suggested by you, this change should be adapted so that it is possible to set the device type per --smartctl.device
parameter. This is for example to support the RAID controller case where only the logical device is visible to the OS and the actual physical devices behind this logical device can only be accessed via the smartctl --device=TYPE
parameter.
Example: smartctl
usage
smartctl --all --device=cciss,0 /dev/sda
smartctl --all --device=cciss,1 /dev/sda
The part where the result of smartctl --scan
is used should also be updated. For some RAID controllers this scanning also returns the device type which is available as the type
property in JSON. For these controllers specifying the devices manually then would not be required anymore.
Example: smartctl --scan
output
/dev/sda -d scsi
/dev/sdb -d scsi
/dev/bus/0 -d megaraid,0
/dev/bus/0 -d megaraid,1
/dev/bus/0 -d megaraid,2
/dev/bus/0 -d megaraid,3
@darxriggs thank you for good explanation!
Can you also record and paste (as .txt
attachment) all json outputs from smartctl
? For developer is very (almost impossible) to develop without test data
Example outputs from smartctl
NVMe
RAID
Looking at the currently generated metrics and this change, it becomes apparent that a third detail should be adapted too - the labels.
Example
# HELP smartctl_device_smart_status General smart status
# TYPE smartctl_device_smart_status gauge
smartctl_device_smart_status{device="sda"} 1
Now that device
is not unique anymore with this change (it will be possible to specify the same device but with different types) an additional label is required to identify the actual device. Either an equivalent of the type
or info_name
from the JSON output could be used.
Example
# HELP smartctl_device_smart_status General smart status
# TYPE smartctl_device_smart_status gauge
smartctl_device_smart_status{device="sda","type"="cciss,0"} 1
smartctl_device_smart_status{device="sda","type"="cciss,1"} 1
Is that really true? I thought device still needed to be something under /dev, so should be unique still?
@kfox1111 Let me refer you to the man page of smartctl's --device
option. It describes all the supported types and use cases in more detail.
A clear distinction must be made between the positional device
argument and the named --device
option (contained as type
property in the JSON) of smartctl and the device
label of smartctl_exporter.
Cool. thanks for the info. :)
Followup to @darxriggs ' notes above:
It is entirely possible and not that uncommon for a given system to present both LSI HBA virtual devices and passed-through individual drives.
Eg. the first two drives are mirrored with the HBA and need -d megaraid,0
, -d megaraid,1
, but the balance of drives are passed through and are accessed as /dev/sdXX
.
On at least some HBAs with some personality / mode settings, any physical drives that aren't part of a VD are passed through.
While I personally am not a fan of HBA RAID or RoC HBAs in general, we're still stuck with legacies, and some organizations embrace HBA RAID.
Ideally the exporter would iterate smartctl
invocations over the results from smartctl --scan
, invoking as necessary for each entry. Having to hardcode devices and access flags in a config file or on the commandline would be rather inconvenient and prone to errors -- that would require external scripting to discover devices and keep that mapping updated across inventory changes.
To the note about adding a type
label for, let's call them occulted devices
, I get where the suggested scheme is coming from. I don't think that's the ideal approach, though. type
as a label is IMHO a misnomer, this isn't a type but rather a subunit in the vein of a LUN. I'd rather see something like device=sda,0
or device=sda.0
or device=megaraid0.0
. It would be awkward to have to correlate labels for some but not all metrics, or to have to devices complex relabeling rules at Prometheus ingest.
Notes:
- Dell's PERC HBAs are rebadged LSI.
- A system may contain multiple HBAs of the same or different type, one might argue that metrics should reflect the HBA to which a drive is attached.
Also, to be clear, the positional /dev/xx
parameter to storcli
is not actually used, so it could be anything. Arguably then the device
label should be cciss,0
, cciss,1
, megaraid,4
, etc. I'm currently using a customized fork of someone's smartmon.sh
and jumping through hoops to make these labels useful -- and aligned with drives that are not handicapped by being under an RoC volume.
Can we figure out how to make this more scalable & managable?
Example: a 36-disk supermicro chassis or a 60-disk BackBlaze pod.
Concerns:
- naively, this would lead to a very long commandline!
- changing the parameters to update a list of drives means restarting the exporter; could this be avoided w/ a config file that hot-reloads?
- This would bringing back the config file that was removed in 0.8.0
- Are there any controllers where the identifier is not stable between boots (e.g. the drives are not numbered by slot, but rather by enumeration)
- Are there multipath SAS controllers&disks, where a single disk shows up twice the same system?
@SuperQ Do you mind if I make this a feature branch so others can work on it and iron out the details? I don't have any devices for testing.
Any updates?
At this point I'm just going to have my protege write a SMART collector from scratch in Python. This is a pretty glaring omission on the part of smartctl_exporter.
@anthonyeleven Could your protege update this pr instead? It seems like the pr is close, but the original developer ran out of time to finish it?
He's not a strong golang coder.
I made changes for fix this. https://github.com/prometheus-community/smartctl_exporter/pull/205
This PR was promising, it seemed like a simple solution to a big problem with cciss
type devices.
We've given up on smartctl_exporter for now and have instead been forking storcli.py