smartctl_exporter Add device type support

Allow passing a custom -d / --device= flag to smartctl. The default is the same (auto) as upstream smartctl.

Fixes: https://github.com/prometheus-community/smartctl_exporter/issues/26

Jan 10 '23 19:01 SuperQ

I think it can (and in some cases, must?) be passed multiple times. sometimes one per drive?

Jan 10 '23 20:01 kfox1111

According to the docs, it can only set once per device. But, given that we loop over the devices, it does make sense that we allow setting it differently per device.

We could do something like parse the device like --smartctl.device=/dev/foo;type. For example, --smartctl.device=/dev/sda;cciss,0.

Jan 10 '23 21:01 SuperQ

@SuperQ As already suggested by you, this change should be adapted so that it is possible to set the device type per --smartctl.device parameter. This is for example to support the RAID controller case where only the logical device is visible to the OS and the actual physical devices behind this logical device can only be accessed via the smartctl --device=TYPE parameter.

Example: smartctl usage

smartctl --all --device=cciss,0 /dev/sda
smartctl --all --device=cciss,1 /dev/sda

The part where the result of smartctl --scan is used should also be updated. For some RAID controllers this scanning also returns the device type which is available as the type property in JSON. For these controllers specifying the devices manually then would not be required anymore.

Example: smartctl --scan output

/dev/sda -d scsi 
/dev/sdb -d scsi 
/dev/bus/0 -d megaraid,0 
/dev/bus/0 -d megaraid,1 
/dev/bus/0 -d megaraid,2 
/dev/bus/0 -d megaraid,3

Jan 26 '23 20:01 darxriggs

@darxriggs thank you for good explanation! Can you also record and paste (as .txt attachment) all json outputs from smartctl? For developer is very (almost impossible) to develop without test data

Jan 27 '23 02:01 k0ste

Example outputs from smartctl

NVMe

RAID

Jan 27 '23 12:01 darxriggs

Looking at the currently generated metrics and this change, it becomes apparent that a third detail should be adapted too - the labels.

Example

# HELP smartctl_device_smart_status General smart status
# TYPE smartctl_device_smart_status gauge
smartctl_device_smart_status{device="sda"} 1

Now that device is not unique anymore with this change (it will be possible to specify the same device but with different types) an additional label is required to identify the actual device. Either an equivalent of the type or info_name from the JSON output could be used.

Example

# HELP smartctl_device_smart_status General smart status
# TYPE smartctl_device_smart_status gauge
smartctl_device_smart_status{device="sda","type"="cciss,0"} 1
smartctl_device_smart_status{device="sda","type"="cciss,1"} 1

Jan 27 '23 13:01 darxriggs

Is that really true? I thought device still needed to be something under /dev, so should be unique still?

Jan 27 '23 17:01 kfox1111

@kfox1111 Let me refer you to the man page of smartctl's --device option. It describes all the supported types and use cases in more detail.

A clear distinction must be made between the positional device argument and the named --device option (contained as type property in the JSON) of smartctl and the device label of smartctl_exporter.

Jan 27 '23 18:01 darxriggs

Cool. thanks for the info. :)

Jan 27 '23 18:01 kfox1111

Followup to @darxriggs ' notes above:

It is entirely possible and not that uncommon for a given system to present both LSI HBA virtual devices and passed-through individual drives.

Eg. the first two drives are mirrored with the HBA and need -d megaraid,0, -d megaraid,1, but the balance of drives are passed through and are accessed as /dev/sdXX.

On at least some HBAs with some personality / mode settings, any physical drives that aren't part of a VD are passed through.

While I personally am not a fan of HBA RAID or RoC HBAs in general, we're still stuck with legacies, and some organizations embrace HBA RAID.

Ideally the exporter would iterate smartctl invocations over the results from smartctl --scan, invoking as necessary for each entry. Having to hardcode devices and access flags in a config file or on the commandline would be rather inconvenient and prone to errors -- that would require external scripting to discover devices and keep that mapping updated across inventory changes.

To the note about adding a type label for, let's call them occulted devices, I get where the suggested scheme is coming from. I don't think that's the ideal approach, though. type as a label is IMHO a misnomer, this isn't a type but rather a subunit in the vein of a LUN. I'd rather see something like device=sda,0 or device=sda.0 or device=megaraid0.0. It would be awkward to have to correlate labels for some but not all metrics, or to have to devices complex relabeling rules at Prometheus ingest.

Notes:

Dell's PERC HBAs are rebadged LSI.
A system may contain multiple HBAs of the same or different type, one might argue that metrics should reflect the HBA to which a drive is attached.

Feb 27 '23 18:02 anthonyeleven

Also, to be clear, the positional /dev/xx parameter to storcli is not actually used, so it could be anything. Arguably then the device label should be cciss,0, cciss,1, megaraid,4, etc. I'm currently using a customized fork of someone's smartmon.sh and jumping through hoops to make these labels useful -- and aligned with drives that are not handicapped by being under an RoC volume.

Jul 13 '23 18:07 anthonyeleven

Can we figure out how to make this more scalable & managable?

Example: a 36-disk supermicro chassis or a 60-disk BackBlaze pod.

Concerns:

naively, this would lead to a very long commandline!
changing the parameters to update a list of drives means restarting the exporter; could this be avoided w/ a config file that hot-reloads?
This would bringing back the config file that was removed in 0.8.0
Are there any controllers where the identifier is not stable between boots (e.g. the drives are not numbered by slot, but rather by enumeration)
Are there multipath SAS controllers&disks, where a single disk shows up twice the same system?

Oct 16 '23 18:10 robbat2

@SuperQ Do you mind if I make this a feature branch so others can work on it and iron out the details? I don't have any devices for testing.

Oct 25 '23 08:10 NiceGuyIT

Any updates?

Feb 28 '24 14:02 zxzharmlesszxz

At this point I'm just going to have my protege write a SMART collector from scratch in Python. This is a pretty glaring omission on the part of smartctl_exporter.

Feb 28 '24 14:02 anthonyeleven

@anthonyeleven Could your protege update this pr instead? It seems like the pr is close, but the original developer ran out of time to finish it?

Feb 28 '24 15:02 kfox1111

He's not a strong golang coder.

Feb 28 '24 22:02 anthonyeleven

I made changes for fix this. https://github.com/prometheus-community/smartctl_exporter/pull/205

Mar 01 '24 10:03 zxzharmlesszxz

This PR was promising, it seemed like a simple solution to a big problem with cciss type devices.

May 28 '24 09:05 jakubgs

We've given up on smartctl_exporter for now and have instead been forking storcli.py

May 28 '24 13:05 anthonyeleven

smartctl_exporter smartctl_exporter copied to clipboard

Add device type support

smartctl_exporter
smartctl_exporter copied to clipboard