smartctl_exporter
Failed to collect RAID controller device S.M.A.R.T. data
I tried to collect data from a server with a RAID controller through smartctl_exporter.
However, an error occurred, as shown below.
How can I collect S.M.A.R.T. data from RAID controller devices?
Yes, this is the real reason why you need such a service in the first place - to monitor devices that are not easily visible inside the operating system.
If the device type could be retrieved and passed into the readSMARTctl function, it could then be used with the --device flag, which would be a safer way to scan all device types. E.g. as below:
smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --device megaraid,0 /dev/bus/1
@marpears I can read the device info with smartctl when including the --device option, but NOT with smartctl_exporter...
$ smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --device cciss,1 /dev/sdb
{
"json_format_version": [
1,
0
],
"smartctl": {
"version": [
7,
3
],
"svn_revision": "5338",
"platform_info": "x86_64-linux-3.10.0-957.27.2.el7.x86_64",
"build_info": "(local build)",
"argv": [
"smartctl",
"--json",
...
but this doesn't work:
$ smartctl_exporter --smartctl.device='cciss,1 /dev/sdb'
ts=2022-12-02T09:40:45.718Z caller=main.go:90 level=info msg="Starting smartctl_exporter" version="(version=0.9.1, branch=HEAD, revision=a58c632ea8fa0f4f10a9ac9e941e610a7bb2efc1)"
ts=2022-12-02T09:40:45.718Z caller=main.go:91 level=info msg="Build context" build_context="(go=go1.19.3, user=root@fa2a9a938fb5, date=20221106-21:46:18)"
ts=2022-12-02T09:40:45.735Z caller=main.go:112 level=warn msg="Device unavailable" name="cciss,1 /dev/sdb"
ts=2022-12-02T09:40:45.735Z caller=main.go:119 level=info msg="No devices specified, trying to load them automatically"
ts=2022-12-02T09:40:45.735Z caller=main.go:124 level=error msg="No devices found"
@josefzahner The --smartctl.device flag in smartctl_exporter does not translate to the --device flag of smartctl. The exporter expects just the /dev/ node path. Also note that --device cciss,1 /dev/sdb is three distinct command-line arguments; you can't pass all of that to --smartctl.device.
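For reference, an invocation of the kind the exporter does accept looks roughly like this (a sketch only: the device paths are examples, and it assumes --smartctl.device can be repeated once per device node):
$ smartctl_exporter --smartctl.device=/dev/sda --smartctl.device=/dev/sdb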
How does one configure cciss,1? I need to do it on some of my nodes and have not found a way yet.
This is a gating factor for me too. I've added comments to the above issue and linked PR.
This is also an issue for me. I guess a proper solution would involve adding a separate flag to pass extra arguments through to smartctl.
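As a rough illustration only, such a pass-through option might look something like this (the --smartctl.extra-args flag is purely hypothetical and does not exist in smartctl_exporter today):
$ smartctl_exporter --smartctl.device=/dev/sdb --smartctl.extra-args='--device cciss,1'  # hypothetical flag, not implemented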
The tool should discover such HBAs and do so automagically at per-device granularity, since there can and will be a mixed population of direct-attach, passthrough, and hidden-by-VD drives on various systems, and especially within a given system.
smartmon.sh, for example, does this:
for device in ${device_list}; do
  disk="$(echo ${device} | cut -f1 -d'|')"
  type="$(echo ${device} | cut -f2 -d'|')"
  active=1
  echo "smartctl_run{disk=\"${disk}\",type=\"${type}\"}" "$(TZ=UTC date '+%s')"
  # Check if the device is in a low-power mode
  $SMARTCTL -n standby -d "${type}" "${disk}" > /dev/null || active=0
  echo "device_active{disk=\"${disk}\",type=\"${type}\"}" "${active}"
  # Skip further metrics to prevent the disk from spinning up
  test ${active} -eq 0 && continue
  # Get the SMART information and health
  $SMARTCTL -i -H -d "${type}" "${disk}" | parse_smartctl_info "${disk}" "${type}"
  # Get the SMART attributes
  case ${type} in
    sat) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_attributes "${disk}" "${type}" ;;
    sat+megaraid*) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_attributes "${disk}" "${type}" ;;
    scsi) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_scsi_attributes "${disk}" "${type}" ;;
    nvme) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_nvme_attributes "${disk}" "${type}" ;;
    megaraid*) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_scsi_attributes "${disk}" "${type}" ;;
    *)
      echo "disk type is not sat, scsi or megaraid but ${type}"
      exit
      ;;
  esac
done | format_output
Mind you, I *despise* RoC HBAs and would just as soon never have one, or would set passthrough/JBOD on legacy systems, but walking into an existing deployment of thousands I don't have the luxury of greenfield.
@jakubgs It's more than just extra flags, it's discovery too.
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
I no longer have HP HBAs, but it would be polite to architect however this is done in such a way that they could be supported later.
I hope to sunset RoC VDs through attrition, but that will take years :-/
Any way to do this yet?
I'd do it myself if I had the coding skills. It really is a fatal flaw. Mind you, HBA RAID is itself a fatal flaw, but Dell's BOSS-N1 is too useful, though one has to invoke 'mvcli' to get status.
I did a bit of research into this and found out that these devices can be found with smartctl by using -d scsi:
> smartctl --json --scan | jq -c '.devices[] | { name, protocol }'
jq: error (at <stdin>:21): Cannot iterate over null (null)
> smartctl --json --scan --device scsi | jq -c '.devices[] | { name, protocol }'
{"name":"/dev/sda","protocol":"SCSI"}
{"name":"/dev/sdb","protocol":"SCSI"}
{"name":"/dev/sdc","protocol":"SCSI"}
But there might be an even better way to identify those devices, and that is lsblk:
> lsblk --json -O | jq -c '.blockdevices[] | { path, hctl, subsystems }'
{"path":"/dev/sda","hctl":"0:1:0:0","subsystems":"block:scsi:pci"}
{"path":"/dev/sdb","hctl":"0:1:0:1","subsystems":"block:scsi:pci"}
{"path":"/dev/sdc","hctl":"0:1:0:2","subsystems":"block:scsi:pci"}
As we can see, the hctl field tells us what number to use for --device cciss,N, and subsystems tells us that scsi is being used, which together can be a pretty reliable heuristic for detecting an HBA.
And different host without HBA:
> lsblk --json -O | jq -c '.blockdevices[] | { path, hctl, subsystems }'
{"path":"/dev/nvme0n1","hctl":null,"subsystems":"block:nvme:pci"}
{"path":"/dev/nvme1n1","hctl":null,"subsystems":"block:nvme:pci"}
I don't know what the maintainers would think about using a tool other than smartctl for discovery, but lsblk is a pretty standard tool available on most systems, and we could still fall back to smartctl if it is unavailable.
I'm going to read the code a bit to see how difficult this would be.
The main issue, as far as I can tell, is that even if you discover the devices, you often won't get much info from them:
{
"json_format_version": [1, 0],
"smartctl": {
"version": [7, 2],
"svn_revision": "5155",
"platform_info": "x86_64-linux-5.15.0-79-generic",
"build_info": "(local build)",
"argv": ["smartctl", "-A", "--device", "cciss,1", "/dev/sdb", "--json"],
"exit_status": 0
},
"device": {
"name": "/dev/sdb",
"info_name": "/dev/sdb [cciss_disk_01] [SCSI]",
"type": "cciss",
"protocol": "SCSI"
},
"temperature": {
"current": 21,
"drive_trip": 70
},
"power_on_time": {
"hours": 47138,
"minutes": 5
},
"scsi_grown_defect_list": 0
}
Temperature and power-on time... not great.
Better than nothing, but yeah. I haven't had an HP HBA to work with for years, but regarding the scsi factor above, is the subject drive SAS? I would not be surprised if this would not surface SATA (but it might).
I'm increasingly leaning toward having a protégé write a SMART harvester from scratch in Python, which would make it easier to normalize the vagaries of the data that smartctl gives us. Then redirect the output into a file and let node_exporter's textfile collector snarf it up.
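For what it's worth, even without a new harvester the textfile route can be mocked up with a few lines of shell; a minimal sketch for one cciss drive, where the metric names and the collector directory are assumptions (use whatever --collector.textfile.directory points at):
# Sketch: expose temperature and power-on hours via node_exporter's textfile collector.
out=/var/lib/node_exporter/textfile_collector/smart_cciss.prom   # assumed path
smartctl --json -A --device cciss,1 /dev/sdb |
  jq -r '"smart_temperature_celsius{disk=\"cciss_1\"} \(.temperature.current)",
         "smart_power_on_hours{disk=\"cciss_1\"} \(.power_on_time.hours)"' > "${out}.tmp" \
  && mv "${out}.tmp" "${out}"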
Personally I'd rather fix what we have working than start from scratch. I'm busy enough dealing with what I already have running without finding time to reinvent wheels. Even if I get just temperature and power-on hours, that's better than the deployed SMART exporter failing at startup and Prometheus raising alerts for the downed service.
But your point about SATA/SAS is well made. I will have to check how that is done on my servers.