smartctl_exporter

Metrics for SMART logs are no longer collected

Open lahwaacz opened this issue 3 years ago • 10 comments

The --xall parameter was removed in https://github.com/prometheus-community/smartctl_exporter/commit/e88442081c18fcac9025e4d806786e12c791cc10, but no --log parameter was added to replace it. Hence, the smartctl_device_num_err_log_entries, smartctl_device_error_log_count, smartctl_device_self_test_log_count, and smartctl_device_self_test_log_error_count metrics stay empty, because smartctl does not report the relevant data to the exporter.

lahwaacz avatar Oct 30 '22 14:10 lahwaacz

Adding --log=error --log=selftest seems to fix this (tested on master). I don't know whether this wakes up drives; according to the docs, it shouldn't.

If someone wants to test it, run this on a device that is in a sleep state: smartctl --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error --log=selftest <device>

--format=brief is quite redundant with --json

tekert avatar Jul 23 '23 14:07 tekert

Yeah, when we have a drive that fails a self test, it seems that without --log=selftest, there's no way for us to know that an otherwise fine drive has had any problem.

At the very least, in cases like this, it would be nice to get a smartctl.exit_status of something other than zero.

With --log=selftest included, we get an exit_status of 128.
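For context on that value: smartctl's exit status is a bitmask, and 128 corresponds to bit 7, which the smartctl(8) man page describes as the self-test log containing records of errors. A minimal Python sketch of decoding it follows; the `decode_exit_status` helper and the shortened descriptions are illustrative, not part of smartctl or the exporter.

```python
# Bit meanings paraphrased from the EXIT STATUS section of smartctl(8).
# The dictionary and helper names are illustrative, not an existing API.
EXIT_STATUS_BITS = {
    0: "command line did not parse",
    1: "device open failed, or device is in a low-power mode",
    2: "a SMART or other ATA command failed, or checksum error",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(status):
    """Return the descriptions of every bit set in a smartctl exit status."""
    return [
        meaning
        for bit, meaning in EXIT_STATUS_BITS.items()
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    for meaning in decode_exit_status(128):
        print(meaning)
```

Because the value is a bitmask, a drive with entries in both the error log and the self-test log would report 192 (64 + 128) rather than 128.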

koebbe avatar Oct 02 '23 15:10 koebbe

Even with --nocheck=never, on a sample drive that's loaded to 100% IO, smartctl returns different output with and without the --xall option.

We need to bring back --xall to get the information needed to populate these fields.

Here's the JSON with and without --xall using --nocheck=never. (Changing nocheck doesn't have an effect in this case; this drive is never idle due to its workload.) Diff included for ease of review.

smartctl-_dev_sdb-info.health.attributes.tolerance_verypermissive.nocheck_never.format_brief.log_error.xall.json smartctl-_dev_sdb-info.health.attributes.tolerance_verypermissive.nocheck_never.format_brief.log_error.json
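For anyone who wants to repeat this comparison on their own drives, here is a rough Python sketch that lists the JSON key paths present in one smartctl output but absent from the other. The helper names (`key_paths`, `missing_keys`) and the command-line handling are assumptions for illustration; it only compares dictionary keys and does not descend into lists.

```python
import json
import sys


def key_paths(node, prefix=""):
    """Collect dotted key paths for every nested dict key in a JSON document."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    return paths


def missing_keys(with_xall, without_xall):
    """Return the key paths that appear only in the --xall output, sorted."""
    return sorted(key_paths(with_xall) - key_paths(without_xall))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (file names are placeholders): python diff_keys.py xall.json plain.json
    with open(sys.argv[1]) as f_with, open(sys.argv[2]) as f_without:
        for path in missing_keys(json.load(f_with), json.load(f_without)):
            print(path)
```

Running it against the two attached JSON files should give a compact list of the sections that disappear without --xall, which is easier to scan than a full diff.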

robbat2 avatar Oct 16 '23 16:10 robbat2

smartctl-xall-json.patch.gz Sorry, GitHub would not let me attach the patch unless I compressed it.

robbat2 avatar Oct 16 '23 17:10 robbat2

@SuperQ @NiceGuyIT should we consider this a smartctl bug or a tradeoff we have to ask users to make?

If users want these metrics, they have to consider that the metrics might wake a drive and prevent idle.

robbat2 avatar Dec 07 '23 17:12 robbat2

@robbat2 I hope to do a deep dive into this and a few other issues before or around the holidays. This issue might be related to #152, which was caused by PR #131 that introduced --log=error. Since you added the Python script to save a redacted version of the smartctl output, I was going to modify that to compare the output across the different smartctl switches so that we can make a logical step forward. I'd rather not play whack-a-mole with the smartctl switches. If there's a tradeoff, it can be documented and left up to the user, while at the same time reported upstream to see what Smartmontools thinks.

NiceGuyIT avatar Dec 12 '23 02:12 NiceGuyIT

OK, so I was directed here from #190. Seeing as this issue is 1.5 years old, what is the verdict here? As it stands, smartctl_exporter as a project is more or less useless because it fails to collect most of the actually interesting metrics.

intelfx avatar Apr 27 '24 01:04 intelfx

@NiceGuyIT did you make any progress on it? On all of the drives I tried, there's less data without --xall. I think we should introduce an exporter option like --wake-drives-for-more-data that passes --xall to smartctl, so that the output is complete again. We just need to document it as potentially waking drives (most of the fleet I care about is never idle anyway).
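To make the proposal concrete, a rough sketch in Python of how such an opt-in could shape the smartctl command line. This is not the exporter's actual code (the exporter is written in Go); the function name and the `wake_drives_for_more_data` parameter are hypothetical, and the base flags are simply the ones quoted earlier in this thread.

```python
def build_smartctl_args(device, wake_drives_for_more_data=False):
    """Build a smartctl argument list; --xall is added only when opted in.

    Hypothetical helper for illustration. The base flags are those quoted
    earlier in this thread, not necessarily what the exporter passes today.
    """
    args = [
        "smartctl",
        "--json",
        "--info",
        "--health",
        "--attributes",
        "--tolerance=verypermissive",
        "--nocheck=standby",
        "--format=brief",
        "--log=error",
    ]
    if wake_drives_for_more_data:
        # Assumption from this thread: --xall yields more data but may wake drives.
        args.append("--xall")
    args.append(device)
    return args


if __name__ == "__main__":
    print(" ".join(build_smartctl_args("/dev/sdb", wake_drives_for_more_data=True)))
```

Keeping the default off would preserve the current behaviour for users who care about idle drives, and everyone else could opt in.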

robbat2 avatar Apr 28 '24 22:04 robbat2