cockpit
cockpit copied to clipboard
storage: SMART support
Initial SMART self test support
Uses data available in UDisks http://storaged.org/doc/udisks2-api/latest/gdbus-org.freedesktop.UDisks2.Drive.Ata.html
https://issues.redhat.com/browse/COCKPIT-935
For now I added a new card to the disk detail which shows basic information and allows to run/abort self test.
Note that for now only (S)ATA devices are supported by Udisks, NVMe support will be added in the latest release https://github.com/storaged-project/udisks/releases/tag/udisks-2.10.0
@garrett I believe you wanted to work on mockups for this, any chance you could look into proper design?
Creating tests for this would be great, maybe udisks already has a way to mock it?
Yes, it does. SmartUpdate() has an atasmart_blob attribute where you can load a previously skdump --save'd blob. Of course you can also edit this in between, to have a few scenarios (passing and failing).
I am not sure how to test this. In order to run SmartUpdate udisks needs to expose org.freedesktop.UDisks2.Drive.Ata which is only available when the drive is (S)ATA and is likely to support SMART.
I am not sure how to test this. In order to run
SmartUpdateudisks needs to exposeorg.freedesktop.UDisks2.Drive.Atawhich is only available when the drive is (S)ATA and is likely to support SMART.
Neither does udisks test it (only if they find SMART disks), maybe because of the same reason?
Random googling https://stackoverflow.com/questions/48351096/how-to-emulate-a-sata-disk-drive-in-qemu Hmm :)
-drive id=disk,file=IMAGE.img,if=none \
-device ahci,id=ahci \
-device ide-hd,drive=disk,bus=ahci.0
So my suggestion would be to yolo hack Virtmachine.add_disk in (bots), make sure your cockpit checkout uses the changed bots and try to get TEST_DISK_XML to use AHCI. If that even supports SMART faking, if not we can't do this I suppose :(.
FYI udiskctl has a smart simulate options:
[root@archlinux ~]# udisksctl --help
Unknown command `--help'
Usage:
udisksctl COMMAND
Commands:
help Shows this information
info Shows information about an object
dump Shows information about all objects
status Shows high-level status
monitor Monitor changes to objects
mount Mount a filesystem
unmount Unmount a filesystem
unlock Unlock an encrypted device
lock Lock an encrypted device
loop-setup Set-up a loop device
loop-delete Delete a loop device
power-off Safely power off a drive
smart-simulate Set SMART data for a drive
@tomasmatus Are you still interested in working on this? It's quite a nice feature.
Wow its been a year already... yes, this is something i should revive @martinpitt
If i recall correctly the issue was tests, so trying to experiment with udisksctl smart-simulate sounds like a good idea to move this forward.
So, I think having SMART info is good. I remember I suggested it a long time ago in some issue and/or mockup. I don't remember where or what though. Where would this go?
It's usually stylized as SMART, not S.M.A.R.T. (which is just awkward). At least in English, acronyms and initialisms usually do not include periods. https://en.wikipedia.org/wiki/Acronym — and acronyms that are proper names (including SMART) or commonly used are generally not supposed to even have periods... and they're usually (but not always) all in uppercase. It's kind of weird with all kinds of special-casing, however; Wikipedia's acronym page actually has a "nomenclature" section with some examples: https://en.wikipedia.org/wiki/Acronym#Nomenclature — and it even covers pronunciation a bit (when some are pronounced and some are spelled out... and even some that are a mixture, like "JPEG" and "MS-DOS").
As far as being detailed; I think we should show an overview and use a "progressive disclosure" concept. (That is, show the important, common stuff first and then click somewhere to see everything.)
Since it looks like you're picking this up, it would be great to talk about the design of it. :+1:
I know we had the concept of SMART errors propagating all the way up to the health card on the overview too (which would then link to the details).
I am interested in this feature. Is there anything that can be done to progress the PR?
@SchoolGuy: A recent update related to this PR: @tomasmatus is back and is planning on picking this up (as you can see in an above comment). I will be working with him on the design. If you'd like to help out, that'd be fine as well.
In what ways are you interested in SMART? What data are you looking for, specifically? And when do you consult SMART data, usually? (From time to time or only when there are strange things going on as a diagnosis tool?)
@garrett I am no designer by trade as such I am not sure on what part of the design I may be of assistance. I should be able to dig into the code and help out with implementing an idea that somebody else had. If that is not available I may be able to design things on my own but they may be hard to use...
I am interested in SMART in two ways:
- The binary quick one: Is the drive currently "failed" due to one of the measurements failing a threshold or is everything fine.
- The longer one: Use the WebUI as a replacement for the CLI, aka look at the individual output of the metrics and decide for myself if I want to replace a drive due to me interpreting the values myself.
The idea for this specific project is to have Cockpit running as a WebUI for a single host with an mdadm array running and that resulting device is exposed as an NFS share that other hosts will use. In short: I want to use the host as a NAS.
I will grab the SMART values additionally with the smartd exporter and will let Prometheus grab that but since that will be hosted by one of the other hosts I want to be able to know if a failure there is related to the mdadm array being down or by a configuration error.
I am a quite paranoid person (regarding my infrastructure) and as such I am checking my monitoring quite regularly but it is not a fixed schedule I have.
I'm not a developer but I'm a Fedora packager and would be very interested in seeing this in Fedora/RHEL.
I can try to find some time to get back around to this to make some mockups. But the gist is that we need it in two places:
- Detailed view with just a few of the most important items shown by default, with a way to expand or view more to get to all of the stats.
- Current status in the health card on the overview page.
FWIW, here are the design ideas for disk issues (from a long time ago): https://github.com/cockpit-project/cockpit/issues/8787#issuecomment-789855897
SMART is mentioned, with regard to showing it in the health card:
- warning: SMART errors that don't stop functioning of disk
- error: SMART found bad sectors (disk is dying; data loss may have happened
We probably want to have some kind of status API where the overview page can query the storage page to get the SMART info and show it on the health card if there is a problem (and only if there is a problem), similar to the updates being on the health card when there are updates. (The software updates page is queried for this.)
Perfect. This is already a great amount of detail.
@tomasmatus how to you want to organize so we don't run into conflicts working on the same files?
@garrett I am no designer by trade as such I am not sure on what part of the design I may be of assistance. I should be able to dig into the code and help out with implementing an idea that somebody else had. If that is not available I may be able to design things on my own but they may be hard to use...
I am interested in SMART in two ways:
1. The binary quick one: Is the drive currently "failed" due to one of the measurements failing a threshold or is everything fine.
That fails into the scope of Cockpit!
2. The longer one: Use the WebUI as a replacement for the CLI, aka look at the individual output of the metrics and decide for myself if I want to replace a drive due to me interpreting the values myself.
What metrics are you interested in? UDisks provides some basic health metrics as seen in the screenshot on top. For other attributes we have to use SmartGetAttributes, so if you can give us a list of interesting attributes and maybe a busctl command with some data, that would be interesting and helpful.
The idea for this specific project is to have Cockpit running as a WebUI for a single host with an mdadm array running and that resulting device is exposed as an NFS share that other hosts will use. In short: I want to use the host as a NAS.
I will grab the SMART values additionally with the smartd exporter and will let Prometheus grab that but since that will be hosted by one of the other hosts I want to be able to know if a failure there is related to the mdadm array being down or by a configuration error.
Cockpit will only support showing SMART values for a single host, the host you connect too. We don't want to support anything outside of the UDisks API
For reference, udisks reports only this information from SMART:
$ udisksctl dump | grep -i smart
SmartCriticalWarning:
SmartPowerOnHours: 71
SmartSelftestPercentRemaining: -1
SmartSelftestStatus: success
SmartTemperature: 319
SmartUpdated: 1721742479
It's not much, but it's probably enough in most cases? If this is all, then we can't do the expand idea, unless we get more information, like from smartctl.
Is it worth looking to see how gnome-disk-utility looks up the extra data? https://gitlab.gnome.org/GNOME/gnome-disk-utility
Is it worth looking to see how gnome-disk-utility looks up the extra data? https://gitlab.gnome.org/GNOME/gnome-disk-utility
See my comment:
What metrics are you interested in? UDisks provides some basic health metrics as seen in the screenshot on top. For other attributes we have to use SmartGetAttributes, so if you can give us a list of interesting attributes and maybe a busctl command with some data, that would be interesting and helpful.
I've looked into testing this in a test virtual machine, the main issue is that even if I add a sata disk, udisks does not seem to identify this as a ATA drive:
On my nas
/org/freedesktop/UDisks2/drives/$ID:
org.freedesktop.UDisks2.Drive:
CanPowerOff: true
<snip>
org.freedesktop.UDisks2.Drive.Ata:
AamEnabled: false
Attaching a SATA disk does allow me to use smartctl on it.
I like the data that is presented here, including the ability to start a SMART test:
the data that is presented here
Right, that's the extended data that we're definitely not going to show by default. It's way too verbose and misleading.
We could, in a third stage, add all the details similar to that, possibly in a modal dialog similar to that. But, we should handle SMART in these steps:
- Basic details
- filtered list of what udisks gives us by default (mentioned above)... filtered to show only relevant information (for example: don't show progress of scan unless a scan is happening; don't show errors unless there are errors, etc.)
- the basic list udisks gives us by default seems to be reasonable
- this would likely be a key + value pair in a description list with a heading
- Overview integration
- only show with warnings and errors, with a link back to the basic details
- A way to get detailed information
- a "progressive disclosure" concept, where we show the important first (already implemented in step 1) and a way to see more information
- examples: this could be something like an expander widget at the bottom of the list or a modal (we'll need to figure this out still, but it can be deferred until after step 1 and 2 land)
This first PR should implement step # 1. Then we need to do a follow-up for step # 2 (overview). Then eventually add a step # 3 for more details.
We should not do this all in one go, nor should we just show all the data points SMART can give us. (Many are very misleading, scary, or not relevant depending on the circumstance. And it varies per manufacturer and storage medium (spinning disks of different types, SSD, NVMe, etc.). It's a mess, really.)
(I think @tomasmatus and I will talk about this more either later this week or next week.)
For testing SMART in CI, there is a udisks issue now https://issues.redhat.com/browse/STORAGECFG-801
For testing SMART in CI, there is a udisks issue now https://issues.redhat.com/browse/STORAGECFG-801
Please check this release out: https://github.com/storaged-project/udisks/releases/tag/udisks-2.10.90
Random googling https://stackoverflow.com/questions/48351096/how-to-emulate-a-sata-disk-drive-in-qemu Hmm :)
-drive id=disk,file=IMAGE.img,if=none \ -device ahci,id=ahci \ -device ide-hd,drive=disk,bus=ahci.0
This works now in UDisks, however it's still an ide-hd device emulation and even though it's connected through an emulated AHCI controller, it still doesn't talk the SAT (SCSI / ATA Translation) protocol. This was the culprit and to some degree it still is as UDisks only supports SAT and nothing else (looking for an expert to make this happen!). Similarly, udev's ata_id also supports SAT only but has a bug in the code that makes this work.
However I've found this hidden workaround in libatasmart to make this work with qemu devices: https://github.com/storaged-project/libblockdev/blob/master/src/plugins/smart/smart-common.c#L131-L167
Let me know if it helps your testing use case.
I installed newest udisks from copr and played a bit with virsh attach-disk and different --targetbus x types. I found out that --targetbus sata does not support hotplugging so if I add the following to TEST_DOMAIN_XML in bots I am able to get a disk that shows as drive ata and has SMART info available
<disk type='file'>
<driver name='qemu' type='qcow2' cache='unsafe'/>
<source file='/home/tmatus/projects/storage-img.qcow2'/>
<target dev='sda' bus='sata'/>
<serial>SATADISK</serial>
<boot order='3'/>
</disk>
After running the VM I can simulate SMART info with udisksctl smart-simulate. What I can't do is start a SMART test. (I suppose that makes sense when it's all simulated). For other targetbus types I'm not seeing any SMART data.
-drive id=disk,file=IMAGE.img,if=none
-device ahci,id=ahci
-device ide-hd,drive=disk,bus=ahci.0
I tried to look into this as that is for a different tool, not virsh. I didn't find any mentions of ahci in virsh so can't say for sure if that is better or same solution.
After running the VM I can simulate SMART info with
udisksctl smart-simulate. What I can't do is start a SMART test. (I suppose that makes sense when it's all simulated).
Yeah, trigerring SMART tests wouldn't work on any emulated device as far as I know.
For other
targetbustypes I'm not seeing any SMART data.
You'd likely need to tweak the ID_ATA_SMART_ACCESS udev attribute to reflect the actual protocol for the qemu emulated device. It really depends what protocol qemu actually implements.
-drive id=disk,file=IMAGE.img,if=none -device ahci,id=ahci -device ide-hd,drive=disk,bus=ahci.0
I tried to look into this as that is for a different tool, not virsh. I didn't find any mentions of
ahciin virsh so can't say for sure if that is better or same solution.
That's a qemu commandline, as libvirt is just a wrapper, it'll likely have corresponding features in a slightly different syntax (i.e. sata == ahci in most cases).
(notes to self, will update with more later)
Testing this on my trusty WD friend:
I am able to start the test and see "live" updates on the progress status:
It's not fully live progress bar. After talking to @tbzatek he mentioned this kind of depends on the disk how it reports it. In my case it seems to jump in +10% or +20% increments. It could also depend on how often cockpit receives updates from udisks.
I can abort the test while it's running:
I can see one bad sector on my SSD. The disk still reports itself as healthy.
We need udisks-nightly or well the udisks pre-release for testing this feature as it now uses udev instead of direct drive access for determing disk types etc. This apparently works better for testing emulating SMART data in a virtual machine.
We have two options to get this tested:
- A new scenario, which like /nightly installs udisks from copr. The downside is that we test against a moving target in PR's
- Downloading the udisks nightly packages on the Fedora image like we for example do for clevis packages in the RHEL image (the reason escapes me)
dnf download --downloaddir=/var/lib/package-sets/clevis $PACKAGE_SET_CLEVIS
In the test you would install those packages and kinda pray you have no dependency resolving issues :smile:
Now for adding the SATA disk, that's a good question....
It is officially not supported.
We can't easily override the so an idea I had was:
class TestSmart(testlib.MachineCase):
provision = {
"0": { "drive_bus": "sata" },
}
And then you'll need to hack on your own bots checkout:
class VirtMachine(Machine):
networking: dict[str, Any]
virt_connection: libvirt.virConnect | None
_transient_image: IO[bytes] | None
_domain: libvirt.virDomain | None
network: int | None = None
memory_mb: int | None = None
cpus: int | None = None
is_efi: bool = False
image_file: str
run_dir: str
def __init__(
self,
image: str,
networking: dict[str, Any] | None = None,
maintain: bool = False,
memory_mb: int | None = None,
cpus: int | None = None,
capture_console: bool = True,
**kwargs: Any
):
if kwargs.get("disk_bus"):
something_smart()
Basically the domain XML to get a drive_bus format argument
test_domain_desc = TEST_DOMAIN_XML.format(**keys)
Thanks @jelly. I agree that we should only throw so much effort on automating the test here -- if it's not possible with current udisks, there's that. But as far as I understand it, this is exclusively about testing with a mock smart device, right? i.e. current udisks can trigger smart in current versions, just not inject mock data?
Please test it carefully on actual hardware. The tests should then do a feature test (like udisks version) so that it can run in the /nightly scenario -- you can easily trigger that in PRs. Then we have at least one place where it's covered, and over time it will flow into the distros. IMHO adding a package set etc. is overkill.
But as far as I understand it, this is exclusively about testing with a mock smart device, right? i.e. current udisks can trigger smart in current versions, just not inject mock data?
Correct, current udisks works with the real hardware. The newer version (pre-release) is only needed only for virtual disks so I can mock something in the test. @martinpitt
Please test it carefully on actual hardware.
I am trying to do this as much as I can. I can do that on my laptop with fedora 41 and desktop with arch which covers all three cases of HDD, SATA SSD and NVMe device.