node_exporter
Add btrfs device error stats
Makes ioctl syscalls to load the device error stats.
On the command line, you'd normally see this kind of info via btrfs device stats /mnt/myFs, for example:
sudo btrfs device stats /mnt/fakebtrfs2
[/dev/loop1].write_io_errs 0
[/dev/loop1].read_io_errs 0
[/dev/loop1].flush_io_errs 0
[/dev/loop1].corruption_errs 0
[/dev/loop1].generation_errs 0
[/dev/loop2].write_io_errs 0
[/dev/loop2].read_io_errs 0
[/dev/loop2].flush_io_errs 0
[/dev/loop2].corruption_errs 0
[/dev/loop2].generation_errs 0
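For readers unfamiliar with the ioctl route, here is a minimal Go sketch of what such a call could look like. It is not the code from this PR. The magic number, command number and struct layout are assumptions taken from the kernel's include/uapi/linux/btrfs.h and should be checked against your headers; iowr uses the generic ioctl encoding found on x86 and arm, and the names devStatsArgs, iowr and readDevStats are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// Constants assumed from include/uapi/linux/btrfs.h; verify before use.
const (
	btrfsIoctlMagic  = 0x94 // BTRFS_IOCTL_MAGIC
	getDevStatsNr    = 52   // command number of BTRFS_IOC_GET_DEV_STATS
	devStatValuesMax = 5    // BTRFS_DEV_STAT_VALUES_MAX
)

// statNames lists the counters in the order the kernel is assumed to fill them.
var statNames = [devStatValuesMax]string{"write", "read", "flush", "corruption", "generation"}

// devStatsArgs mirrors the assumed layout of struct btrfs_ioctl_get_dev_stats.
type devStatsArgs struct {
	DevID   uint64                             // in: btrfs device id
	NrItems uint64                             // in/out: number of counters
	Flags   uint64                             // in/out
	Values  [devStatValuesMax]uint64           // out: the counters
	_       [128 - 2 - devStatValuesMax]uint64 // padding
}

const argsSize = unsafe.Sizeof(devStatsArgs{})

// iowr builds a read/write ioctl request code (generic Linux encoding).
func iowr(typ, nr, size uintptr) uintptr {
	const dirReadWrite uintptr = 3
	return dirReadWrite<<30 | size<<16 | typ<<8 | nr
}

// readDevStats asks the filesystem mounted at mountPoint for one device's counters.
func readDevStats(mountPoint string, devID uint64) (map[string]uint64, error) {
	f, err := os.Open(mountPoint)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	args := devStatsArgs{DevID: devID, NrItems: devStatValuesMax}
	req := iowr(btrfsIoctlMagic, getDevStatsNr, argsSize)
	_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(), req, uintptr(unsafe.Pointer(&args)))
	if errno != 0 {
		return nil, errno
	}
	stats := make(map[string]uint64, devStatValuesMax)
	for i, name := range statNames {
		stats[name] = args.Values[i]
	}
	return stats, nil
}

func main() {
	fmt.Printf("request code: %#x, argument size: %d bytes\n",
		iowr(btrfsIoctlMagic, getDevStatsNr, argsSize), argsSize)
	if len(os.Args) < 2 {
		return // no mount point given: only show the computed request code
	}
	stats, err := readDevStats(os.Args[1], 1) // devid 1 chosen arbitrarily
	if err != nil {
		fmt.Fprintln(os.Stderr, "ioctl failed:", err)
		os.Exit(1)
	}
	fmt.Println(stats)
}
```

Run without arguments, this only prints the computed request code and struct size; passing a btrfs mount point (for example /mnt/fakebtrfs2) attempts the real call, which may need root.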
All feedback appreciated!
Requested in #2173 & #1100
I should add: instead of adding the ioctl code to this repo, I could introduce a dependency on https://github.com/dennwc/btrfs, but I don't know much about the package (or about current Go dependency management tooling). I'm not sure which would be preferable here!
@discordianfish/@SuperQ I could do with some steer from y'all on a couple of points:
- Keep the ioctl calls here, or try and use https://github.com/dennwc/btrfs?
- Linux 5.14 adds the device error stats to sysfs; is there value in an ioctl fallback for older versions?
Sorry to chase, but could do with some guidance here @SuperQ, @discordianfish? Thanks!
Sorry for not seeing this earlier!
- Keep the ioctl calls here, or try and use https://github.com/dennwc/btrfs?
  Using https://github.com/dennwc/btrfs sounds like a good idea.
- Linux 5.14 adds the device error stats to sysfs; is there value in an ioctl fallback for older versions?
  Yes, ideally we should support older versions as well.
Linux 5.14 adds the device error stats to sysfs
I did a bit more investigation on this; TLDR: using sysfs alone we can get the error counts, but we can't attribute them nicely to a specific device! The operator would have to manually do the extra legwork to match the two.
/sys/fs/btrfs/<uuid>/devinfo/<deviceId>/error_stats contains the following
write_errs 0
read_errs 0
flush_errs 0
corruption_errs 0
generation_errs 0
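Reading that file is straightforward. A minimal Go sketch, assuming the file always consists of "name value" lines as shown above (parseErrorStats is a hypothetical helper, not part of node_exporter or procfs):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseErrorStats parses "name value" lines such as "write_errs 0"
// into a map from counter name to value.
func parseErrorStats(content string) (map[string]uint64, error) {
	stats := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 2 {
			return nil, fmt.Errorf("unexpected line %q", sc.Text())
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in line %q: %w", sc.Text(), err)
		}
		stats[fields[0]] = v
	}
	return stats, sc.Err()
}

func main() {
	// Sample content copied from the error_stats listing above.
	sample := "write_errs 0\nread_errs 0\nflush_errs 0\ncorruption_errs 0\ngeneration_errs 0\n"
	stats, err := parseErrorStats(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(len(stats), "counters, write_errs =", stats["write_errs"])
}
```

In real use the content would come from os.ReadFile on the error_stats path; the sketch takes a string so it can be tried without a btrfs filesystem.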
Unfortunately, the deviceIds look like this
$ ls -U1 /sys/fs/btrfs/2dab4553-c77e-4bd7-92f1-fd916b08093d/devinfo/
5
1
4
2
Notably, these numbers match the devid values we see in the ioctl calls:
ioctl(3, BTRFS_IOC_GET_DEV_STATS, {devid=makedev(0, 0x1), nr_items=5, flags=0} => {nr_items=5, flags=0, [[BTRFS_DEV_STAT_WRITE_ERRS] = 0, [BTRFS_DEV_STAT_READ_ERRS] = 0, [BTRFS_DEV_STAT_FLUSH_ERRS] = 0, [BTRFS_DEV_STAT_CORRUPTION_ERRS] = 0, [BTRFS_DEV_STAT_GENERATION_ERRS] = 0]}) = 0
readlink("/dev/sdd", 0x7ffe49e55240, 1023) = -1 EINVAL (Invalid argument)
ioctl(3, BTRFS_IOC_GET_DEV_STATS, {devid=makedev(0, 0x2), nr_items=5, flags=0} => {nr_items=5, flags=0, [[BTRFS_DEV_STAT_WRITE_ERRS] = 186, [BTRFS_DEV_STAT_READ_ERRS] = 0, [BTRFS_DEV_STAT_FLUSH_ERRS] = 0, [BTRFS_DEV_STAT_CORRUPTION_ERRS] = 0, [BTRFS_DEV_STAT_GENERATION_ERRS] = 0]}) = 0
readlink("/dev/sda", 0x7ffe49e55240, 1023) = -1 EINVAL (Invalid argument)
ioctl(3, BTRFS_IOC_GET_DEV_STATS, {devid=makedev(0, 0x4), nr_items=5, flags=0} => {nr_items=5, flags=0, [[BTRFS_DEV_STAT_WRITE_ERRS] = 0, [BTRFS_DEV_STAT_READ_ERRS] = 0, [BTRFS_DEV_STAT_FLUSH_ERRS] = 0, [BTRFS_DEV_STAT_CORRUPTION_ERRS] = 0, [BTRFS_DEV_STAT_GENERATION_ERRS] = 0]}) = 0
readlink("/dev/sdc", 0x7ffe49e55240, 1023) = -1 EINVAL (Invalid argument)
ioctl(3, BTRFS_IOC_GET_DEV_STATS, {devid=makedev(0, 0x5), nr_items=5, flags=0} => {nr_items=5, flags=0, [[BTRFS_DEV_STAT_WRITE_ERRS] = 0, [BTRFS_DEV_STAT_READ_ERRS] = 0, [BTRFS_DEV_STAT_FLUSH_ERRS] = 0, [BTRFS_DEV_STAT_CORRUPTION_ERRS] = 0, [BTRFS_DEV_STAT_GENERATION_ERRS] = 0]}) = 0
Which is at odds with the devices list:
$ ls -U1 /sys/fs/btrfs/2dab4553-c77e-4bd7-92f1-fd916b08093d/devices
sdd
sdb
sdc
sda
I checked whether the sorting orders were aligned (by stracing btrfs device stats), but they don't seem to be.
The devinfo folder does not contain any useful device IDs:
in_fs_metadata
scrub_speed_max
replace_target
writeable
missing
error_stats
There might be a way to map these 'deviceIDs' back to the same thing, but I don't know Linux internals well enough!
What we usually do in these situations is expose an info metric that can be joined with the actual metrics. Something like device_info{id="2dab4553-c77e-4bd7-92f1-fd916b08093d", device="/dev/sda"} 1. But yeah, we'd need to find out how to get the device mapping in the first place.
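To illustrate the idea, here is a small Go sketch of what such a join does. In PromQL it would typically be written with on(...) group_left(...) against the info metric. The devKey type, the joinDeviceNames helper and the devid-to-device mapping are all hypothetical; as noted, finding the real mapping is the open question.

```go
package main

import "fmt"

// devKey identifies one btrfs device by filesystem UUID and devinfo id.
// Both series would need to carry these as labels for a join to work.
type devKey struct {
	FSUUID string
	DevID  uint64
}

// joinDeviceNames attaches a device name to each counter whose key also
// appears in info; counters without a matching info entry are dropped,
// which is also what a plain PromQL join would do.
func joinDeviceNames(errs map[devKey]uint64, info map[devKey]string) map[string]uint64 {
	out := make(map[string]uint64)
	for k, v := range errs {
		if name, ok := info[k]; ok {
			out[name] = v
		}
	}
	return out
}

func main() {
	const fs = "2dab4553-c77e-4bd7-92f1-fd916b08093d"
	// Write error counts keyed by devid, as sysfs would provide them.
	errs := map[devKey]uint64{{fs, 1}: 0, {fs, 2}: 186}
	// Hypothetical mapping; the real one is not known from sysfs alone.
	info := map[devKey]string{{fs, 1}: "/dev/sdd", {fs, 2}: "/dev/sda"}
	for name, n := range joinDeviceNames(errs, info) {
		fmt.Printf("%s write_errs=%d\n", name, n)
	}
}
```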
Thanks! I think we should be able to avoid that, we'll just need to use ioctl instead of sysfs.
If I find the time, I'll try posting on the btrfs mailing list to make them aware of the usability issues.
@discordianfish would you mind approving the CI? :)
@SuperQ & @discordianfish I think I'm ready for a review :)
Here's a short script to set up a tiny btrfs FS, which helps for testing this out locally!
truncate -s1G 1GB-1.img
truncate -s1G 1GB-2.img
ld1=$(sudo losetup --show --find 1GB-1.img); echo "$ld1"
ld2=$(sudo losetup --show --find 1GB-2.img); echo "$ld2"
sudo mkfs -t btrfs "$ld1" "$ld2"
sudo mkdir /mnt/fakebtrfs
sudo mount "$ld1" /mnt/fakebtrfs
Thanks for the feedback!
Hi @SuperQ! Sorry to chase, but do you have time to review or advise on next steps? Thanks!
Hi @leth, I gave your branch leth/node_exporter/tree/btrds-device-stats a try and I get btrfs metrics as usual, but no error metrics (except for the generic node_filesystem_device_error). My filesystem doesn't have any errors though. Do the error metrics only appear when the counters are >0 or am I holding it wrong? :)
am I holding it wrong? :)
I was holding it wrong and successfully tested the master branch. :cocktail: Using the correct branch gives me:
node_btrfs_device_errors_total{device="sdc1",device_uuid="01d2fa5e-0994-4d29-add7-8596122817ca",type="corruption",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 0
node_btrfs_device_errors_total{device="sdc1",device_uuid="01d2fa5e-0994-4d29-add7-8596122817ca",type="flush",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 0
node_btrfs_device_errors_total{device="sdc1",device_uuid="01d2fa5e-0994-4d29-add7-8596122817ca",type="generation",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 0
node_btrfs_device_errors_total{device="sdc1",device_uuid="01d2fa5e-0994-4d29-add7-8596122817ca",type="read",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 0
node_btrfs_device_errors_total{device="sdc1",device_uuid="01d2fa5e-0994-4d29-add7-8596122817ca",type="write",uuid="d163af2f-6e03-4972-bfd6-30c68b6ed312"} 0
..which looks more or less as expected, though I'm not sure where the device_uuid comes from. I only have a single device sdc1 for my btrfs fs, and its UUID is d163af2f-.. not 01d2fa5e-..
$ ls -l /sys/fs/btrfs/d163af2f-6e03-4972-bfd6-30c68b6ed312/devices
total 0
lrwxrwxrwx 1 root root 0 Jul 20 21:28 sdc1 -> ../../../../devices/pci0000:00/0000:00:1f.2/ata6/host5/target5:0:0/5:0:0:0/block/sdc/sdc1
Ah! Thanks! Whoops, would you mind checking if it's the partition uuid?
In reply to Holger Hoffstätte's comment of 23 Jul 2022, 19:00 (https://github.com/prometheus/node_exporter/pull/2193#issuecomment-1193164661).
Indeed it is!
$ blkid /dev/sdc1
/dev/sdc1: LABEL="Backups" UUID="d163af2f-6e03-4972-bfd6-30c68b6ed312" UUID_SUB="01d2fa5e-0994-4d29-add7-8596122817ca" BLOCK_SIZE="4096" TYPE="btrfs" PARTUUID="23109a24-e06a-49df-94bd-9035efdc1c9f"
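To compare these identifiers programmatically rather than by eye, the blkid line can be split into its tags. A minimal Go sketch, assuming the KEY="value" output format shown above (parseBlkidLine is a hypothetical helper):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagRe matches one KEY="value" pair in blkid's default output.
var tagRe = regexp.MustCompile(`(\w+)="([^"]*)"`)

// parseBlkidLine splits a line like `/dev/sdc1: UUID="..." TYPE="btrfs"`
// into the device path and a map of its tags.
func parseBlkidLine(line string) (string, map[string]string) {
	device := line
	if i := strings.Index(line, ": "); i >= 0 {
		device = line[:i]
	}
	tags := make(map[string]string)
	for _, m := range tagRe.FindAllStringSubmatch(line, -1) {
		tags[m[1]] = m[2]
	}
	return device, tags
}

func main() {
	// Line copied from the blkid output above.
	line := `/dev/sdc1: LABEL="Backups" UUID="d163af2f-6e03-4972-bfd6-30c68b6ed312" UUID_SUB="01d2fa5e-0994-4d29-add7-8596122817ca" BLOCK_SIZE="4096" TYPE="btrfs" PARTUUID="23109a24-e06a-49df-94bd-9035efdc1c9f"`
	device, tags := parseBlkidLine(line)
	fmt.Println("device:  ", device)
	fmt.Println("UUID:    ", tags["UUID"])     // matches the uuid label
	fmt.Println("UUID_SUB:", tags["UUID_SUB"]) // matches the device_uuid label
	fmt.Println("PARTUUID:", tags["PARTUUID"]) // matches neither label
}
```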
Hmm, I've been trying to find an explanation of UUID_SUB, without much success.
/dev/disk/by-uuid definitely points to the UUID field as identifying the disk.
It's not the partition ID, as your blkid has that as a different value.
I'm tempted to drop the label, but I'm worried that without it the metrics for two different disks could overlap (if the remaining path part is non-unique for some reason?). I'm guessing quite a bit here; perhaps it is btrfs's internal identifier for this disk?
I'm gonna rename device_uuid to btrfs_dev_uuid.
Hrm, the tests seem stuck, going to try and un-stick them.
The codespell error is unrelated; it is due to an out-of-date branch here.
Thank you!
On 24 Sep 2022, at 07:25, Ben Kochie wrote:
Merged #2193 into master.
https://github.com/prometheus/node_exporter/pull/2193#event-7451433153