node_exporter
ZFS: io stats removed
Host operating system:
OpenZFS as of commit https://github.com/openzfs/zfs/commit/371f88d96fe0aeb46a72fec78f90e1d777493ee5
node_exporter version
$ /usr/bin/prometheus-node-exporter --version
node_exporter, version 0.17.0+ds (branch: debian/sid, revision: 0.17.0+ds-3+b11)
build user: [email protected]
build date: 20190608-10:00:14
go version: go1.11.6
...through to master, current HEAD (3762191e66e3)
node_exporter command line flags
--collector.zfs
What did you do that produced an error?
Upgraded to OpenZFS HEAD https://github.com/openzfs/zfs/commit/HEAD
What did you expect to see?
$ wget -qO - http://server:9100/metrics | grep ^node_zfs_zpool || echo "NOT FOUND"
node_zfs_zpool_nread{zpool="pool"} 2.49897949175808e+14
node_zfs_zpool_nwritten{zpool="pool"} 2.65591973699584e+14
node_zfs_zpool_rcnt{zpool="pool"} 0
node_zfs_zpool_reads{zpool="pool"} 2.729451643e+09
node_zfs_zpool_rlentime{zpool="pool"} 4.728605718663304e+16
node_zfs_zpool_rtime{zpool="pool"} 5.775579362149045e+15
node_zfs_zpool_rupdate{zpool="pool"} 1.3886342123071748e+16
node_zfs_zpool_wcnt{zpool="pool"} 0
node_zfs_zpool_wlentime{zpool="pool"} 2.8960276949205395e+17
node_zfs_zpool_writes{zpool="pool"} 3.436014883e+09
node_zfs_zpool_wtime{zpool="pool"} 1.319409654928622e+15
node_zfs_zpool_wupdate{zpool="pool"} 1.3886342121834806e+16
What did you see instead?
$ wget -qO - http://server:9100/metrics | grep ^node_zfs_zpool || echo "NOT FOUND"
NOT FOUND
Node_exporter uses /proc/spl/kstat/zfs/*/io to collect these stats. This proc file has been removed as of OpenZFS https://github.com/openzfs/zfs/commit/371f88d96fe0aeb46a72fec78f90e1d777493ee5.
See Also: https://github.com/openzfs/zfs/pull/12212
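For anyone unfamiliar with what broke, here is a minimal Go sketch (not the actual node_exporter collector code) of how those per-pool io kstat files were consumed. It assumes the old layout of a column-name line (nread, nwritten, reads, writes, wtime, wlentime, wupdate, rtime, rlentime, rupdate, wcnt, rcnt — the fields behind the node_zfs_zpool_* metrics above) followed by a single line of values:

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parsePoolIOKstat reads one legacy /proc/spl/kstat/zfs/<pool>/io file.
// Assumption: after the kstat header, the file has a column-name line
// starting with "nread" and a single line of values in the same order.
func parsePoolIOKstat(path string) (map[string]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var header []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 {
			continue
		}
		if fields[0] == "nread" { // column-name line
			header = fields
			continue
		}
		if header != nil && len(fields) == len(header) { // value line
			stats := make(map[string]string, len(fields))
			for i, name := range header {
				stats[name] = fields[i]
			}
			return stats, nil
		}
	}
	return nil, fmt.Errorf("%s: no io stats found", path)
}

func main() {
	// One io file per pool; this glob matches nothing on OpenZFS builds
	// that removed the io kstat, which is exactly what breaks here.
	paths, _ := filepath.Glob("/proc/spl/kstat/zfs/*/io")
	for _, p := range paths {
		stats, err := parsePoolIOKstat(p)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		pool := filepath.Base(filepath.Dir(p))
		fmt.Println(pool, "nread:", stats["nread"], "nwritten:", stats["nwritten"])
	}
}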
Ok, looks like someone needs to implement the "more efficient ways ... to obtain detailed statistics".
The "more efficient ways" are probably as done by zpool iostat, which uses a zfs ioctl to pull in at least the nread, nwritten, reads and writes part of the stats (i.e. the stuff I'm most interested in). E.g.:
# zpool iostat -p
                        capacity                         operations                        bandwidth
pool                  alloc             free             read            write             read            write
----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
pool        126300985387008   71611107612672             1646             3625         17295293          8371479
fast          4126006828544    3845452472832               12               92           516551          8286794
----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
(And more stats are available, e.g. -q shows queue stats, -v shows per-vdev stats.)
The question would be how to pull this stuff in:
- run the external command and parse the output (a rough sketch of this option follows the list)
- use the libzfs2 functions as zpool iostat does, e.g. zpool_refresh_stats()
- use the ioctl directly
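For concreteness, here is a rough Go sketch of the first option, parsing the zpool iostat -p output shown above. The column layout is assumed from that sample and this is purely illustrative (see the constraints mentioned further down):

package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// poolIOStat holds the pool-level numbers visible in `zpool iostat -p`.
type poolIOStat struct {
	Alloc, Free                        uint64 // capacity, bytes
	ReadOps, WriteOps, ReadBW, WriteBW uint64 // operations and bandwidth columns as reported
}

// readZpoolIostat shells out to `zpool iostat -p` and parses the table.
// Assumption: the layout matches the sample above (pool name followed by
// six numeric columns); header and separator lines are skipped.
func readZpoolIostat() (map[string]poolIOStat, error) {
	out, err := exec.Command("zpool", "iostat", "-p").Output()
	if err != nil {
		return nil, err
	}
	stats := make(map[string]poolIOStat)
	scanner := bufio.NewScanner(strings.NewReader(string(out)))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 7 || strings.HasPrefix(fields[0], "-") {
			continue // group header or separator line
		}
		var nums [6]uint64
		ok := true
		for i := 0; i < 6; i++ {
			n, err := strconv.ParseUint(fields[i+1], 10, 64)
			if err != nil {
				ok = false // e.g. the "pool alloc free ..." column-header row
				break
			}
			nums[i] = n
		}
		if !ok {
			continue
		}
		stats[fields[0]] = poolIOStat{
			Alloc: nums[0], Free: nums[1],
			ReadOps: nums[2], WriteOps: nums[3],
			ReadBW: nums[4], WriteBW: nums[5],
		}
	}
	return stats, scanner.Err()
}

func main() {
	stats, err := readZpoolIostat()
	if err != nil {
		panic(err)
	}
	for pool, s := range stats {
		fmt.Printf("%s: read %d B, write %d B\n", pool, s.ReadBW, s.WriteBW)
	}
}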
I'm not sure if libzfs2 is supposed to be internal only or meant to be used by external projects.
The other fields in the now-defunct "io" file are from struct kstat_io described here:
https://github.com/openzfs/zfs/blob/master/lib/libspl/include/sys/kstat.h#L582
I don't know which of these are more or less useful to try to get.
Sadly, my knowledge of Go is practically nonexistent, as is the time I have to put towards this.
We don't allow running external commands. Technically using libzfs2 might be an option, but we'd try to avoid linking to C libraries. So ideally we can use the ioctl directly, but dunno how involved that would be.
Looks like it's not straightforward (you've gotta pull apart "nvlists" and... stuff), but someone's started on it: https://github.com/lorenz/go-zfs
There's another "go-zfs" project at https://github.com/mistifyio/go-zfs, but that one appears to be a wrapper around the command-line utilities, so it's not suitable given "We don't allow running external commands".
Huh. And another at https://github.com/bicomsystems/go-libzfs, which links to the libzfs C library, so it's less suitable.
@discordianfish @chrisrd anything new on this? Encountered the same bug after upgrading openzfs to the latest release ...
@discostur Nothing new from me - I'm not in a position to fix this.
In the meantime I'm just monitoring one of the disks in my pool with:
irate(node_disk_read_bytes_total{device="sdf"}[5m])
irate(node_disk_written_bytes_total{device="sdf"}[5m])
This obviously doesn't provide actual pool-wide read/write stats, but it's enough to track how busy the pool is over time.
@discostur No, unfortunately not. Someone should look into using https://github.com/lorenz/go-zfs to get these stats. I might get to that eventually but can't commit to when.
FYI I (the go-zfs author) am working on getting a ZFS exporter up and running. Once I have something we can work on integrating it into node_exporter.
I invite everyone here to give feedback on https://github.com/lorenz/zfs_exporter. As we get consensus on which metrics to expose, how, and under what names, we can start merging this back into node_exporter, but I think we can iterate faster outside.
Awesome! For the node-exporter we're trying not to break existing metrics before the next major release, so ideally the node-exporter version of this shouldn't change the generated metrics for now. Or it could be a new collector that is disabled by default.
Alternatively, I was also thinking about adding versioned metrics endpoint. But I guess that requires some further discussion which I should finally kick off.
Hello,
was this resolved somehow? Is there a new metric/collector available that exposes node_zfs_zpool_*?
Thanks
Just a note, since I'm not sure it was clear from the conversation so far: this is breaking the collection of all zpool metrics, since if the */io proc files aren't found the collector doesn't try any of the others. https://github.com/prometheus/node_exporter/blob/master/collector/zfs_linux.go#L77
As a quick fix, would it be possible to make the exporter skip over the IO metrics if their paths are missing, to make the other metrics work again? 'state' especially would be useful, since zed's state alerts don't fire for some states if you're on zfs < 2.1.3.
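To illustrate the suggestion, a rough Go sketch (hypothetical names, not the actual node_exporter code) of skipping the io kstats when they are absent instead of failing the whole collector:

package collector

import (
	"log"
	"path/filepath"
)

// updateZpoolIOStats is a hypothetical stand-in for the io-kstat handling
// in collector/zfs_linux.go, not the real code. The idea: if the io kstat
// files are absent (newer OpenZFS), skip them instead of returning an
// error, so the remaining zpool kstats (e.g. state) still get collected.
func updateZpoolIOStats(handle func(path string) error) error {
	ioPaths, err := filepath.Glob("/proc/spl/kstat/zfs/*/io")
	if err != nil {
		return err
	}
	if len(ioPaths) == 0 {
		log.Println("no zpool io kstats found, skipping io metrics")
		return nil // don't abort the rest of the ZFS collector
	}
	for _, path := range ioPaths {
		if err := handle(path); err != nil {
			return err
		}
	}
	return nil
}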
@calliecameron That sounds reasonable to me
As a note, Ubuntu 22.04 has a version of ZFS that lacks the old 'io' kstat file for per-pool IO statistics. Anyone upgrading Ubuntu ZFS machines is no longer getting their ZFS stats through the node_exporter, perhaps contrary to what they expect (the stats are certainly important to our operations, and we were looking forward to per-filesystem stats).
I don't agree that #2451 has resolved this issue: it fixes the case where the missing zpool IO metrics would cause the rest of the (otherwise available) zpool metrics to be unavailable, but the original issue was specifically that the zpool IO metrics are no longer available (due to the 'io' kstat file being defunct). These metrics ARE available via a system call. The solution should be to use the system call to make the zpool IO metrics available once again.
Unfortunately, the ZFS IO stats that are now available have changed significantly, such that some of the metrics simply can't be generated any more; there is no drop-in replacement system call that could be used to get the traditional metrics (and no way to synthetically generate all of them). An illustration of what's available in the 'new' ZFS IO stats (and implicitly what's missing) can be found in e.g. siebenmann/zfs_exporter (which we currently use in production). There are enough metrics and enough choices involved that I'm far from confident that they should all be shoved into the node_exporter.
(Some of the 'new' ZFS IO stats are actually present fairly far back, because they're the information 'zpool iostat' will present. The old pool-level IO stats were a separate system that OpenZFS eliminated for reasons; see https://github.com/openzfs/zfs/pull/12212 )
Thanks @siebenmann, I'll certainly take a look at siebenmann/zfs_exporter.
For my use-case, I don't specifically need "the same as before" IO stats, just some stats that allow monitoring the pool level IO, so your zfs_exporter should certainly do that.
I still think this ticket should remain open until the 'new' IO pool level stats (such as available via zpool iostat -p) are available in node_exporter.
@SuperQ / @discordianfish please consider reopening; hiding the errors does not solve the issue described here.