node_exporter
ZFS: io stats removed
Host operating system:
OpenZFS as of commit https://github.com/openzfs/zfs/commit/371f88d96fe0aeb46a72fec78f90e1d777493ee5
node_exporter version
$ /usr/bin/prometheus-node-exporter --version
node_exporter, version 0.17.0+ds (branch: debian/sid, revision: 0.17.0+ds-3+b11)
build user: [email protected]
build date: 20190608-10:00:14
go version: go1.11.6
...through to master, current HEAD (3762191e66e3)
node_exporter command line flags
--collector.zfs
What did you do that produced an error?
Upgraded to OpenZFS HEAD https://github.com/openzfs/zfs/commit/HEAD
What did you expect to see?
$ wget -qO - http://server:9100/metrics | grep ^node_zfs_zpool || echo "NOT FOUND"
node_zfs_zpool_nread{zpool="pool"} 2.49897949175808e+14
node_zfs_zpool_nwritten{zpool="pool"} 2.65591973699584e+14
node_zfs_zpool_rcnt{zpool="pool"} 0
node_zfs_zpool_reads{zpool="pool"} 2.729451643e+09
node_zfs_zpool_rlentime{zpool="pool"} 4.728605718663304e+16
node_zfs_zpool_rtime{zpool="pool"} 5.775579362149045e+15
node_zfs_zpool_rupdate{zpool="pool"} 1.3886342123071748e+16
node_zfs_zpool_wcnt{zpool="pool"} 0
node_zfs_zpool_wlentime{zpool="pool"} 2.8960276949205395e+17
node_zfs_zpool_writes{zpool="pool"} 3.436014883e+09
node_zfs_zpool_wtime{zpool="pool"} 1.319409654928622e+15
node_zfs_zpool_wupdate{zpool="pool"} 1.3886342121834806e+16
What did you see instead?
$ wget -qO - http://server:9100/metrics | grep ^node_zfs_zpool || echo "NOT FOUND"
NOT FOUND
Node_exporter uses /proc/spl/kstat/zfs/*/io to collect these stats. This proc file has been removed as of OpenZFS https://github.com/openzfs/zfs/commit/371f88d96fe0aeb46a72fec78f90e1d777493ee5.
See Also: https://github.com/openzfs/zfs/pull/12212
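For anyone unfamiliar with what broke, here is a minimal Go sketch (not the actual node_exporter collector code) of how those per-pool io kstat files were consumed. It assumes the old layout of a column-name line (nread, nwritten, reads, writes, wtime, wlentime, wupdate, rtime, rlentime, rupdate, wcnt, rcnt — the fields behind the node_zfs_zpool_* metrics above) followed by a single line of values:

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parsePoolIOKstat reads one legacy /proc/spl/kstat/zfs/<pool>/io file.
// Assumption: after the kstat header, the file has a column-name line
// starting with "nread" and a single line of values in the same order.
func parsePoolIOKstat(path string) (map[string]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var header []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 {
			continue
		}
		if fields[0] == "nread" { // column-name line
			header = fields
			continue
		}
		if header != nil && len(fields) == len(header) { // value line
			stats := make(map[string]string, len(fields))
			for i, name := range header {
				stats[name] = fields[i]
			}
			return stats, nil
		}
	}
	return nil, fmt.Errorf("%s: no io stats found", path)
}

func main() {
	// One io file per pool; this glob matches nothing on OpenZFS builds
	// that removed the io kstat, which is exactly what breaks here.
	paths, _ := filepath.Glob("/proc/spl/kstat/zfs/*/io")
	for _, p := range paths {
		stats, err := parsePoolIOKstat(p)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		pool := filepath.Base(filepath.Dir(p))
		fmt.Println(pool, "nread:", stats["nread"], "nwritten:", stats["nwritten"])
	}
}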
Ok, looks like someone needs to implement the "more efficient ways ... to obtain detailed statistics".
The "more efficient ways" are probably as done by zpool iostat, which uses a zfs ioctl to pull in at least the nread, nwritten, reads and writes part of the stats (i.e. the stuff I'm most interested in). E.g.:
# zpool iostat -p
                        capacity                         operations                        bandwidth
pool                  alloc             free             read            write             read            write
----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
pool        126300985387008   71611107612672             1646             3625         17295293          8371479
fast          4126006828544    3845452472832               12               92           516551          8286794
----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
(And more stats are available, e.g. -q shows queue stats, -v shows per-vdev stats.)
The question would be how to pull this stuff in:
- run the external command and parse the output (a rough sketch of this option follows the list)
- use the libzfs2 functions as zpool iostat does, e.g. zpool_refresh_stats()
- use the ioctl directly
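For concreteness, here is a rough Go sketch of the first option, parsing the zpool iostat -p output shown above. The column layout is assumed from that sample and this is purely illustrative (see the constraints mentioned further down):

package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// poolIOStat holds the pool-level numbers visible in `zpool iostat -p`.
type poolIOStat struct {
	Alloc, Free                        uint64 // capacity, bytes
	ReadOps, WriteOps, ReadBW, WriteBW uint64 // operations and bandwidth columns as reported
}

// readZpoolIostat shells out to `zpool iostat -p` and parses the table.
// Assumption: the layout matches the sample above (pool name followed by
// six numeric columns); header and separator lines are skipped.
func readZpoolIostat() (map[string]poolIOStat, error) {
	out, err := exec.Command("zpool", "iostat", "-p").Output()
	if err != nil {
		return nil, err
	}
	stats := make(map[string]poolIOStat)
	scanner := bufio.NewScanner(strings.NewReader(string(out)))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 7 || strings.HasPrefix(fields[0], "-") {
			continue // group header or separator line
		}
		var nums [6]uint64
		ok := true
		for i := 0; i < 6; i++ {
			n, err := strconv.ParseUint(fields[i+1], 10, 64)
			if err != nil {
				ok = false // e.g. the "pool alloc free ..." column-header row
				break
			}
			nums[i] = n
		}
		if !ok {
			continue
		}
		stats[fields[0]] = poolIOStat{
			Alloc: nums[0], Free: nums[1],
			ReadOps: nums[2], WriteOps: nums[3],
			ReadBW: nums[4], WriteBW: nums[5],
		}
	}
	return stats, scanner.Err()
}

func main() {
	stats, err := readZpoolIostat()
	if err != nil {
		panic(err)
	}
	for pool, s := range stats {
		fmt.Printf("%s: read %d B, write %d B\n", pool, s.ReadBW, s.WriteBW)
	}
}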
I'm not sure if libzfs2 is supposed to be internal only or meant to be used by external projects.
The other fields in the now-defunct "io" file are from struct kstat_io described here:
https://github.com/openzfs/zfs/blob/master/lib/libspl/include/sys/kstat.h#L582
I don't know which of these are more or less useful to try to get.
Sadly, my knowledge of Go is practically nonexistent, as is the time I have to put towards this.
We don't allow running external commands. Technically using libzfs2 might be an option, but we'd try to avoid linking to C libraries. So ideally we can use the ioctl directly, but dunno how involved that would be.
Looks like it's not straightforward (you've gotta pull apart "nvlists" and... stuff), but someone's started on it: https://github.com/lorenz/go-zfs
There's another "go-zfs" project at https://github.com/mistifyio/go-zfs, but that one appears to be a wrapper around the command-line utilities, so it's not suitable given "We don't allow running external commands".
Huh. And another at https://github.com/bicomsystems/go-libzfs, which links to the libzfs C library, so it's less suitable.
@discordianfish @chrisrd anything new on this? Encountered the same bug after upgrading openzfs to the latest release ...
@discostur Nothing new from me - I'm not in a position to fix this.
In the meantime I'm just monitoring one of the disks in my pool with:
irate(node_disk_read_bytes_total{device="sdf"}[5m])
irate(node_disk_written_bytes_total{device="sdf"}[5m])
This obviously doesn't provide actual pool-wide read/write stats, but it's enough to track how busy the pool is over time.
@discostur No, unfortunately not. Someone should look into using https://github.com/lorenz/go-zfs to get these stats. I might get to that eventually but can't commit to when.
FYI I (the go-zfs author) am working on getting a ZFS exporter up and running. Once I have something we can work on integrating it into node_exporter.
I invite everyone here to give feedback on https://github.com/lorenz/zfs_exporter. As we get consensus on which metrics to expose, how, and under what names, we can start merging this back into node_exporter, but I think we can iterate faster outside.
Awesome! For the node-exporter we're trying not to break existing metrics before the next major release, so ideally the node-exporter version of this shouldn't change the generated metrics for now. Or it could be a new collector that is disabled by default.
Alternatively, I was also thinking about adding versioned metrics endpoint. But I guess that requires some further discussion which I should finally kick off.
Hello,
was this resolved somehow? Is there a new metric/collector available that exposes node_zfs_zpool_*?
Thanks
Just a note, since I'm not sure it was clear from the conversation so far: this is breaking the collection of all zpool metrics, since if the */io proc files aren't found the collector doesn't try any of the others. https://github.com/prometheus/node_exporter/blob/master/collector/zfs_linux.go#L77
As a quick fix, would it be possible to make the exporter skip over the IO metrics if their paths are missing, to make the other metrics work again? 'state' especially would be useful, since zed's state alerts don't fire for some states if you're on zfs < 2.1.3.
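To illustrate the suggestion, a rough Go sketch (hypothetical names, not the actual node_exporter code) of skipping the io kstats when they are absent instead of failing the whole collector:

package collector

import (
	"log"
	"path/filepath"
)

// updateZpoolIOStats is a hypothetical stand-in for the io-kstat handling
// in collector/zfs_linux.go, not the real code. The idea: if the io kstat
// files are absent (newer OpenZFS), skip them instead of returning an
// error, so the remaining zpool kstats (e.g. state) still get collected.
func updateZpoolIOStats(handle func(path string) error) error {
	ioPaths, err := filepath.Glob("/proc/spl/kstat/zfs/*/io")
	if err != nil {
		return err
	}
	if len(ioPaths) == 0 {
		log.Println("no zpool io kstats found, skipping io metrics")
		return nil // don't abort the rest of the ZFS collector
	}
	for _, path := range ioPaths {
		if err := handle(path); err != nil {
			return err
		}
	}
	return nil
}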
@calliecameron That sounds reasonable to me
As a note, Ubuntu 22.04 has a version of ZFS that lacks the old 'io' kstat file for per-pool IO statistics. Anyone upgrading Ubuntu ZFS machines is no longer getting their ZFS stats through the node_exporter, perhaps contrary to what they expect (the stats are certainly important to our operations, and we were looking forward to per-filesystem stats).
I don't agree that #2451 has resolved this issue: it fixes the case where the missing zpool IO metrics would cause the rest of the (otherwise available) zpool metrics to be unavailable, but the original issue was specifically that the zpool IO metrics are no longer available (due to the 'io' kstat file being defunct). These metrics ARE available via a system call. The solution should be to use the system call to make the zpool IO metrics available once again.
Unfortunately, the ZFS IO stats that are now available have changed significantly, such that some of the metrics simply can't be generated any more; there is no drop-in replacement system call that could be used to get the traditional metrics (and no way to synthetically generate all of them). An illustration of what's available in the 'new' ZFS IO stats (and implicitly what's missing) can be found in e.g. siebenmann/zfs_exporter (which we currently use in production). There are enough metrics and enough choices involved that I'm far from confident that they should all be shoved into the node_exporter.
(Some of the 'new' ZFS IO stats are actually present fairly far back, because they're the information 'zpool iostat' will present. The old pool-level IO stats were a separate system that OpenZFS eliminated for reasons; see https://github.com/openzfs/zfs/pull/12212 )
Thanks @siebenmann, I'll certainly take a look at siebenmann/zfs_exporter.
For my use-case, I don't specifically need "the same as before" IO stats, just some stats that allow monitoring the pool level IO, so your zfs_exporter should certainly do that.
I still think this ticket should remain open until the 'new' IO pool level stats (such as available via zpool iostat -p) are available in node_exporter.
@SuperQ / @discordianfish please consider reopening; hiding the errors does not solve the issue described here.