node_exporter icon indicating copy to clipboard operation
node_exporter copied to clipboard

Occasional ZFS collector errors on kstat.zfs.misc.arcstats.memory_available_bytes

Open siebenmann opened this issue 2 years ago • 5 comments

Periodically I see 'cannot parse expected integer value' errors logged for the ZFS collector trying to collect memory_available_bytes, which comes from /proc/spl/kstat/zfs/arcstat. Support for this was added to deal with #2656, because this kstat is a type 3 (signed int64) instead of the more usual type 4 (unsigned int64).

Looking at the code involved, I believe this is most likely happening when memory_available_bytes goes negative, which it can; this possibility is why it's a signed int64 kstat instead of an unsigned in64. The current code in collector/zfs_linux.go parses both kstatDataUint64 and kstatDataInt64 data with strconv.ParseUint(), which is not going to accept negative numbers.

As a small RFE, it would be nice if this specific error message logged either or both of the specific error from ParseUint() and the actual value that failed to parse. As it is, I'm guessing at the actual problem involved from my knowledge of the range of ``memory_available_bytes` and the code.

Host operating system: output of uname -a

Linux hawkwind.cs 6.4.6-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jul 24 20:51:12 UTC 2023 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.6.1 (branch: HEAD, revision: 4a1b77600c1873a8233f3ffb55afcedbb63b8d84)

node_exporter log output

ts=2023-08-01T14:58:27.022Z caller=collector.go:169 level=error msg="collector failed" name=zfs duration_seconds=0.098266305 err="could not parse expected integer value for \"kstat.zfs.misc.arcstats.memory_available_bytes\""

Are you running node_exporter in Docker?

No.

siebenmann avatar Aug 01 '23 18:08 siebenmann

The uint64 types seem to run fairly deep in zfs_linux.go. The parseProcfsFile function, which is used to parse other files under /proc/spl/zfs, expects to be passed a handler func(zfsSysctl, uint64). It seems that it was designed from the ground up to only support unsigned ints.

The obvious solution would be to change the handler function signature to func(zfsSysctl, float64) so that it can handle both positive and negative values. I don't think we would lose anything by casting to float64 early.

The TestArcstatsParsing looks pretty superficial, and actually only verifies the value of kstat.zfs.misc.arcstats.hits. This looks like it could be made a bit more thorough. The test fixtures look outdated too, as they don't even have a line for memory_available_bytes (or a few other memory_* items).

dswarbrick avatar Aug 03 '23 12:08 dswarbrick

Why float64 not int64?

discordianfish avatar Aug 08 '23 12:08 discordianfish

Why float64 not int64?

Because the kernel uses uint64_t for some values, which would potentially overflow an int64.

dswarbrick avatar Aug 08 '23 12:08 dswarbrick

float64 is only accurate for integers constrained within it's mantissa range (52-bit). Expressing an integer that goes beyond (2^52, or 2^53 if we consider the implicit additional bit) will lead to loss of precision (rounding off).

rexagod avatar Mar 18 '24 21:03 rexagod

float64 is only accurate for integers constrained within it's mantissa range (52-bit). Expressing an integer that goes beyond (2^52, or 2^53 if we consider the implicit additional bit) will lead to loss of precision (rounding off).

This is true, and it is why snmp_exporter wraps Counter64 values at the float64 mantissa 2^53. Perhaps this should be a general approach for all 64-bit counters.

dswarbrick avatar Mar 18 '24 22:03 dswarbrick