node_exporter icon indicating copy to clipboard operation
node_exporter copied to clipboard

collector/filesystem: Handle `Statfs_t` overflows

Open rexagod opened this issue 1 year ago • 1 comments

Handle cases where, owing to multiplying two uint64 integers and typecasting it to float64, the overall precision is lost when the values concerned exceed the floatMantissa64 (1 << 53) before or after the operation (which is well within the acceptable uint64 range).

Fixes: #1672

rexagod avatar Mar 19 '24 20:03 rexagod

If you are going to make use of math/big, wouldn't it make more sense to handle the number as big.Int (since they are originally integer types), and only convert to float64 at the end to hand off to the channel? I don't really see the point in prematurely casting them to big.Float, since int(a) * int(b) will by definition produce an int.

dswarbrick avatar May 16 '24 18:05 dswarbrick

Feel free to correct me if I'm missing something, but I believe we are casting them right before returning? Also, except Bsize, all Statfs_t fields being used here are uint types (not int).

Additionally, int(a) * int(b) is also subject to under-or-over-flow, so casting needs to be done before the operation happens, otherwise the int(c) produced as a result of implicit casting may not be the expected value.

rexagod avatar May 20 '24 10:05 rexagod

@rexagod Yes, I'm aware that int(a) * int(b) (in the computer programming sense) can under/overflow. I meant more in the mathematical sense - a whole number multiplied by a whole number will always produce a whole number result.

Since big.Int can handle integers of arbitrary size, it is safe to multiply them without fear of under/overflow. Hence you do not need to prematurely cast to big.Float as you do here:

size, _ := new(big.Float).SetUint64(buf.Blocks).Mul(new(big.Float).SetUint64(buf.Blocks), new(big.Float).SetInt64(int64(buf.Bsize))).Float64()

This can be simplified to:

size, _ := new(big.Int).Mul(new(big.Int).SetUint64(buf.Blocks), new(big.Int).SetInt64(buf.Bsize)).Float64()

Keep things simple during the calculation, and only cast to float64 at the end for Prometheus.

dswarbrick avatar May 20 '24 20:05 dswarbrick

I see your point. Even though we explicitly .Float64() in both of the cases, the latter ensures that precision isn't lost owing to the premature type-casting to float64 (since big.Int isn't subject to rounding errors).

rexagod avatar May 21 '24 01:05 rexagod

Ah, I see that unix.Statfs_t.Bsize is either int32 or int64, depending on the host arch - so you will at least need to explicitly cast that to int64 for SetInt64() to be happy.

This brings me to my second point though - why does this PR only address 64-bit archs, i.e. bits.UintSize == 64? If this can overflow on 64-bit, surely it's just as likely (if not more likely) to overflow on 32-bit?

TBH, I'm not entirely convinced that the original issue #1672 was caused by an overflow. The method float64(buf.Blocks) * float64(buf.Bsize), despite some premature rounding when buf.Blocks exceeds the float64 mantissa, is not going to result in orders-of-magnitude different values when the available / free bytes are calculated in the same manner.

The original issue stated:

The size in bytes is 879510155264 which is accurate but the avail bytes is so much larger the scale makes size in bytes look near zero.

We only get a Grafana screenshot in the issue. It would have been a lot more helpful to have the raw metrics, so we could perhaps establish some relationship between the values, i.e. verify whether some value was in the ballpark of a wraparound. Even more helpful would have been the output of stat -f /tmp, as this would be as close as possible to the raw counters that the Go unix.Fstatfs was returning.

The fact remains that even a version as old as the one mentioned in the issue (v1.0.0-rc.0) used the float64(buf.Bavail) * float64(buf.Bsize) method, and to overflow this, it would require such a colossally large filesystem that I don't think the human race currently has the technology to achieve it. In fact, even if buf.Bsize was 2^63-1 and buf.Bavail was 2^64-1 (i.e., their maximum possible values according to unix.Statfs_t), the product would only 9.22e+18 - a long way short of MaxFloat64 (approx. 1.7e+308).

Since I cannot see how node_exporter could have produced a value approaching 8e+22 as shown in the Grafana screenshot, my gut feeling is that the promql must have included some additional multiplier, or perhaps the panel series unit was set incorrectly, resulting in Grafana performing multiplication by an additional factor.

Using big.Int to do the multiplication is of course going to preserve precision during the multiplication and only round (potentially) at the final float64 cast. But if that method is adopted, I think it should be used regardless of the host's Uintsize.

dswarbrick avatar May 21 '24 01:05 dswarbrick

I gave this patch another thought, and while we could get something in to address the possible, but not probable over-or-under-flow wraparound that could have caused this, or atleast what was my only conclusion from the data provided in the original issue, however unlikely, since we do see the raw metrics in the panel exhibiting values that are not at all expected, I'm convinced we should defer merging this in favor of being provided the requested information or some reproducibility that would help assess the exact cause of this issue with more confidence.

cc @treydock

rexagod avatar May 21 '24 07:05 rexagod