
Value in latency histogram is larger than latency max value, wrong computation or wrong interpretation?

Open · git4ghw opened this issue 2 years ago · 1 comment

For some latency values, the max value is lower than the maximum value in the histogram:

Files latency max = 5.30 ms, but the largest value in the Files lat hist = 5.792 ms.

The same applies to the IO latency values: max IO latency = 196 us, but the largest value in the IO lat hist = 216 us.

Do I misinterpret the values, or is there something wrong with the computation of the values?

      Files latency    : [ min=405us avg=2.10ms max=5.30ms ]
      Files lat % us   : [ 1%<=430 50%<=862 75%<=4096 99%<=5792 ]
      Files lat hist   : [ 430: 1, 512: 1, 608: 1, 724: 1, 862: 1, 1024: 1, 4096: 2, 4870: 1, 5792: 1 ]
      IO latency       : [ min=6us avg=24us max=196us ]
      IO lat % us      : [ 1%<=6.8 50%<=12 75%<=22 99%<=216 ]
      IO lat hist      : [ 6.8: 9, 8.0: 21, 9.6: 14, 12: 18, 14: 5, 16: 5, 20: 2, 22: 3, 26: 6, 46: 1, 64: 2, 76: 6, 90: 1, 108: 4, 128: 1, 216: 2 ]

Assuming that the histogram values are really measured values, this would mean that some or all max values are wrong, and design decisions based on the max value could be incorrect: latency-sensitive long-distance architectures could be off by a statistically significant amount, even when measured with a very high number of samples. For the IO values (a very small sample), the deviation is ~10%.

git4ghw · May 16 '23 05:05

Thanks for reporting this, @git4ghw. This makes me aware that the background for this is not explained in the built-in help, so at the very least I need to update the help.

Generally, the latency histogram uses "buckets" (each bucket representing a latency range) to count the number of IOs that fall within a certain range. Since it's not known upfront how fast or slow the tested system is, the histogram has to cover a fairly wide range, and a high resolution across that whole range would require too many buckets. That's why the implemented approach uses decreasing resolution (wider buckets) at higher latencies: the bucket upper boundaries are calculated as 2^n microseconds, where n starts at 0.25 and increases by 0.25 for each higher latency bucket.

In your example, the max latency was 5300 microseconds. log2(5300) is 12.37. This means it got into the histogram bucket that holds the range 2^12.25 - 2^12.5, or in other words the bucket that holds the range 4.87ms to 5.79ms.
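For illustration, here is a minimal standalone sketch (not elbencho's actual code) of that mapping: it rounds a latency in microseconds up to the next bucket boundary of the form 2^n with n in steps of 0.25, reproducing the 5.79ms and ~215us boundaries from the output above.

```cpp
// Minimal sketch, not elbencho's actual implementation: round a latency in
// microseconds up to its histogram bucket's upper boundary, assuming bucket
// boundaries of the form 2^n microseconds with n increasing in steps of 0.25.
#include <cmath>
#include <cstdio>

double bucketUpperBoundMicroSec(double latencyMicroSec)
{
    // smallest n that is a multiple of 0.25 and satisfies 2^n >= latency
    double n = std::ceil(std::log2(latencyMicroSec) * 4.0) / 4.0;
    return std::pow(2.0, n);
}

int main()
{
    // 5300us (5.30ms) -> 2^12.5 ~= 5793us, matching the "5792" hist column
    std::printf("%.0f us\n", bucketUpperBoundMicroSec(5300.0));
    // 196us -> 2^7.75 ~= 215us, matching the "216" hist column (up to rounding)
    std::printf("%.0f us\n", bucketUpperBoundMicroSec(196.0));
    return 0;
}
```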

However, I'm not entirely happy with this approach, because it's rather inconvenient for humans to calculate the ranges. That could be addressed by explicitly printing the ranges in the output, but then the histogram line would get correspondingly longer. Alternatively, I could think about a new approach that meets your suggestion of not exceeding 10% inaccuracy.
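As one hypothetical illustration of such an approach (purely a sketch, not a commitment to any particular design): if the bucket boundaries grew geometrically by a factor of 1.10 instead of 2^0.25 ≈ 1.19, then rounding a latency up to its bucket's upper boundary would overstate it by less than 10%, at the cost of roughly twice as many buckets over the same latency range.

```cpp
// Hypothetical alternative (illustration only): geometric bucket boundaries
// with a growth factor of 1.10, so the reported bucket boundary overstates
// the true latency by less than 10%.
#include <cmath>
#include <cstdio>

double bucketUpperBound10Pct(double latencyMicroSec)
{
    const double growthFactor = 1.10; // < 10% overstatement per bucket
    double n = std::ceil(std::log(latencyMicroSec) / std::log(growthFactor));
    return std::pow(growthFactor, n);
}

int main()
{
    // 5300us would be reported as ~5313us instead of 5792us
    std::printf("%.0f us\n", bucketUpperBound10Pct(5300.0));
    return 0;
}
```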

And of course I'm open to suggestions if you want to make any.

breuner · May 18 '23 21:05