prom-client
prom-client copied to clipboard
feat: add eventloop utilization default metric
Close #506
@Eomm, I find you as a contributor of fastify-elu-scaler. If you have some time and interested, PTAL.
To measure ELU, you need to get the first utilization value, wait for some timeout, get the second utilization value and then count the "difference". The are at least two ways to implement this.
-
Implementation that you can see in this PR. It's pretty straightforward. When you call the
collect
method, it starts the measurement, waits for a timeout, and returns the result. It is an async metric, which means you will have to wait for the timeout to get the result, and because all other default metrics are sync, it might be confusing if you set up a big timeout. -
You can find this implementation here. It gets the elu metric all the time in
setInterval
. That means I can get the last measurement in a sync way. But it adds some side effect behavior (timeout) and makes observations without an explicit call of thecollect
function.
I don't know which way is more acceptable from user perspective.
Thank you both for the PR and review.
@trevnorris you wrote a nice blog post on this (and I think did a lot of the impl?), any recommendation whether we should use setInterval
to measure continuously or setTimeout
to only measure when probed? One possible difference is that on-probe will be biased by other prom-client collectors.
@zbjornson I'd recommend using a setInterval
. The ELU is meant to be recorded as regular time series data, the same as CPU.
@zbjornson Does summary sliding window work for sum and count parameters? It looks like not. How I can count the average value for some time period?
@zbjornson what's the status on getting this merged? thanks!
Hey @zbjornson @ivan-tymoshenko is this PR blocked? Anything we can do to help get this through? 🙏🏽
@glensc @SimenB Is there anything we can do to help push this over the finish line?
I don't remember why it was blocked. I will take a look this week.
@trevnorris @glensc @SimenB @johnytiago Please take a look if this implementation makes sence and works for you.
Hey, looking forward to seeing this merged! 🚀
I've been running my own implementation of this metric, using an interval that updates a guage, with a period of 1 second. You can see this chart is still very noisy, so IMO your default period of 100ms might be too small.
And also, this code I think could give incorrect data:
- It could sample at the wrong frequency, if
timeMs / intervalTimeout
is stable but not a whole number. E.g. if it's consistently 1.4 it would sample too slowly and if it's consistently 1.6 it would sample too often. - It could mistakenly report a small utilization value multiple times if a long blocking event happens near the end of an interval period. E.g. if it's idle for 99ms, then blocks for 101ms, you'd report [0.505, 0.505] when you should report [0.01, 1]
I'd appreciate it if you could also expose a guage version of this metric as well as the histogram, like how eventLoopLag does.
const blockedIntervalsNumber = Math.round(timeMs / intervalTimeout);
for (let i = 0; i < blockedIntervalsNumber; i++) {
summary.observe(value);
histogram.observe(value);
}
I've been running my own implementation of this metric, using an interval that updates a guage, with a period of 1 second. You can see this chart is still very noisy, so IMO your default period of 100ms might be too small.
I don't have much experience measuring elu, but I can see in articles/examples that people use a relatively small timeout for measuring elu around 50-100ms. Maybe @mcollina @trevnorris can help here. https://nodesource.com/blog/event-loop-utilization-nodejs/ https://github.com/nearform/fastify-elu-scaler/blob/42efb4bf84ed1ffe7373e5f0d1ede7c92c2a7683/plugins/elu.js#L26
It could sample at the wrong frequency, if timeMs / intervalTimeout is stable but not a whole number. E.g. if it's consistently 1.4 it would sample too slowly and if it's consistently 1.6 it would sample too often.
The only posible situattion when timeMs > intervalTimeout
is when event loop completelly blocked for timeMs
and timeMs
is bigger than intervalTimeout
. In this case elu
metrics equal 1
. Of cource rounding this value is an approximation, but I simply don't see a better way to cover this case.
I'd appreciate it if you could also expose a guage version of this metric as well as the histogram, like how eventLoopLag does.
The main question for me here is when we should start and stop measuring elu in this case.
I don't have much experience measuring elu, but I can see in articles/examples that people use a relatively small timeout for measuring elu around 50-100ms. Maybe @mcollina @trevnorris can help here.
You're right, I tried it out at 100ms and it looks good.
The only posible situattion when timeMs > intervalTimeout is when event loop completelly blocked for timeMs and timeMs is bigger than intervalTimeout. In this case elu metrics equal 1.
Here's some repro code that causes it to output small values multiple times. It happens when you've got blocking code in a timeout callback, it doesn't happen when it's in an io callback.
setInterval(() => {
setTimeout(() => {
const t1 = Date.now();
while (Date.now() < t1 + 110) {}
}, 50);
}, 100);
Here's some repro code that causes it to output small values multiple times. It happens when you've got blocking code in a timeout callback, it doesn't happen when it's in an io callback.
I understand. If you have a suggestion on how to measure elu in a more accurate way, you are welcome.
@ivan-tymoshenko is this mostly complete and just need conflicts resolved, or something is still missing?
@ivan-tymoshenko Just want to make sure it's understood, ELU as measured by Node isn't approximated. I patched libuv so it tracks it down to the system call. The interval length you choose to get the ELU won't change what it returns. It's always precise.