prom-client icon indicating copy to clipboard operation
prom-client copied to clipboard

feat: add eventloop utilization default metric

Open ivan-tymoshenko opened this issue 2 years ago • 16 comments

Close #506

ivan-tymoshenko avatar Oct 04 '22 13:10 ivan-tymoshenko

@Eomm, I find you as a contributor of fastify-elu-scaler. If you have some time and interested, PTAL.

ivan-tymoshenko avatar Oct 04 '22 13:10 ivan-tymoshenko

To measure ELU, you need to get the first utilization value, wait for some timeout, get the second utilization value and then count the "difference". The are at least two ways to implement this.

  1. Implementation that you can see in this PR. It's pretty straightforward. When you call the collect method, it starts the measurement, waits for a timeout, and returns the result. It is an async metric, which means you will have to wait for the timeout to get the result, and because all other default metrics are sync, it might be confusing if you set up a big timeout.

  2. You can find this implementation here. It gets the elu metric all the time in setInterval. That means I can get the last measurement in a sync way. But it adds some side effect behavior (timeout) and makes observations without an explicit call of the collect function.

I don't know which way is more acceptable from user perspective.

ivan-tymoshenko avatar Oct 05 '22 10:10 ivan-tymoshenko

Thank you both for the PR and review.

@trevnorris you wrote a nice blog post on this (and I think did a lot of the impl?), any recommendation whether we should use setInterval to measure continuously or setTimeout to only measure when probed? One possible difference is that on-probe will be biased by other prom-client collectors.

zbjornson avatar Oct 05 '22 16:10 zbjornson

@zbjornson I'd recommend using a setInterval. The ELU is meant to be recorded as regular time series data, the same as CPU.

trevnorris avatar Oct 10 '22 18:10 trevnorris

@zbjornson Does summary sliding window work for sum and count parameters? It looks like not. How I can count the average value for some time period?

ivan-tymoshenko avatar Oct 11 '22 09:10 ivan-tymoshenko

@zbjornson what's the status on getting this merged? thanks!

zeldrinn avatar Dec 05 '22 19:12 zeldrinn

Hey @zbjornson @ivan-tymoshenko is this PR blocked? Anything we can do to help get this through? 🙏🏽

johnytiago avatar Oct 16 '23 11:10 johnytiago

@glensc @SimenB Is there anything we can do to help push this over the finish line?

kibertoad avatar Jan 02 '24 14:01 kibertoad

I don't remember why it was blocked. I will take a look this week.

ivan-tymoshenko avatar Jan 02 '24 17:01 ivan-tymoshenko

@trevnorris @glensc @SimenB @johnytiago Please take a look if this implementation makes sence and works for you.

ivan-tymoshenko avatar Jan 10 '24 13:01 ivan-tymoshenko

Hey, looking forward to seeing this merged! 🚀

I've been running my own implementation of this metric, using an interval that updates a guage, with a period of 1 second. You can see this chart is still very noisy, so IMO your default period of 100ms might be too small.

Screenshot 2024-01-12 at 2 23 38 PM

And also, this code I think could give incorrect data:

  • It could sample at the wrong frequency, if timeMs / intervalTimeout is stable but not a whole number. E.g. if it's consistently 1.4 it would sample too slowly and if it's consistently 1.6 it would sample too often.
  • It could mistakenly report a small utilization value multiple times if a long blocking event happens near the end of an interval period. E.g. if it's idle for 99ms, then blocks for 101ms, you'd report [0.505, 0.505] when you should report [0.01, 1]

I'd appreciate it if you could also expose a guage version of this metric as well as the histogram, like how eventLoopLag does.

const blockedIntervalsNumber = Math.round(timeMs / intervalTimeout);
for (let i = 0; i < blockedIntervalsNumber; i++) {
	summary.observe(value);
	histogram.observe(value);
}

simon-paris avatar Jan 12 '24 05:01 simon-paris

I've been running my own implementation of this metric, using an interval that updates a guage, with a period of 1 second. You can see this chart is still very noisy, so IMO your default period of 100ms might be too small.

I don't have much experience measuring elu, but I can see in articles/examples that people use a relatively small timeout for measuring elu around 50-100ms. Maybe @mcollina @trevnorris can help here. https://nodesource.com/blog/event-loop-utilization-nodejs/ https://github.com/nearform/fastify-elu-scaler/blob/42efb4bf84ed1ffe7373e5f0d1ede7c92c2a7683/plugins/elu.js#L26

It could sample at the wrong frequency, if timeMs / intervalTimeout is stable but not a whole number. E.g. if it's consistently 1.4 it would sample too slowly and if it's consistently 1.6 it would sample too often.

The only posible situattion when timeMs > intervalTimeout is when event loop completelly blocked for timeMs and timeMs is bigger than intervalTimeout. In this case elu metrics equal 1. Of cource rounding this value is an approximation, but I simply don't see a better way to cover this case.

I'd appreciate it if you could also expose a guage version of this metric as well as the histogram, like how eventLoopLag does.

The main question for me here is when we should start and stop measuring elu in this case.

ivan-tymoshenko avatar Jan 12 '24 13:01 ivan-tymoshenko

I don't have much experience measuring elu, but I can see in articles/examples that people use a relatively small timeout for measuring elu around 50-100ms. Maybe @mcollina @trevnorris can help here.

You're right, I tried it out at 100ms and it looks good.

The only posible situattion when timeMs > intervalTimeout is when event loop completelly blocked for timeMs and timeMs is bigger than intervalTimeout. In this case elu metrics equal 1.

Here's some repro code that causes it to output small values multiple times. It happens when you've got blocking code in a timeout callback, it doesn't happen when it's in an io callback.

setInterval(() => {
  setTimeout(() => {
    const t1 = Date.now();
    while (Date.now() < t1 + 110) {}
  }, 50);
}, 100);

simon-paris avatar Jan 15 '24 05:01 simon-paris

Here's some repro code that causes it to output small values multiple times. It happens when you've got blocking code in a timeout callback, it doesn't happen when it's in an io callback.

I understand. If you have a suggestion on how to measure elu in a more accurate way, you are welcome.

ivan-tymoshenko avatar Jan 15 '24 11:01 ivan-tymoshenko

@ivan-tymoshenko is this mostly complete and just need conflicts resolved, or something is still missing?

kibertoad avatar Apr 02 '24 07:04 kibertoad

@ivan-tymoshenko Just want to make sure it's understood, ELU as measured by Node isn't approximated. I patched libuv so it tracks it down to the system call. The interval length you choose to get the ELU won't change what it returns. It's always precise.

trevnorris avatar Apr 19 '24 23:04 trevnorris