
kepler node_package does not equal total of kepler_process_package

Open sthaha opened this issue 1 year ago • 2 comments

Steps to reproduce on a Baremetal

  • deploy kepler
  • curl /metrics
  • grep for node_package_joules
  • grep for kepler_process_package_joules
  • sum the values of process
  • see if the total is different to node_package
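The manual steps above can be scripted. The following is a minimal sketch that assumes the /metrics output has already been fetched into a string; the helper names `sum_metric` and `package_gap`, the sample values, and the `package`/`pid` labels are hypothetical and only illustrate the comparison.

```python
import re

# One Prometheus text-exposition sample line: metric name, optional {labels}, value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def sum_metric(metrics_text, name):
    """Sum the values of every series whose metric name equals `name`."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


def package_gap(metrics_text):
    """Node package joules minus the sum of per-process package joules."""
    node = sum_metric(metrics_text, "kepler_node_package_joules_total")
    procs = sum_metric(metrics_text, "kepler_process_package_joules_total")
    return node - procs


# Hypothetical scrape output; the labels and numbers are made up for illustration.
SAMPLE = """\
# TYPE kepler_node_package_joules_total counter
kepler_node_package_joules_total{package="0"} 1500.0
kepler_process_package_joules_total{pid="1"} 400.0
kepler_process_package_joules_total{pid="2"} 350.5
"""

if __name__ == "__main__":
    print("gap in joules:", package_gap(SAMPLE))
```

Running the comparison on successive scrapes shows whether the gap stays flat or keeps growing.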

Expected: there shouldn't be any significant difference.

Actual: the difference is quite large and grows over time.

Using Prometheus

  • kepler_node_package_joules_total{job="metal"} - on() sum(kepler_process_package_joules_total{job="metal"})

image

sthaha avatar Nov 06 '24 02:11 sthaha

Do you think this is related to #1833?

marvin-steinke avatar Nov 06 '24 08:11 marvin-steinke

@marvin-steinke, I don't think this is related. As for this bug, it turns out to be expected behaviour in kepler.

The explanation is that the kepler_node_package_joules_total counter tracks the joules accumulated since kepler started running, while kepler_process_package_joules_total only covers processes that are currently running (not terminated ones). It is therefore expected that kepler_node_package_joules_total > sum(kepler_process_package_joules_total).

So the right test is whether sum(rate(kepler_node_package_joules_total[30s])) == sum(rate(kepler_process_package_joules_total[30s])), i.e. whether the node's power in watts equals the watts allocated to processes. My tests show a round-off error, which can certainly be minimised.
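To make the counter-versus-rate argument concrete, here is a toy simulation. It is an idealised model, not kepler's implementation: the function `simulate`, the process names, and the joule values are all hypothetical. It assumes every joule is attributed to some process and that an exited process's series simply disappears.

```python
def simulate(intervals, exits):
    """Toy model of a node counter and per-process counters.

    intervals: list of {pid: joules used during that interval}
    exits:     {interval index: [pids that exit at the end of that interval]}
    Returns one (cumulative_gap, node_delta, process_delta) tuple per interval.
    """
    node_total = 0.0
    proc_totals = {}  # only holds processes that are still running
    history = []
    for i, usage in enumerate(intervals):
        prev_node = node_total
        prev_procs = dict(proc_totals)
        for pid, joules in usage.items():
            proc_totals[pid] = proc_totals.get(pid, 0.0) + joules
            node_total += joules
        # Cumulative comparison: node counter vs. sum of live process counters.
        gap = node_total - sum(proc_totals.values())
        # Per-interval increase, the analogue of rate() over one interval.
        node_delta = node_total - prev_node
        proc_delta = sum(
            total - prev_procs.get(pid, 0.0) for pid, total in proc_totals.items()
        )
        history.append((gap, node_delta, proc_delta))
        # Exited processes take their accumulated joules with them.
        for pid in exits.get(i, []):
            proc_totals.pop(pid, None)
    return history


if __name__ == "__main__":
    intervals = [{"a": 10, "b": 5}, {"a": 10, "b": 5}, {"a": 10, "c": 4}, {"a": 10}]
    exits = {1: ["b"], 2: ["c"]}
    for gap, node_delta, proc_delta in simulate(intervals, exits):
        print(f"gap={gap} node_delta={node_delta} proc_delta={proc_delta}")
```

In this model the cumulative gap grows each time a process exits, while the per-interval increases stay equal, which is why comparing rates is the more meaningful check.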

I see that sometimes (when there is a spike in power use) kepler fails to allocate the power usage to all running processes correctly, as shown in the screenshot below.

Image

The red line is rate(kepler_node_package_joules_total) and the yellow line is sum(rate(kepler_process_package_joules_total)). These lines are supposed to coincide, but they don't. In most cases they track each other pretty well. I need to investigate further why this happens.

Image

@rootfs any thoughts?

sthaha avatar Feb 20 '25 09:02 sthaha