kepler node_package does not equal total of kepler_process_package

Steps to reproduce on bare metal

Deploy Kepler
curl /metrics
grep for kepler_node_package_joules_total
grep for kepler_process_package_joules_total
Sum the per-process values
Check whether the total differs from the node_package value
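The steps above can be sketched in Python as a rough check. This is only an illustration: the `fetch_metrics` helper, the URL you pass to it, and the label names and values in `SAMPLE` are assumptions made up for the example, not Kepler's actual output.

```python
import re
import urllib.request

# Matches a sample line such as: metric_name{label="x"} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def fetch_metrics(url):
    """Fetch the raw /metrics text. The URL is whatever your deployment exposes."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def sum_metric(metrics_text, name):
    """Sum every sample of metric `name` in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m and m.group(1) == name:
            total += float(m.group(3))
    return total


def package_gap(metrics_text):
    """Node package joules minus the sum of per-process package joules."""
    node = sum_metric(metrics_text, "kepler_node_package_joules_total")
    procs = sum_metric(metrics_text, "kepler_process_package_joules_total")
    return node - procs


# Illustrative sample only: label names and values are invented for this sketch.
SAMPLE = """\
# TYPE kepler_node_package_joules_total counter
kepler_node_package_joules_total{package="0"} 60.0
kepler_node_package_joules_total{package="1"} 40.0
# TYPE kepler_process_package_joules_total counter
kepler_process_package_joules_total{pid="101",command="foo"} 30.0
kepler_process_package_joules_total{pid="102",command="bar"} 25.5
"""

print(package_gap(SAMPLE))
```

Against a live deployment, `package_gap(fetch_metrics(url))` would give the same difference as the manual grep-and-sum.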

Expected: there shouldn't be any significant difference.
Actual: the difference is quite large and grows over time.

Using Prometheus

kepler_node_package_joules_total{job="metal"} - on() sum(kepler_process_package_joules_total{job="metal"})

Nov 06 '24 02:11 sthaha

Do you think this is related to #1833?

Nov 06 '24 08:11 marvin-steinke

@marvin-steinke, I don't think this is related. As for this bug, it turns out that this is expected behaviour from Kepler.

The explanation is that the kepler_node_package_joules_total counter accumulates joules for as long as Kepler has been running, while kepler_process_package_joules_total only covers processes that are currently running (not terminated ones). Energy attributed to a process drops out of the sum when that process exits, so it is expected that node_package_joules_total > sum(kepler_process_package_joules_total).

So the right test is whether sum(rate(kepler_node_package_joules_total[30s])) == sum(rate(kepler_process_package_joules_total[30s])), i.e. whether the node's power in watts equals the watts allocated to processes. My tests show a rounding error, which can certainly be minimised.
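A small Python sketch of why the counters diverge while the rates still agree. All numbers are made up for illustration, and `rate` here is a plain difference over the interval, a simplification of what PromQL's rate() does (no extrapolation, no counter-reset handling).

```python
def rate(prev, curr, interval_s):
    """Average increase per second between two counter samples (watts for joule counters)."""
    return (curr - prev) / interval_s


def process_power(prev, curr, interval_s):
    """Sum of per-process rates over processes present in both scrapes."""
    return sum(rate(prev[pid], curr[pid], interval_s) for pid in curr if pid in prev)


INTERVAL_S = 30

# Scrape at t0: three processes, their joules add up to the node counter.
node_t0 = 1000.0
procs_t0 = {1: 400.0, 2: 300.0, 3: 300.0}

# Process 3 exits right after t0; over the next 30 s the node uses 600 J,
# all of it attributed to processes 1 and 2.
node_t30 = 1600.0
procs_t30 = {1: 760.0, 2: 540.0}

# Cumulative counters: the 300 J of the exited process is gone from the sum.
counter_gap = node_t30 - sum(procs_t30.values())

# Rates: node power and summed process power agree.
node_watts = rate(node_t0, node_t30, INTERVAL_S)
proc_watts = process_power(procs_t0, procs_t30, INTERVAL_S)

print(counter_gap, node_watts, proc_watts)
```

In this toy scenario the counter gap equals exactly the energy of the terminated process, and every further process exit would widen it, whereas the rate comparison is unaffected.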

I see that sometimes (when there is a spike in power use) Kepler fails to allocate the power usage to all running processes correctly, as shown in the screenshot below.

The red line is rate(kepler_node_package_joules_total) and the yellow line is sum(rate(kepler_process_package_joules_total)). These lines are supposed to coincide, but they don't always; in most cases the yellow line tracks the red one pretty well. I need to investigate further why this happens.

@rootfs any thoughts?

Feb 20 '25 09:02 sthaha
