kepler node_package does not equal total of kepler_process_package
Steps to reproduce on a bare-metal host:
- deploy kepler
- curl /metrics
- grep for node_package_joules
- grep for kepler_process_package_joules
- sum the kepler_process_package_joules values
- see if the total is different to node_package
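The manual steps above can be sketched as a small script. This is only an illustration: the sample text stands in for the output of curl on /metrics, and the label names (zone, pid) and all values are made up, not taken from a real kepler deployment.

```python
import re

# Matches "name{labels} value"; the label block is optional.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def sum_metric(metrics_text, name):
    """Sum every sample of `name` found in Prometheus text-format output."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m and m.group(1) == name:
            total += float(m.group(3))
    return total


# Hypothetical scrape output; labels and numbers are assumptions.
SAMPLE = """\
# TYPE kepler_node_package_joules_total counter
kepler_node_package_joules_total{zone="package-0"} 1500.0
# TYPE kepler_process_package_joules_total counter
kepler_process_package_joules_total{pid="1",zone="package-0"} 400.0
kepler_process_package_joules_total{pid="42",zone="package-0"} 350.5
"""

node = sum_metric(SAMPLE, "kepler_node_package_joules_total")
procs = sum_metric(SAMPLE, "kepler_process_package_joules_total")
gap = node - procs
print(f"node={node} J, processes={procs} J, gap={gap} J")
```

Running it against successive scrapes would show whether the gap grows over time.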
Expected: there shouldn't be any significant difference.
Actual: the difference is quite large and grows over time.
Using Prometheus
kepler_node_package_joules_total{job="metal"} - on() sum(kepler_process_package_joules_total{job="metal"})
Do you think this is related to #1833?
@marvin-steinke, I don't think this is related. As for this bug, it turns out this is expected behaviour from kepler.
The explanation is that the kepler_node_package_joules_total counter tracks the joules accumulated for as long as kepler has been running, while kepler_process_package_joules_total only tracks running processes (and not terminated ones). Thus node_package_joules_total > sum(kepler_process_package_joules_total) is expected.
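A toy simulation of that explanation, not kepler's actual code: it assumes every joule the node uses is attributed to some process, and that a process's counter disappears from the sum once the process exits. The pids and joule values are invented.

```python
def simulate(steps):
    """Return the gap (node counter minus sum of live process counters) per step."""
    node_total = 0.0
    running = {}  # pid -> cumulative joules, live processes only
    gaps = []
    for step in steps:
        # Energy used in this interval goes to both the node and the process.
        for pid, joules in step["usage"].items():
            running[pid] = running.get(pid, 0.0) + joules
            node_total += joules
        # Processes that exit at the end of the interval drop out of the sum.
        for pid in step["exited"]:
            running.pop(pid, None)
        gaps.append(node_total - sum(running.values()))
    return gaps


# Hypothetical workload: "b" and "c" terminate, "a" keeps running.
steps = [
    {"usage": {"a": 10.0, "b": 5.0}, "exited": []},
    {"usage": {"a": 10.0, "b": 5.0}, "exited": ["b"]},
    {"usage": {"a": 10.0, "c": 3.0}, "exited": []},
    {"usage": {"a": 10.0, "c": 3.0}, "exited": ["c"]},
]

gaps = simulate(steps)
print(gaps)
```

The gap never shrinks: each terminated process leaves its lifetime energy in the node counter but not in the process sum.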
So the right test is whether sum(rate(kepler_node_package_joules_total[30s])) == sum(rate(kepler_process_package_joules_total[30s])), i.e. whether the node's power in watts equals the watts allocated to processes. My tests show a round-off error, which can certainly be minimised.
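A rough sketch of that rate-based check, using two hypothetical scrapes taken 30 seconds apart. It is a simplification of what PromQL's rate() does (no counter-reset handling or extrapolation), and all series keys and values are made up.

```python
def watts(before, after, seconds):
    """Sum the per-second increase of every series present in both scrapes."""
    total = 0.0
    for key, new_value in after.items():
        if key in before:
            total += (new_value - before[key]) / seconds
    return total


WINDOW = 30.0  # seconds between the two scrapes

# Node counter per zone (hypothetical values, in joules).
node_before = {"package-0": 1000.0}
node_after = {"package-0": 1300.0}

# Process counters per pid; pid 77 is gone by the second scrape and is
# assumed to have used no energy inside the window.
proc_before = {"1": 200.0, "42": 100.0, "77": 500.0}
proc_after = {"1": 350.0, "42": 250.0}

node_w = watts(node_before, node_after, WINDOW)
proc_w = watts(proc_before, proc_after, WINDOW)
print(f"node={node_w} W, processes={proc_w} W, diff={node_w - proc_w} W")
```

Unlike the raw counters, the two rates agree here even though a process terminated, because the energy pid 77 used before the window never enters the calculation.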
I see that sometimes (when there is a spike in power use), kepler fails to allocate the power usage to all running processes correctly, as shown in the screenshot below.
The red line is rate(kepler_node_package_joules_total) and the yellow line is sum(rate(kepler_process_package_joules_total)). These lines are supposed to be the same, but they aren't, although in most cases they track each other pretty well. I need to investigate further why this is the case.
@rootfs any thoughts?