
dial error: dial unix /tmp/estimator.sock: connect: no such file or directory

Open Feelas opened this issue 2 years ago • 13 comments

Describe the bug After rolling the daemonset over to the latest image on the quay.io registry (sha256:01a86339a8acb566ddcee848640ed4419ad0bffac98529e9b489a3dcb1e671f5), the message from the title is shown constantly. Example output of the problem:

2022/08/25 12:30:53 Kubelet Read: map[<pod-list-trimmed>]
2022/08/25 12:30:53 dial error: dial unix /tmp/estimator.sock: connect: no such file or directory
energy from pod (0 processes): name: <some-pod> namespace: <some-namespace>

Is estimator.sock expected to be missing in the current state of the project?

Each node is reporting the same error. As a side note, the nodes have not logged any new kepler metrics to Prometheus since then. I can't say for certain that these issues are connected, and the missing metrics might be some other local problem, but there it is.

To Reproduce Steps to reproduce the behavior:

  1. Run kepler on OpenShift 4.11
  2. Check kepler-exporter container logs for presence of '/tmp/estimator.sock: connect: no such file or directory'

Expected behavior The /tmp/estimator.sock error is not reported.

Desktop (please complete the following information):

  • OS: RedHat CoreOS 4.11

Feelas avatar Aug 25 '22 12:08 Feelas

cc @sunya-ch

rootfs avatar Aug 25 '22 13:08 rootfs

I just verified that rolling back to the previously working image (sha256:4ad0c2f56538c383f1b3a90ccc756fcf937f0a436fafb88a23b9a780164f7be9) gets the metric-gathering process working again, so I don't think it is a local issue.

Feelas avatar Aug 25 '22 13:08 Feelas

thank you for confirming this @Feelas!

The estimator socket feature is a work in progress. We'll test this out more thoroughly and keep you posted.

rootfs avatar Aug 25 '22 13:08 rootfs

@Feelas the message dial unix /tmp/estimator.sock: connect: no such file or directory is benign. The short story is that the estimator sidecar is not started yet (this is being worked on in #104 and the estimator repo). Upcoming PRs will start the estimator sidecar and create the socket.

Thanks for testing!

rootfs avatar Aug 25 '22 13:08 rootfs

Thank you for confirming that :) Is it expected that the aforementioned version logs no metrics to Prometheus, and should we stick to the estimator-less version for now?

Feelas avatar Aug 25 '22 13:08 Feelas

@Feelas The metrics are still logged.

But if you want to give it a spin, please run the previous (aka latest) kepler image and patch the daemonset to kick off the estimator; that will make the warning message go away.

curl -LO https://raw.githubusercontent.com/sustainable-computing-io/kepler-estimator/main/deploy/patch.yaml
kubectl patch -n monitoring daemonset kepler-exporter --patch-file patch.yaml

(based on this instruction; note that kubectl patch --patch-file expects a local file, so the patch is downloaded first)

rootfs avatar Aug 25 '22 13:08 rootfs

Thank you for confirming that :) Is it expected that the aforementioned version logs no metrics to Prometheus, and should we stick to the estimator-less version for now?

Only the dynamic power will always be exported as 0 (not estimated). The hardware counters, cgroup metrics, and pod package power are expected to be exported to Prometheus if they are available; otherwise, we would have to investigate the issue.

I suggest using the updated version, because the previous one collects the cgroup metric in a wrong way and the overflow issue is not handled for some metrics.

Please confirm the following points:

  • Are pod metrics reported in the kepler log with the detected pod name/namespace?
  • Names of the Prometheus metrics:
    • pod_energy_stat
    • pod_<curr|total>_energy_in_<core|dram|uncore|gpu|other|pkg>_millijoule

    Note: the grafana dashboard should be updated.

  • Is the Prometheus address passed to the Kepler command (--address 0.0.0.0:9102)?
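
The metric-name check above can be done from a node with a quick shell probe. This is only a sketch: the localhost address assumes you run it where the exporter is listening on the --address port 9102, and the regex is built from the metric names listed in this thread.

```shell
# Scrape the Kepler exporter directly and keep only the metric names
# discussed here (pod_energy_stat and the per-component
# pod_<curr|total>_energy_in_*_millijoule series).
curl -s http://localhost:9102/metrics \
  | grep -E '^pod_energy_stat|^pod_(curr|total)_energy_in_(core|dram|uncore|gpu|other|pkg)_millijoule'
```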

sunya-ch avatar Aug 25 '22 13:08 sunya-ch

Hello sunya-ch and thanks for giving this some attention!

  • Pod metrics are correctly reported, with the expected pods from the expected nodes
  • I can see pod_energy_stat and other expected metrics in Prometheus, thus confirming that they are being sent
  • Prometheus address is set to 0.0.0.0:9102

The "grafana dashboard should be updated" note is what it boils down to, I think. I can see that the dashboards use the "pod_cpu_energy_total", "pod_dram_energy_total" and "pod_energy_total" metrics (and others different from the list above), which I can also find in Prometheus. Both the Grafana-defined names and the new ones are present there, and the new ones are being reported to Prometheus.

Is my understanding correct that there has been a metric name change in the meantime, and so the Grafana dashboards found in grafana-dashboards are incompatible with the new metric names?

If that is so, thanks for helping flesh out the issue :)

Feelas avatar Aug 25 '22 14:08 Feelas

@Feelas thanks for the detailed test! If you can submit a PR on the grafana name change, that'll be great.

rootfs avatar Aug 25 '22 15:08 rootfs

Hello sunya-ch and thanks for giving this some attention!

  • Pod metrics are correctly reported, with the expected pods from the expected nodes
  • I can see pod_energy_stat and other expected metrics in Prometheus, thus confirming that they are being sent
  • Prometheus address is set to 0.0.0.0:9102

The "grafana dashboard should be updated" note is what it boils down to, I think. I can see that the dashboards use the "pod_cpu_energy_total", "pod_dram_energy_total" and "pod_energy_total" metrics (and others different from the list above), which I can also find in Prometheus. Both the Grafana-defined names and the new ones are present there, and the new ones are being reported to Prometheus.

Is my understanding correct that there has been a metric name change in the meantime, and so the Grafana dashboards found in grafana-dashboards are incompatible with the new metric names?

If that is so, thanks for helping flesh out the issue :)

Thank you so much for your kind testing. It's very good to see the expected behaviour there :)

And yes, your understanding is perfectly correct 👍

sunya-ch avatar Aug 25 '22 15:08 sunya-ch

Good to hear then :)

Going back to the issue topic: the estimator.sock error has been well explained (as WIP and expected), and I think we can close this so the ticket isn't kept open unnecessarily.

Feelas avatar Aug 26 '22 09:08 Feelas

sounds great @Feelas

rootfs avatar Aug 26 '22 11:08 rootfs

Let's keep this issue open until the dashboard and metrics are consistent.

rootfs avatar Aug 26 '22 14:08 rootfs

I'm not going to commit right now, since I don't know whether there will be time to address this, but I will look into it.

Quick question: is there a direct replacement for the previous "pod_energy_total" & "pod_energy_current", which were a sum of cpu+dram? I see there are many more metrics exported now, and summing them up inside the dashboard would be cumbersome.

Feelas avatar Aug 29 '22 08:08 Feelas

I'm not going to commit right now, since I don't know whether there will be time to address this, but I will look into it.

Quick question: is there a direct replacement for the previous "pod_energy_total" & "pod_energy_current", which were a sum of cpu+dram? I see there are many more metrics exported now, and summing them up inside the dashboard would be cumbersome.

Thank you for pointing this out. The pkg energy will include cpu, dram, and uncore, which are reported by RAPL. The pod package energy computed from the RAPL package power is pod_<curr|total>_energy_in_pkg_millijoule.

However, we might add another metric that is package energy + GPU energy + the other part (node - package - GPU, if node energy is available) to replace pod_energy_total. Currently, this value is reported as a value of pod_energy_stat. Should we separate it out as a new metric? By the way, I didn't add this metric in the first place because it could be inconsistent between systems where node energy is available and systems where it is not.

https://github.com/sustainable-computing-io/kepler/blob/267dd31b2ac953bee5ee8e88bf1541fab5afe34a/pkg/collector/collector.go#L285
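
A dashboard panel could consume the pkg metric above directly. The following PromQL is only a sketch: the pod_name label is an assumption, and it assumes the _total_ variant behaves as a cumulative counter, so rate over the millijoule counter yields approximate milliwatts per pod.

```
sum by (pod_name) (rate(pod_total_energy_in_pkg_millijoule[5m]))
```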

sunya-ch avatar Aug 29 '22 10:08 sunya-ch

So it seems that "pod_energy_total" & "pod_energy_current" are almost the same as "pod_<curr|total>_energy_in_pkg_millijoule", and the latter could theoretically be used as a replacement, correct?

Feelas avatar Aug 29 '22 12:08 Feelas

So it seems that "pod_energy_total" & "pod_energy_current" are almost the same as "pod_<curr|total>_energy_in_pkg_millijoule", and the latter could theoretically be used as a replacement, correct?

I am not sure about the original purpose of pod_energy_total and pod_energy_current, but if they include only the energy from the package (mainly core+dram), my answer is yes.

sunya-ch avatar Aug 29 '22 12:08 sunya-ch

Taking a look at commit de3584a9c3a754b17762b08e9494950f7c3b14d3, it seems to historically have been calculated as core+dram.
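
Given that the old names were a core+dram sum, one stopgap until the dashboards are updated could be Prometheus recording rules that rebuild the old names from the renamed metrics. This is only a sketch, assuming the metric names quoted in this thread; the group name is made up.

```yaml
groups:
  - name: kepler-dashboard-compat   # hypothetical group name
    rules:
      # Rebuild the old dashboard metrics from the renamed per-component ones.
      - record: pod_energy_total
        expr: pod_total_energy_in_core_millijoule + pod_total_energy_in_dram_millijoule
      - record: pod_energy_current
        expr: pod_curr_energy_in_core_millijoule + pod_curr_energy_in_dram_millijoule
```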

Feelas avatar Aug 29 '22 12:08 Feelas

The dashboard picks up the latest metric names now.

rootfs avatar Aug 29 '22 19:08 rootfs