dial error: dial unix /tmp/estimator.sock: connect: no such file or directory
Describe the bug After rolling over the daemonset to the latest image on the quay.io registry (sha256:01a86339a8acb566ddcee848640ed4419ad0bffac98529e9b489a3dcb1e671f5), the message from the title is shown constantly. Example output of the problem:
2022/08/25 12:30:53 Kubelet Read: map[<pod-list-trimmed>]
2022/08/25 12:30:53 dial error: dial unix /tmp/estimator.sock: connect: no such file or directory
energy from pod (0 processes): name: <some-pod> namespace: <some-namespace>
Is the estimator.sock expected to be missing in the current state of the project?
Each node is reporting the same error. As a side note, since then the nodes have not been logging any new kepler metrics to Prometheus. I cannot say for certain that these are connected issues, and the missing metrics might be some other local problem, but there's that.
To Reproduce Steps to reproduce the behavior:
- Run kepler on OpenShift 4.11
- Check the kepler-exporter container logs for the presence of '/tmp/estimator.sock: connect: no such file or directory' (e.g., as sketched below)
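A minimal log check, assuming the daemonset is named kepler-exporter and runs in the monitoring namespace (both assumptions, matching the patch command quoted later in this thread):
# Assumes the kepler-exporter daemonset in the "monitoring" namespace; adjust to your deployment.
kubectl logs -n monitoring daemonset/kepler-exporter --tail=200 | grep 'estimator.sock'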
Expected behavior The /tmp/estimator.sock error is not reported.
Desktop (please complete the following information):
- OS: RedHat CoreOS 4.11
cc @sunya-ch
I just verified that rolling back to the previously working image (sha256:4ad0c2f56538c383f1b3a90ccc756fcf937f0a436fafb88a23b9a780164f7be9) gets the metric-gathering process working again, so I think it is not a local issue.
thank you for confirming this @Feelas!
The estimator socket feature is a work in progress. We'll test this out more thoroughly and keep you posted.
@Feelas the message dial unix /tmp/estimator.sock: connect: no such file or directory
is benign. The short story is that the estimator sidecar is not yet started (this is being worked on in #104 and the estimator repo). Upcoming PRs will start the estimator sidecar and create the socket.
Thanks for testing!
Thank you for confirming that :) Is it expected that the aforementioned version logs no metrics to Prometheus, and should we stick to the estimator-less version for now?
@Feelas The metrics are still logged.
But if you want to give it a spin, please run the previous (aka latest) kepler image and patch the deployment to kick off the estimator; that'll make the warning message go away.
kubectl patch -n monitoring daemonset kepler-exporter --patch-file https://raw.githubusercontent.com/sustainable-computing-io/kepler-estimator/main/deploy/patch.yaml
(based on this instruction)
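For reference, some kubectl versions accept only a local file for --patch-file, so it may be necessary to fetch the patch first; a sketch of the same step under that assumption:
# Fetch the patch locally (same URL as above), then apply it;
# --patch-file may not accept a URL depending on the kubectl version.
curl -sLO https://raw.githubusercontent.com/sustainable-computing-io/kepler-estimator/main/deploy/patch.yaml
kubectl patch -n monitoring daemonset kepler-exporter --patch-file patch.yaml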
Only the dynamic power will always be exported as 0 (not estimated). The hardware counters, cgroups, and pod package power are expected to be exported to Prometheus if they are available; otherwise, we have to investigate the issue.
I suggest using the updated version because the previous one collects the cgroup metrics in a wrong way, and the overflow issue is not handled for some metrics.
Please confirm the following points (see the verification sketch after this list):
- Pod metrics are reported in the kepler log with the detected pod name/namespace?
- Names of the Prometheus metrics:
- pod_energy_stat
- pod_<curr|total>_energy_in_<core|dram|uncore|gpu|other|pkg>_millijoule
Note: the Grafana dashboard should be updated.
- Prometheus address passed to the Kepler command (--address 0.0.0.0:9102)
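A quick way to check the last two points, assuming node-local access (or a port-forward) to the 9102 listener configured above:
# Scrape the exporter endpoint and look for the expected metric names;
# localhost:9102 is an assumption based on the --address flag above.
curl -s http://localhost:9102/metrics | grep -E 'pod_energy_stat|_energy_in_.*_millijoule'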
Hello sunya-ch and thanks for giving this some attention!
- Pod's metrics are correctly reported with expected pods from expected nodes
- I can see pod_energy_stat and other expected metrics in Prometheus, thus confirming that they are being sent
- Prometheus address is set to 0.0.0.0:9102
The "grafana dashboard should be updated" note is what it boils down to, I think. I can see that dashboards use "pod_cpu_energy_total", "pod_dram_energy_total" and "pod_energy_total" metrics (and others different from the list specified above), which I can also find in the Prometheus. Both the Grafana-defined names and the new ones can be found there, the new ones are being reported to Prometheus.
Is my understanding correct that there has been a metric name-change in the meantime and as so the Grafana dashboards found in grafana-dashboards are incompatible with the new metric names?
If that is so, thanks for getting this sorted out; I mean, thanks for helping flesh out the issue :)
@Feelas thanks for the detailed test! If you can submit a PR on the grafana name change, that'll be great.
Thank you so much for your kind testing. It's very good to see the expected behaviour there :)
And yes, your understanding is perfectly correct 👍
Good to hear then :)
Going back to the issue topic: the estimator.sock issue has been well explained (as WIP and expected), and I think we can close this so as not to keep the ticket open unnecessarily.
Sounds great @Feelas.
Let's keep this issue open till the dashboard and metrics are consistent.
I'm not going to commit right now, since I don't know whether there will be time to address this, but I will look into it.
Quick question: is there a direct replacement for the previous "pod_energy_total" & "pod_energy_current", which were a sum of cpu+dram? I see there are many more metrics exported right now, and summing them up inside the dashboard would be cumbersome.
Thank you for pointing this out.
The pkg energy will include cpu, dram, and uncore, which are reported by RAPL.
The pod package energy computed from the RAPL package power is pod_<curr|total>_energy_in_pkg_millijoule.
However, we might add another metric that is package energy + GPU energy + the other part (node - package - GPU, if node energy is available) to replace pod_energy_total. Currently, this value is reported as a value of pod_energy_stat. Should we separate it out as a new metric? By the way, I didn't add this metric in the first place because it could be an inconsistent metric between systems where node energy is available and systems that do not have it.
https://github.com/sustainable-computing-io/kepler/blob/267dd31b2ac953bee5ee8e88bf1541fab5afe34a/pkg/collector/collector.go#L285
So it seems that "pod_energy_total" & "pod_energy_current" are almost the same as "pod_<curr|total>_energy_in_pkg_millijoule", and these could theoretically be used as a replacement, correct?
I am not sure about the original purpose of pod_energy_total and pod_energy_current, but if they include only the energy from the package (mainly core+dram), my answer is yes.
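For a dashboard, a sketch of what the replacement query could look like, issued against the Prometheus HTTP API from the command line; the metric name follows the pattern above, while the pod_name label and the Prometheus URL are assumptions:
# Sum the pkg metric per pod as a stand-in for the old pod_energy_total;
# PROM_URL and the pod_name label are assumed placeholders.
PROM_URL=http://localhost:9090
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum by (pod_name) (pod_total_energy_in_pkg_millijoule)'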
Taking a look at commit de3584a9c3a754b17762b08e9494950f7c3b14d3, it seems to historically have been calculated as core+dram.
The dashboard picks up the latest metric names now.