Memory leak
What happened?
We're running Kepler on two OpenShift 4.17 / Kubernetes 1.30 clusters: a small cluster with 13 nodes and 550 running pods, and a larger cluster with 58 nodes and 3000 running pods.
Both clusters show the same behaviour: the memory usage of the Kepler pods is constantly increasing.
Here you can see it increasing for days, resetting only when the pods restart (small cluster, total memory usage of all Kepler pods):
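For anyone wanting to check this without our dashboards, a periodic spot-check of the per-pod numbers shows the same steady growth (the namespace and label selector are assumptions based on our Helm deployment; adjust to match yours):

```console
# Per-pod memory of the Kepler pods; repeat over a few days to see the
# growth (namespace and label selector are assumptions, adjust as needed).
$ kubectl top pods -n kepler -l app.kubernetes.io/name=kepler
```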
What did you expect to happen?
Memory usage to remain more or less stable, at least after some time.
How can we reproduce it (as minimally and precisely as possible)?
Deploy Kepler.
We have made only minimal adjustments to the Helm chart (v0.5.19). Here are our values:
values:
  image:
    repository: ${harbor_registry}/quay-proxy/sustainable_computing_io/kepler
  tolerations:
    # Tolerations for the various taints on each cluster
  service:
    port: 9103
  serviceMonitor:
    enabled: true
    # Some settings for Prometheus scraping the metrics
  securityContext:
    privileged: false
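For completeness, the install itself is just the stock chart with these values; something along these lines reproduces it (the chart repo URL, release name, and namespace are assumptions on my part, and the outer values: key should be stripped when passing the file directly):

```console
# Add the Kepler Helm repo and install chart v0.5.19 with the values above.
$ helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
$ helm install kepler kepler/kepler --version 0.5.19 \
    --namespace kepler --create-namespace \
    -f values.yaml
```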
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
$ kubectl version
Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.10
Cloud provider or bare metal
OS version
Host OS for the Kubernetes nodes is Red Hat Enterprise Linux CoreOS 417.94.202502251300-0
Install tools
Kepler deployment config
We have deployed using the Helm chart: a basic deployment, just the DaemonSet; see the values above.
On Kubernetes:
$ KEPLER_NAMESPACE=kepler
# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
<<NO CONFIGMAP>>
# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
<<NO KEPLER-EXPORTER>>
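Presumably the deployment lookup above comes up empty because the chart deploys Kepler as a DaemonSet, not a Deployment; something like this finds it instead (the resource name kepler is an assumption based on the chart's default naming):

```console
# Kepler runs as a DaemonSet (one pod per node), not a Deployment.
$ kubectl describe daemonset kepler -n ${KEPLER_NAMESPACE}
```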
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
After running for 3 weeks, memory usage has tripled.
@mdraijer Thank you for the report. We have been working on a rewrite of Kepler which is mostly feature-complete and works on bare metal. Its resource consumption is also comparatively low and stable in our tests.
You can find the releases on the https://github.com/sustainable-computing-io/kepler/releases page; those with the -reboot tag are listed at https://github.com/sustainable-computing-io/kepler/releases?q=reboot&expanded=true
Images are published to Quay: https://quay.io/repository/sustainable_computing_io/kepler-reboot?tab=tags
Could you please give this version a go?
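If it helps with testing, pointing the existing Helm deployment at the rewrite should only need an image override along these lines (the tag is a placeholder to fill in from the Quay page above; whether the current chart works unchanged with the reboot image is an assumption):

```yaml
# Hypothetical values override to pull the rewrite image instead;
# replace the tag with an actual -reboot release from Quay.
image:
  repository: quay.io/sustainable_computing_io/kepler-reboot
  tag: <reboot-release-tag>
```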
@mdraijer this is really great to see, thank you for sharing. It might solve the problem we have internally with running Kepler in some dev clusters: https://github.com/sustainable-computing-io/kepler/issues/2032
I will find time to try this change in the next couple of weeks and let you know 🤞
@nikimanoledaki @mdraijer, in case you didn't know, we also have the kepler-operator (https://github.com/sustainable-computing-io/kepler-operator), which lets you easily deploy and configure Kepler (reboot) through the power-monitor CRD.
You can also find our k8s manifests here: https://github.com/sustainable-computing-io/kepler/tree/reboot/manifests/k8s
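Assuming that directory contains a kustomization, the manifests can be applied straight from the branch with kubectl's built-in kustomize support, e.g.:

```console
# Apply the reboot manifests directly from the git branch (assumes a
# kustomization.yaml exists in manifests/k8s on the reboot branch).
$ kubectl apply -k 'https://github.com/sustainable-computing-io/kepler//manifests/k8s?ref=reboot'
```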