
Memory leak

Open mdraijer opened this issue 7 months ago • 4 comments

What happened?

We're running Kepler on two Openshift 4.17 / Kubernetes 1.30 clusters, one small cluster with 13 nodes and 550 running pods and one larger cluster with 58 nodes and 3000 running pods.

Both clusters show the same behaviour in memory usage of the Kepler pods: constantly increasing.

Here you can see memory usage increasing for days, resetting only when the pods are restarted (small cluster; total memory usage of all Kepler pods):

[Image: graph of total Kepler pod memory usage on the small cluster]
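For anyone wanting to reproduce a graph like this from the CLI, a rough sketch is to sample the pods' memory on an interval (the namespace and label selector here are assumptions and may differ per install; `kubectl top` also requires metrics-server):

```shell
# Sample current memory usage of the Kepler pods.
# Namespace and label selector are illustrative, not from this report.
kubectl top pods -n kepler -l app.kubernetes.io/name=kepler --containers

# Repeat on an interval to see the trend, e.g. once a minute:
watch -n 60 'kubectl top pods -n kepler -l app.kubernetes.io/name=kepler'
```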

What did you expect to happen?

Memory usage that is more or less stable, at least after some initial period.

How can we reproduce it (as minimally and precisely as possible)?

Deploy Kepler.

We have done minimal adjustments to the helm chart (v0.5.19). Here are our values:

  values:
    image:
      repository: ${harbor_registry}/quay-proxy/sustainable_computing_io/kepler
    tolerations:
      # Tolerations for the various taints on each cluster
    service:
      port: 9103
    serviceMonitor:
      enabled: true
      # Some settings for Prometheus scraping the metrics
    securityContext:
      privileged: false
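For reference, installing the chart with values like these follows the standard Helm flow. This is only a sketch: the chart repo URL, release name, namespace, and values file name are assumptions, not taken from this report.

```shell
# Sketch: install the Kepler Helm chart with a custom values file.
# Repo URL, release name, and namespace are illustrative assumptions.
helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
helm repo update
helm install kepler kepler/kepler \
  --namespace kepler --create-namespace \
  --values values.yaml
```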

Anything else we need to know?

No response

Kepler image tag

sustainable_computing_io/kepler:release-0.8.0

Kubernetes version

$ kubectl version
Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.10

Cloud provider or bare metal

Openshift 4.17.19

OS version

Host OS for the Kubernetes nodes is Red Hat Enterprise Linux CoreOS 417.94.202502251300-0

Install tools

Kepler deployment config

We have deployed using the Helm chart: a basic deployment, just the DaemonSet, with the values shown above.

On Kubernetes:

$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
<<NO CONFIGMAP>>

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
<<NO KEPLER-EXPORTER>>

For standalone:

put your Kepler command argument here

Container runtime (CRI) and version (if applicable)

CRI-O 1.30.10

Related plugins (CNI, CSI, ...) and versions (if applicable)

mdraijer avatar May 20 '25 08:05 mdraijer

After 3 weeks of running, memory usage has tripled.

mdraijer avatar Jun 02 '25 08:06 mdraijer

@mdraijer Thank you for the report. We have been working on a rewrite of Kepler which is mostly feature-complete and works on bare metal. Its resource consumption is also comparatively low and stable in our tests.

You can find the releases on the releases page, https://github.com/sustainable-computing-io/kepler/releases; look for the ones with the -reboot tag: https://github.com/sustainable-computing-io/kepler/releases?q=reboot&expanded=true

Images are published to Quay: https://quay.io/repository/sustainable_computing_io/kepler-reboot?tab=tags

Could you please give this version a go?
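One possible way to trial the reboot build without redoing the whole deployment would be to point the existing Helm release at the new image. This is only a sketch under assumptions: the release name and namespace are illustrative, `<reboot-tag>` is a placeholder for an actual tag from the Quay page above, and the rewrite may ultimately need its own chart or manifests rather than reusing the old one.

```shell
# Sketch: retarget the existing Helm release at a reboot image.
# <reboot-tag> is a placeholder; pick a real tag from
# https://quay.io/repository/sustainable_computing_io/kepler-reboot?tab=tags
helm upgrade kepler kepler/kepler -n kepler \
  --reuse-values \
  --set image.repository=quay.io/sustainable_computing_io/kepler-reboot \
  --set image.tag=<reboot-tag>
```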

sthaha avatar Jun 19 '25 04:06 sthaha

@mdraijer this is really great to see - thank you for sharing this. It might solve the problem we have internally with running Kepler in some dev clusters: https://github.com/sustainable-computing-io/kepler/issues/2032

I will find time to try this change in the next couple of weeks and let you know 🤞

nikimanoledaki avatar Jun 25 '25 18:06 nikimanoledaki

@nikimanoledaki @mdraijer, in case you didn't know: we also have kepler-operator (https://github.com/sustainable-computing-io/kepler-operator), which lets you easily deploy and configure Kepler (reboot) through the power-monitor CRD.

You can also find our k8s manifests here: https://github.com/sustainable-computing-io/kepler/tree/reboot/manifests/k8s

sthaha avatar Jun 25 '25 23:06 sthaha